I am planning to use an LLM (say Llama 3) to generate labeled training data via a prompt, and then fine-tune a smaller model with a CLS token to try to match the accuracy of the LLM. Suppose that I can run the prompt on 1M+ examples (although I suspect I won't need that many).
Prompt: Does the following sentence contain apples or oranges:
Examples:
"<prompt> apples, oranges" -> apples, oranges
"<prompt> apple, orange" -> apples, oranges
"<prompt> apples, no oranges" -> apples
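To make the targets concrete, this is roughly how I would turn the LLM's answers into labels for the smaller model. It is only a sketch: `LABELS` and `to_multi_hot` are names I made up for illustration, and it assumes the LLM always answers with a comma-separated list drawn from those two labels.

```python
# Assumed label set, taken from the examples above.
LABELS = ["apples", "oranges"]


def to_multi_hot(llm_output: str) -> list[int]:
    """Turn a comma-separated LLM answer into a multi-hot vector over LABELS."""
    # Normalise each part so "Apples" and " apples " are treated the same.
    found = {part.strip().lower() for part in llm_output.split(",")}
    return [1 if label in found else 0 for label in LABELS]


print(to_multi_hot("apples, oranges"))  # [1, 1]
print(to_multi_hot("apples"))           # [1, 0]
```

Since a sentence can contain both, I am treating this as multi-label rather than single-label classification.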
So my questions are:
- The last CLS-type model I have seen is Microsoft's XtremeDistil. Are these models still being used? If so, what is the latest and greatest?
- Would it be better to use a sentence transformer and do classification?
- In the training set for the student model, I will remove the prompt shown above. Is there any risk in doing this?
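For the last question, this is roughly what I mean by removing the prompt. It is a sketch under my own assumptions: `strip_prompt` and `build_student_pairs` are hypothetical helpers, and I am assuming each teacher input is the prompt followed directly by the sentence.

```python
# The teacher prompt from above; the student never sees it.
PROMPT = "Does the following sentence contain apples or oranges:"


def strip_prompt(teacher_input: str, prompt: str = PROMPT) -> str:
    """Return only the sentence, dropping the leading prompt if present."""
    if teacher_input.startswith(prompt):
        return teacher_input[len(prompt):].strip()
    return teacher_input.strip()


def build_student_pairs(teacher_rows):
    """Map (teacher_input, llm_answer) rows to (sentence, llm_answer) pairs."""
    return [(strip_prompt(text), answer) for text, answer in teacher_rows]


rows = [(PROMPT + " apples, no oranges", "apples")]
print(build_student_pairs(rows))  # [('apples, no oranges', 'apples')]
```

So the student would be trained on the bare sentence with the LLM's answer as the label, and the task definition would live only in the labels rather than in the input text.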
The way I see it, these models are smaller and will cost far less to run in the long term. I would appreciate any thoughts in general.