I am learning the basics of LLM evaluation. The framework is to first generate sample question/answer pairs, and then compare the predictions returned by the model under evaluation against those "correct answers".
When using LangChain's QAGenerationChain to generate the sample question/answer pairs, should we use a model that is different from the one we want to evaluate?
Otherwise, will the evaluation results be biased, since the same model is producing the "correct answers"?
If we do need a different model for the generation step, how should we choose it?
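For context, here is roughly the setup I have in mind. This is only a minimal sketch assuming OpenAI chat models; the model names, the grader choice, and `source_text` are placeholders, not a recommendation:

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import QAGenerationChain
from langchain.evaluation.qa import QAEvalChain

source_text = "..."  # placeholder: the document the QA pairs are generated from

# Model A: generates the "correct" question/answer pairs from the source text.
generator_llm = ChatOpenAI(model_name="gpt-4", temperature=0)
qa_gen_chain = QAGenerationChain.from_llm(generator_llm)
examples = qa_gen_chain.run(source_text)  # list of {"question": ..., "answer": ...}

# Model B: the model under evaluation answers the generated questions.
model_under_test = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
predictions = [
    {"result": model_under_test.predict(ex["question"])} for ex in examples
]

# Grader model: compares each prediction against the generated answer.
# (Should this also be a different model from A and B?)
grader_llm = ChatOpenAI(model_name="gpt-4", temperature=0)
eval_chain = QAEvalChain.from_llm(grader_llm)
graded = eval_chain.evaluate(
    examples,
    predictions,
    question_key="question",
    answer_key="answer",
    prediction_key="result",
)
print(graded)
```

My question is essentially about whether `generator_llm` (and possibly `grader_llm`) should be a different model from `model_under_test`, and if so, how to pick it.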