I want to compare two LLM models that answer the same question and verify whether each answer is correct. It's hard to determine with certainty whether an answer is truly correct because the models might use synonyms, so directly comparing their responses to the reference answer is complicated. Which metric would be best for checking this?
So far I have considered the following metrics: BLEU, cosine similarity, and F1-score.
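To make the comparison concrete, here is a minimal sketch of how the three candidate metrics could be computed on a single (prediction, reference) pair. It assumes `nltk` and `sentence-transformers` are installed, and the embedding model name (`all-MiniLM-L6-v2`) is only an example choice, not a recommendation.

```python
from collections import Counter

import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer

reference = "The capital of France is Paris"
prediction = "Paris is the capital city of France"

ref_tokens = reference.lower().split()
pred_tokens = prediction.lower().split()

# BLEU: n-gram overlap with the reference; smoothing avoids zero scores
# on short answers that lack higher-order n-gram matches.
bleu = sentence_bleu(
    [ref_tokens], pred_tokens,
    smoothing_function=SmoothingFunction().method1,
)

# Token-level F1 (SQuAD-style): overlap of the two token multisets.
common = Counter(ref_tokens) & Counter(pred_tokens)
overlap = sum(common.values())
precision = overlap / len(pred_tokens)
recall = overlap / len(ref_tokens)
f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0

# Cosine similarity of sentence embeddings: more tolerant of synonyms
# and paraphrases than surface-level overlap metrics.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([reference, prediction])
cosine = float(
    np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
)

print(f"BLEU: {bleu:.3f}  F1: {f1:.3f}  Cosine: {cosine:.3f}")
```

In this example the surface-overlap metrics (BLEU, token F1) penalize the paraphrased wording, while the embedding-based cosine similarity stays high, which is exactly the synonym problem I am worried about.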