I’m currently working on a multi-label classification project where phrases like “amazing support by the team” are classified into categories such as “support” and “team”. I have already trained a model for this task.
I’m looking for advice on how best to evaluate the model’s performance using LangSmith. Specifically, I want to implement a scoring system where:
A partial match (e.g., identifying “support” but not “team”) scores a certain value,
A perfect match (e.g., identifying both “support” and “team”) scores 1,
An irrelevant or incorrect classification scores a different specified value.
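To make the scoring concrete, here is a minimal sketch of the rule in plain Python. It does not use LangSmith at all. The function name `score_labels` and the default values `partial_score=0.5` and `incorrect_score=0.0` are placeholders of my own, since I have not fixed the actual values yet. It also assumes that any overlap between the predicted and expected labels counts as a partial match, even if the prediction includes extra wrong labels.

```python
from typing import Iterable


def score_labels(
    predicted: Iterable[str],
    expected: Iterable[str],
    partial_score: float = 0.5,    # placeholder value, not decided yet
    incorrect_score: float = 0.0,  # placeholder value, not decided yet
) -> float:
    """Score a multi-label prediction against the expected label set."""
    # Normalize so that "Support " and "support" compare as equal.
    predicted_set = {label.strip().lower() for label in predicted}
    expected_set = {label.strip().lower() for label in expected}

    # Perfect match: exactly the same set of labels.
    if predicted_set == expected_set:
        return 1.0

    # Partial match: at least one expected label was predicted.
    if predicted_set & expected_set:
        return partial_score

    # No overlap: irrelevant or incorrect classification.
    return incorrect_score


# "amazing support by the team" is expected to get both labels.
expected = ["support", "team"]
print(score_labels(["support", "team"], expected))  # perfect match
print(score_labels(["support"], expected))          # partial match
print(score_labels(["pricing"], expected))          # incorrect
```

My assumption is that logic like this could be wrapped in a custom evaluator, but I have not confirmed how that would look in LangSmith.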
Does LangSmith provide built-in evaluators that can handle this type of scoring? If not, what is the recommended approach to customizing the evaluation metrics to fit these criteria?
Thanks for your insights!
I tried the cot_qa evaluator, but it does not give the desired score for partial matches.