I’ve developed a Graph RAG (Retrieval-Augmented Generation) pipeline that performs reasoning over a knowledge graph. Given a user query, the pipeline retrieves relevant nodes and relationships in the form of graph triples like:
node1 - [relationship] -> node2
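For concreteness, the retrieval output for one query might look roughly like this (the query and triples below are purely illustrative):

# purely illustrative example of what the pipeline returns for one query
query = "Which movies did Christopher Nolan direct?"
retrieved_triples = [  # ranked, most relevant first
    ("Christopher_Nolan", "directed", "Inception"),
    ("Inception", "released_in", "2010"),
]

def triple_to_string(head, rel, tail):
    # render a triple in the node1 - [relationship] -> node2 form shown above
    return f"{head} - [{rel}] -> {tail}"

print(triple_to_string(*retrieved_triples[0]))  # Christopher_Nolan - [directed] -> Inception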
I want to evaluate the quality of the pipeline’s output using supervised metrics such as precision@k, MRR, and nDCG. I’m currently considering two approaches:
- Node-Based Evaluation: For each query, I define a set of desired nodes (those containing the correct answer) and compare the pipeline’s retrieved nodes against this set to compute precision@k, MRR, and nDCG. (A rough sketch of this scoring is included after the code example below.)
- Text-Based Evaluation Using LLM: Here, I plan to convert the graph triples into textual sentences (e.g., with an LLM), embed both the retrieved sentences and the desired answers with a sentence-embedding model such as all-MiniLM-L6-v2, and compare them via cosine similarity of the embeddings. For instance:
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def get_embeddings(texts):
    return model.encode(texts)  # dense sentence embeddings, one row per text

def compute_similarity(text1, text2):
    emb = get_embeddings([text1, text2])
    return np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))

def precision_at_k(retrieved_texts, relevant_texts, k):
    retrieved_at_k = retrieved_texts[:k]
    # each retrieved text is scored against its best-matching desired text
    scores = [max(compute_similarity(t, rel) for rel in relevant_texts) for t in retrieved_at_k]
    hits = [1 if s >= 0.5 else 0 for s in scores]  # similarity threshold as a relevance proxy
    return sum(hits) / k  # precision@k: fraction of the top-k judged relevant
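For the node-based approach, I would score the ranked list of retrieved node IDs directly against the gold node set. A minimal sketch with binary relevance (the function names are my own, just for illustration):

import numpy as np

def node_precision_at_k(retrieved_nodes, relevant_nodes, k):
    # fraction of the top-k retrieved node IDs that are in the gold set
    return sum(n in relevant_nodes for n in retrieved_nodes[:k]) / k

def reciprocal_rank(retrieved_nodes, relevant_nodes):
    # 1/rank of the first relevant node; MRR is the mean of this over all queries
    for rank, n in enumerate(retrieved_nodes, start=1):
        if n in relevant_nodes:
            return 1.0 / rank
    return 0.0

def node_ndcg_at_k(retrieved_nodes, relevant_nodes, k):
    # binary-relevance nDCG: gain 1 for gold nodes, discounted by log2(rank + 1)
    gains = [1.0 if n in relevant_nodes else 0.0 for n in retrieved_nodes[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(k, len(relevant_nodes))))
    return dcg / ideal if ideal > 0 else 0.0

If some desired nodes matter more than others, the binary gains in node_ndcg_at_k could be replaced with graded relevance scores.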
Which of these approaches would be more appropriate for evaluating the RAG pipeline, considering that the answers involve complex relationships? Also, if there’s a better method or metric for this scenario, I’d be glad to learn about it.
Thanks in advance for your insights!