RAG Evaluation: Beyond Precision/Recall

"How do I know if my RAG is working?" — Precision/Recall aren't enough. You need to measure Faithfulness, Relevance, and Context Recall to see the real quality.
Why Traditional Metrics Fall Short
Traditional IR (Information Retrieval) metrics such as Precision@k, Recall@k, and MRR score only the retrieval step: did the right documents come back?
Problem: they cannot distinguish between good retrieval with a bad answer and mediocre retrieval with a good answer (see the sketch below).
Case 1: Good Retrieval, Bad Answer — 3 relevant documents retrieved (high Precision), but the LLM distorts their content in the answer (Hallucination)
Case 2: Mediocre Retrieval, Good Answer — only 1 relevant document retrieved (low Precision), but that single document enabled an accurate answer
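To make the blind spot concrete, here is a minimal sketch of these retrieval-only metrics; the document IDs and relevance judgments are made up for illustration:

```python
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    # Fraction of the top-k retrieved documents that are relevant
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / k if k else 0.0

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    # Fraction of all relevant documents that appear in the top-k
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / len(relevant) if relevant else 0.0

# Both cases above can score identically here, because nothing in these
# formulas ever looks at the generated answer.
retrieved = ["doc_1", "doc_7", "doc_3"]
relevant = {"doc_1", "doc_3"}
print(precision_at_k(retrieved, relevant, k=3))  # ≈ 0.67
print(recall_at_k(retrieved, relevant, k=3))     # 1.0
```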
The Three Axes of RAG Evaluation
RAG systems should be evaluated on three axes:
Query → Retrieval → Generation
- Retrieval stage → Context Quality (Context Recall, Context Precision)
- Generation stage → Answer Quality (Faithfulness, Answer Relevance)
1. Context Quality
How well do retrieved documents match the question?
- Context Recall: Was necessary information retrieved?
- Context Precision: What fraction of retrieved docs are actually useful?
2. Answer Quality
How good is the generated answer?
- Faithfulness: Is the answer grounded in retrieved documents? (Hallucination check)
- Answer Relevance: Does the answer address the question?
3. End-to-End Quality
Final quality of the entire pipeline
- Answer Correctness: Is the answer actually correct? (Requires ground truth)
Core Metrics Deep Dive
1. Faithfulness
Are all claims in the answer supported by retrieved documents?
```python
from typing import List

def compute_faithfulness(answer: str, contexts: List[str]) -> float:
    """
    1. Extract individual claims from the answer
    2. Check whether each claim is supported by the contexts
    3. Return the ratio of supported claims
    """
    claims = extract_claims(answer)  # helper assumed; see the sketch below
    supported = 0
    for claim in claims:
        if is_supported_by_context(claim, contexts):  # helper assumed
            supported += 1
    return supported / len(claims) if claims else 0.0
```

Example:
Context: "Tesla cut prices by up to 20% on January 13, 2023."
Answer: "Tesla cut prices by 20% in January 2023, which caused competitors to lower their prices too."
Claims:
- "Tesla cut prices in January 2023" → Supported ✓
- "Prices cut by 20%" → Supported ✓
- "Competitors lowered prices" → Not in context ✗
Faithfulness = 2/3 = 0.67
Why it matters: Low Faithfulness = Hallucination risk
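The two helpers in the pseudocode are left abstract. A minimal, self-contained sketch using a naive lexical baseline (production systems typically delegate the support check to an NLI model or an LLM judge, as in the LLM-as-Judge section below):

```python
import re
from typing import List

def extract_claims(answer: str) -> List[str]:
    # Naive baseline: treat each sentence as a single claim. An LLM is
    # usually used instead, to split compound sentences into atomic claims.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]

def is_supported_by_context(claim: str, contexts: List[str]) -> bool:
    # Naive baseline: lexical overlap between the claim and any context.
    # The 0.6 threshold is arbitrary and only for illustration.
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    if not claim_tokens:
        return False
    for ctx in contexts:
        ctx_tokens = set(re.findall(r"\w+", ctx.lower()))
        if len(claim_tokens & ctx_tokens) / len(claim_tokens) >= 0.6:
            return True
    return False
```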
2. Answer Relevance
Does the answer actually answer the question?
```python
import numpy as np

def compute_answer_relevance(question: str, answer: str) -> float:
    """
    1. Generate candidate questions from the answer alone
    2. Measure similarity between the original and generated questions
    """
    # Guess the question from just the answer
    generated_questions = generate_questions_from_answer(answer, n=3)  # helper assumed; see below
    # Similarity to the original question
    similarities = [
        semantic_similarity(question, gen_q)  # helper assumed; see below
        for gen_q in generated_questions
    ]
    return float(np.mean(similarities)) if similarities else 0.0
```

Example:
Question: "Who is Tesla's CEO?"
Answer: "Tesla is an electric vehicle company."
Generated Questions from Answer:
- "What kind of company is Tesla?"
- "What are electric vehicle companies?"
Low similarity to original → Low Answer Relevance
Why it matters: Detects when the LLM ignored the question or gave a tangential answer
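Again, the two helpers are assumptions. One way to realize them, sketched with sentence-transformers for similarity and a hypothetical `llm_complete(prompt) -> str` wrapper around your LLM for question generation:

```python
from typing import List
from sentence_transformers import SentenceTransformer, util

# Small general-purpose embedding model; any sentence embedding model works
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(text_a: str, text_b: str) -> float:
    # Cosine similarity between the two sentence embeddings
    embeddings = _embedder.encode([text_a, text_b], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

def generate_questions_from_answer(answer: str, n: int = 3) -> List[str]:
    # Ask an LLM to reverse-engineer questions that the answer would satisfy.
    # llm_complete is a hypothetical helper, not part of the original article.
    prompt = (
        f"Write {n} different questions that the following answer would "
        f"directly answer, one question per line:\n\n{answer}"
    )
    lines = llm_complete(prompt).splitlines()
    return [line.lstrip("0123456789.-) ").strip() for line in lines if line.strip()][:n]
```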
3. Context Recall
Is the information needed for the answer present in retrieved documents?
```python
from typing import List

def compute_context_recall(
    ground_truth: str,
    contexts: List[str],
) -> float:
    """
    1. Extract key statements from the ground-truth answer
    2. Check whether each statement is supported by the retrieved contexts
    """
    gt_statements = extract_statements(ground_truth)  # helper assumed; see below
    attributed = 0
    for statement in gt_statements:
        if any(supports(ctx, statement) for ctx in contexts):  # helper assumed
            attributed += 1
    return attributed / len(gt_statements) if gt_statements else 0.0
```

Example:
Ground Truth: "Sam Altman was fired on November 17, 2023, and returned on November 22."
Contexts Retrieved:
- [1] "Sam Altman fired from OpenAI (2023-11-17)"
- [2] "Microsoft CEO expressed support for Sam Altman"
Ground Truth Statements:
- "Sam Altman fired 2023-11-17" → Supported by Context 1 ✓
- "Sam Altman returned 2023-11-22" → Not found ✗
Context Recall = 1/2 = 0.5
Why it matters: Directly measures retrieval failure (missing necessary docs)
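`extract_statements` and `supports` are again placeholders. One simple realization delegates both to an LLM, reusing the hypothetical `llm_complete(prompt) -> str` wrapper from the previous sketch:

```python
from typing import List

def extract_statements(ground_truth: str) -> List[str]:
    # Break the reference answer into short, self-contained factual statements
    prompt = (
        "Break the following text into short, self-contained factual "
        f"statements, one per line:\n\n{ground_truth}"
    )
    return [line.strip("- ").strip() for line in llm_complete(prompt).splitlines() if line.strip()]

def supports(context: str, statement: str) -> bool:
    # Binary entailment check; an NLI model could be substituted here
    prompt = (
        "Does the context contain enough information to verify the statement? "
        f"Answer YES or NO.\n\nContext: {context}\n\nStatement: {statement}"
    )
    return llm_complete(prompt).strip().upper().startswith("YES")
```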
4. Context Precision
What fraction of retrieved documents actually contributed to the answer?
```python
from typing import List

def compute_context_precision(
    question: str,
    answer: str,
    contexts: List[str],
) -> float:
    """
    Check whether each retrieved context actually contributed to the answer
    """
    useful = 0
    for ctx in contexts:
        if contributes_to_answer(ctx, question, answer):  # helper assumed
            useful += 1
    return useful / len(contexts) if contexts else 0.0
```

Why it matters: Too much noise confuses the LLM → degrades answer quality
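The version above treats every position equally. Some frameworks instead weight each context by its rank, rewarding retrievers that place the useful documents first. A sketch of such a rank-aware variant, reusing the same assumed `contributes_to_answer` helper:

```python
from typing import List

def compute_context_precision_ranked(
    question: str,
    answer: str,
    contexts: List[str],
) -> float:
    # Average-precision style: each useful context contributes precision@k
    # at its own rank, so early hits count more than late ones.
    rels = [contributes_to_answer(ctx, question, answer) for ctx in contexts]
    if not any(rels):
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += hits / k
    return score / hits
```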
Relationship Between Metrics
Question → Context Quality (Recall, Precision) → Answer Quality (Faithfulness, Relevance) → Answer Correctness
Practical Implementation: Using RAGAS
RAGAS is a framework for RAG evaluation that computes these metrics easily.
Installation and Basic Usage
```python
# pip install ragas
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["Who is Tesla's CEO?"],
    "answer": ["Elon Musk is Tesla's CEO."],
    "contexts": [["Elon Musk is Tesla's CEO and founder."]],
    "ground_truth": ["Elon Musk"],  # Needed for Context Recall
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation (RAGAS calls an LLM judge under the hood, so the
# corresponding API key, e.g. OpenAI's by default, must be configured)
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ],
)

print(results)
```

Batch Evaluation
```python
from typing import List, Optional
import pandas as pd

def evaluate_rag_batch(
    questions: List[str],
    rag_system,
    ground_truths: Optional[List[str]] = None,
) -> pd.DataFrame:
    """Evaluate a RAG system on multiple questions."""
    results = []
    for i, question in enumerate(questions):
        # Run RAG
        answer, contexts = rag_system.query(question)
        # Evaluate
        result = {
            "question": question,
            "answer": answer,
            "faithfulness": compute_faithfulness(answer, contexts),
            "relevance": compute_answer_relevance(question, answer),
            "context_precision": compute_context_precision(
                question, answer, contexts
            ),
        }
        if ground_truths:
            result["context_recall"] = compute_context_recall(
                ground_truths[i], contexts
            )
        results.append(result)
    return pd.DataFrame(results)
```
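Usage might look like this; `rag_system` is assumed to expose a `query(question) -> (answer, contexts)` method, as in the loop above:

```python
questions = ["Who is Tesla's CEO?", "When was OpenAI founded?"]
df = evaluate_rag_batch(questions, rag_system)

# Per-metric means make it easy to spot regressions between runs
print(df[["faithfulness", "relevance", "context_precision"]].mean())
```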
Evaluation Without Ground Truth: LLM-as-Judge
When ground truth is unavailable, use an LLM as evaluator.
Faithfulness Evaluation
```python
FAITHFULNESS_PROMPT = """
Given the context and answer, determine if each claim in the answer
is supported by the context.

Context:
{context}

Answer:
{answer}

For each claim in the answer, respond with:
- Claim: [the claim]
- Verdict: [Supported/Not Supported]
- Evidence: [quote from context if supported]

Finally, provide the overall faithfulness score (0-1).
"""

def llm_faithfulness(answer: str, context: str, llm) -> float:
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    response = llm.generate(prompt)  # llm is any client exposing generate()
    return parse_faithfulness_score(response)  # parser sketched below
```
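`parse_faithfulness_score` is not defined above. A simple heuristic sketch, which assumes the judge ends its response with the numeric score, as the prompt requests:

```python
import re

def parse_faithfulness_score(response: str) -> float:
    # Heuristic: take the last standalone number between 0 and 1 in the
    # judge's response (the prompt asks for the score at the very end).
    numbers = [float(n) for n in re.findall(r"\b\d(?:\.\d+)?\b", response)]
    in_range = [n for n in numbers if 0.0 <= n <= 1.0]
    return in_range[-1] if in_range else 0.0

# parse_relevance_score (used below) can reuse the same logic
parse_relevance_score = parse_faithfulness_score
```

In practice, asking the judge for structured (JSON) output and parsing that is more robust than regex heuristics, especially for prompts where reasoning follows the score.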
Answer Relevance Evaluation

```python
RELEVANCE_PROMPT = """
Given the question and answer, rate how relevant the answer is
to the question on a scale of 0-1.

Question: {question}
Answer: {answer}

Consider:
- Does the answer address the question directly?
- Is the answer complete?
- Is there irrelevant information?

Score (0-1):
Reasoning:
"""

def llm_relevance(question: str, answer: str, llm) -> float:
    prompt = RELEVANCE_PROMPT.format(question=question, answer=answer)
    response = llm.generate(prompt)
    return parse_relevance_score(response)
```

Evaluation Strategy: When to Measure What
Metrics by Development Stage
- During development: Faithfulness + Answer Relevance (quick feedback, no ground truth needed)
- Retrieval tuning: Context Recall (plus Context Precision to check for noise)
- Production: all metrics, with per-category analysis
Evaluation Set Design
```python
# Include diverse question types
eval_set = {
    "simple": [  # Single doc sufficient
        "Who is Tesla's CEO?",
        "When was OpenAI founded?",
    ],
    "multi_hop": [  # Multiple docs needed
        "What did Microsoft's CEO say when OpenAI's CEO was fired?",
    ],
    "temporal": [  # Time reasoning required
        "Who was CEO before Sam Altman returned?",
    ],
    "comparison": [  # Comparison questions
        "Which sold more in 2023, Tesla or BYD?",
    ],
    "unanswerable": [  # Cannot be answered
        "What are Tesla's 2025 sales figures?",
    ],
}
```
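For the `unanswerable` category, the key check is whether the system abstains instead of hallucinating an answer. A minimal sketch (the refusal phrases are illustrative, not exhaustive):

```python
REFUSAL_MARKERS = [
    "i don't know",
    "cannot answer",
    "not enough information",
    "no information available",
]

def abstained(answer: str) -> bool:
    # Credit the system for declining to answer rather than inventing a figure
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)
```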
Automated Evaluation Pipeline

```python
from datetime import datetime
from typing import Dict, List
import numpy as np

class RAGEvaluator:
    def __init__(self, rag_system, llm_judge):
        self.rag = rag_system
        self.judge = llm_judge
        self.metrics_history = []

    def evaluate(self, eval_set: Dict[str, List[str]]) -> Dict:
        results = {}
        for category, questions in eval_set.items():
            category_results = []
            for question in questions:
                answer, contexts = self.rag.query(question)
                metrics = {
                    # Reuse the metric functions defined earlier
                    "faithfulness": compute_faithfulness(answer, contexts),
                    "relevance": compute_answer_relevance(question, answer),
                    "context_precision": compute_context_precision(
                        question, answer, contexts
                    ),
                }
                category_results.append(metrics)
            results[category] = {
                "avg_faithfulness": np.mean([r["faithfulness"] for r in category_results]),
                "avg_relevance": np.mean([r["relevance"] for r in category_results]),
                "avg_precision": np.mean([r["context_precision"] for r in category_results]),
            }
        self.metrics_history.append({
            "timestamp": datetime.now(),
            "results": results,
        })
        return results

    def compare_versions(self, v1_results: Dict, v2_results: Dict) -> Dict:
        """Compare two versions of a RAG system, category by category."""
        comparison = {}
        for category in v1_results:
            comparison[category] = {
                metric: v2_results[category][metric] - v1_results[category][metric]
                for metric in v1_results[category]
            }
        return comparison
```
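Typical usage, assuming two variants of the system, `rag_v1` and `rag_v2` (hypothetical), an LLM judge client, and the `eval_set` defined above:

```python
evaluator = RAGEvaluator(rag_v1, llm_judge)
v1_results = evaluator.evaluate(eval_set)

evaluator.rag = rag_v2  # swap in the new retrieval/generation stack
v2_results = evaluator.evaluate(eval_set)

# Positive deltas mean v2 improved that metric for that category
print(evaluator.compare_versions(v1_results, v2_results))
```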
Common Mistakes and Solutions
1. Ground Truth Dependency
Problem: Ground truth is too hard to create, so no evaluation happens
Solution: Faithfulness and Relevance don't require ground truth
2. Average Trap
Problem: Average Faithfulness is 0.8 but 0.3 on specific question types
Solution: Evaluate separately by question type
3. Metric Gaming
Problem: Making answers overly conservative to increase Faithfulness
Solution: Evaluate with Relevance too (detects too-short or tangential answers)
Conclusion
RAG evaluation must separately measure retrieval quality and answer quality.
Core Metrics:
- Context Quality — Context Recall (Was necessary info retrieved?), Context Precision (Was retrieval noise-free?)
- Answer Quality — Faithfulness (No hallucination?), Answer Relevance (Did it answer the question?)
Practical Recommendations:
- During development: Faithfulness + Relevance (quick feedback)
- Retrieval tuning: Context Recall (retrieval quality)
- Production: All metrics + per-category analysis
These four metrics let you diagnose exactly where your RAG system is failing.