RAG Evaluation: Beyond Precision/Recall

"How do I know if my RAG is working?" — Precision/Recall aren't enough. You need to measure Faithfulness, Relevance, and Context Recall to see the real quality.
Why Traditional Metrics Fall Short
Traditional IR (Information Retrieval) metrics such as Precision@k, Recall@k, and MRR score only the retrieval step: did the right documents come back?
Problem: they cannot distinguish between good retrieval with a bad answer and mediocre retrieval with a good answer (see the sketch below).
Case 1: Good Retrieval, Bad Answer — 3 relevant documents retrieved (high Precision), but the LLM distorts their content in the answer (Hallucination)
Case 2: Mediocre Retrieval, Good Answer — only 1 relevant document retrieved (low Precision), but that single document enabled an accurate answer
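To make the blind spot concrete, here is a minimal sketch of these retrieval-only metrics; the document IDs and relevance judgments are made up for illustration:

```python
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    # Fraction of the top-k retrieved documents that are relevant
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / k if k else 0.0

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    # Fraction of all relevant documents that appear in the top-k
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / len(relevant) if relevant else 0.0

# Both cases above can score identically here, because nothing in these
# formulas ever looks at the generated answer.
retrieved = ["doc_1", "doc_7", "doc_3"]
relevant = {"doc_1", "doc_3"}
print(precision_at_k(retrieved, relevant, k=3))  # ≈ 0.67
print(recall_at_k(retrieved, relevant, k=3))     # 1.0
```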
The Three Axes of RAG Evaluation
RAG systems should be evaluated on three axes:
Query → Retrieval → Generation
- Retrieval stage → Context Quality (Context Recall, Context Precision)
- Generation stage → Answer Quality (Faithfulness, Answer Relevance)
1. Context Quality
How well do retrieved documents match the question?
- Context Recall: Was necessary information retrieved?
- Context Precision: What fraction of retrieved docs are actually useful?
2. Answer Quality
How good is the generated answer?
- Faithfulness: Is the answer grounded in retrieved documents? (Hallucination check)
- Answer Relevance: Does the answer address the question?
3. End-to-End Quality
Final quality of the entire pipeline
- Answer Correctness: Is the answer actually correct? (Requires ground truth)
Core Metrics Deep Dive
1. Faithfulness
Are all claims in the answer supported by retrieved documents?
```python
from typing import List

def compute_faithfulness(answer: str, contexts: List[str]) -> float:
    """
    1. Extract individual claims from the answer
    2. Check whether each claim is supported by the contexts
    3. Return the ratio of supported claims
    """
    claims = extract_claims(answer)  # helper assumed; see the sketch below
    supported = 0
    for claim in claims:
        if is_supported_by_context(claim, contexts):  # helper assumed
            supported += 1
    return supported / len(claims) if claims else 0.0
```

Example:
Context: "Tesla cut prices by up to 20% on January 13, 2023."
Answer: "Tesla cut prices by 20% in January 2023, which caused competitors to lower their prices too."
Claims:
- "Tesla cut prices in January 2023" → Supported ✓
- "Prices cut by 20%" → Supported ✓
- "Competitors lowered prices" → Not in context ✗
Faithfulness = 2/3 = 0.67
Why it matters: Low Faithfulness = Hallucination risk
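The two helpers in the pseudocode are left abstract. A minimal, self-contained sketch using a naive lexical baseline (production systems typically delegate the support check to an NLI model or an LLM judge, as in the LLM-as-Judge section below):

```python
import re
from typing import List

def extract_claims(answer: str) -> List[str]:
    # Naive baseline: treat each sentence as a single claim. An LLM is
    # usually used instead, to split compound sentences into atomic claims.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]

def is_supported_by_context(claim: str, contexts: List[str]) -> bool:
    # Naive baseline: lexical overlap between the claim and any context.
    # The 0.6 threshold is arbitrary and only for illustration.
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    if not claim_tokens:
        return False
    for ctx in contexts:
        ctx_tokens = set(re.findall(r"\w+", ctx.lower()))
        if len(claim_tokens & ctx_tokens) / len(claim_tokens) >= 0.6:
            return True
    return False
```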
2. Answer Relevance
Does the answer actually answer the question?
```python
import numpy as np

def compute_answer_relevance(question: str, answer: str) -> float:
    """
    1. Generate candidate questions from the answer alone
    2. Measure similarity between the original and generated questions
    """
    # Guess the question from just the answer
    generated_questions = generate_questions_from_answer(answer, n=3)  # helper assumed; see below
    # Similarity to the original question
    similarities = [
        semantic_similarity(question, gen_q)  # helper assumed; see below
        for gen_q in generated_questions
    ]
    return float(np.mean(similarities)) if similarities else 0.0
```

Example:
Question: "Who is Tesla's CEO?"
Answer: "Tesla is an electric vehicle company."
Generated Questions from Answer:
- "What kind of company is Tesla?"
- "What are electric vehicle companies?"
Low similarity to original → Low Answer Relevance
Why it matters: Detects when the LLM ignored the question or gave a tangential answer
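Again, the two helpers are assumptions. One way to realize them, sketched with sentence-transformers for similarity and a hypothetical `llm_complete(prompt) -> str` wrapper around your LLM for question generation:

```python
from typing import List
from sentence_transformers import SentenceTransformer, util

# Small general-purpose embedding model; any sentence embedding model works
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(text_a: str, text_b: str) -> float:
    # Cosine similarity between the two sentence embeddings
    embeddings = _embedder.encode([text_a, text_b], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

def generate_questions_from_answer(answer: str, n: int = 3) -> List[str]:
    # Ask an LLM to reverse-engineer questions that the answer would satisfy.
    # llm_complete is a hypothetical helper, not part of the original article.
    prompt = (
        f"Write {n} different questions that the following answer would "
        f"directly answer, one question per line:\n\n{answer}"
    )
    lines = llm_complete(prompt).splitlines()
    return [line.lstrip("0123456789.-) ").strip() for line in lines if line.strip()][:n]
```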
3. Context Recall
Is the information needed for the answer present in retrieved documents?
```python
from typing import List

def compute_context_recall(
    ground_truth: str,
    contexts: List[str],
) -> float:
    """
    1. Extract key statements from the ground-truth answer
    2. Check whether each statement is supported by the retrieved contexts
    """
    gt_statements = extract_statements(ground_truth)  # helper assumed; see below
    attributed = 0
    for statement in gt_statements:
        if any(supports(ctx, statement) for ctx in contexts):  # helper assumed
            attributed += 1
    return attributed / len(gt_statements) if gt_statements else 0.0
```

Example:
Ground Truth: "Sam Altman was fired on November 17, 2023, and returned on November 22."
Contexts Retrieved:
- [1] "Sam Altman fired from OpenAI (2023-11-17)"
- [2] "Microsoft CEO expressed support for Sam Altman"
Ground Truth Statements:
- "Sam Altman fired 2023-11-17" → Supported by Context 1 ✓
- "Sam Altman returned 2023-11-22" → Not found ✗
Context Recall = 1/2 = 0.5
Why it matters: Directly measures retrieval failure (missing necessary docs)
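`extract_statements` and `supports` are again placeholders. One simple realization delegates both to an LLM, reusing the hypothetical `llm_complete(prompt) -> str` wrapper from the previous sketch:

```python
from typing import List

def extract_statements(ground_truth: str) -> List[str]:
    # Break the reference answer into short, self-contained factual statements
    prompt = (
        "Break the following text into short, self-contained factual "
        f"statements, one per line:\n\n{ground_truth}"
    )
    return [line.strip("- ").strip() for line in llm_complete(prompt).splitlines() if line.strip()]

def supports(context: str, statement: str) -> bool:
    # Binary entailment check; an NLI model could be substituted here
    prompt = (
        "Does the context contain enough information to verify the statement? "
        f"Answer YES or NO.\n\nContext: {context}\n\nStatement: {statement}"
    )
    return llm_complete(prompt).strip().upper().startswith("YES")
```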
4. Context Precision
What fraction of retrieved documents actually contributed to the answer?
```python
from typing import List

def compute_context_precision(
    question: str,
    answer: str,
    contexts: List[str],
) -> float:
    """
    Check whether each retrieved context actually contributed to the answer
    """
    useful = 0
    for ctx in contexts:
        if contributes_to_answer(ctx, question, answer):  # helper assumed
            useful += 1
    return useful / len(contexts) if contexts else 0.0
```

Why it matters: Too much noise confuses the LLM → degrades answer quality
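The version above treats every position equally. Some frameworks instead weight each context by its rank, rewarding retrievers that place the useful documents first. A sketch of such a rank-aware variant, reusing the same assumed `contributes_to_answer` helper:

```python
from typing import List

def compute_context_precision_ranked(
    question: str,
    answer: str,
    contexts: List[str],
) -> float:
    # Average-precision style: each useful context contributes precision@k
    # at its own rank, so early hits count more than late ones.
    rels = [contributes_to_answer(ctx, question, answer) for ctx in contexts]
    if not any(rels):
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += hits / k
    return score / hits
```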
Relationship Between Metrics
Question → Context Quality (Recall, Precision) → Answer Quality (Faithfulness, Relevance) → Answer Correctness
Practical Implementation: Using RAGAS
RAGAS is a framework for RAG evaluation that computes these metrics easily.
Installation and Basic Usage
```python
# pip install ragas
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["Who is Tesla's CEO?"],
    "answer": ["Elon Musk is Tesla's CEO."],
    "contexts": [["Elon Musk is Tesla's CEO and founder."]],
    "ground_truth": ["Elon Musk"],  # Needed for Context Recall
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation (RAGAS calls an LLM judge under the hood, so the
# corresponding API key, e.g. OpenAI's by default, must be configured)
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ],
)

print(results)
```

Batch Evaluation
```python
from typing import List, Optional
import pandas as pd

def evaluate_rag_batch(
    questions: List[str],
    rag_system,
    ground_truths: Optional[List[str]] = None,
) -> pd.DataFrame:
    """Evaluate a RAG system on multiple questions."""
    results = []
    for i, question in enumerate(questions):
        # Run RAG
        answer, contexts = rag_system.query(question)
        # Evaluate
        result = {
            "question": question,
            "answer": answer,
            "faithfulness": compute_faithfulness(answer, contexts),
            "relevance": compute_answer_relevance(question, answer),
            "context_precision": compute_context_precision(
                question, answer, contexts
            ),
        }
        if ground_truths:
            result["context_recall"] = compute_context_recall(
                ground_truths[i], contexts
            )
        results.append(result)
    return pd.DataFrame(results)
```
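Usage might look like this; `rag_system` is assumed to expose a `query(question) -> (answer, contexts)` method, as in the loop above:

```python
questions = ["Who is Tesla's CEO?", "When was OpenAI founded?"]
df = evaluate_rag_batch(questions, rag_system)

# Per-metric means make it easy to spot regressions between runs
print(df[["faithfulness", "relevance", "context_precision"]].mean())
```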
Evaluation Without Ground Truth: LLM-as-Judge
When ground truth is unavailable, use an LLM as evaluator.
Faithfulness Evaluation
```python
FAITHFULNESS_PROMPT = """
Given the context and answer, determine if each claim in the answer
is supported by the context.

Context:
{context}

Answer:
{answer}

For each claim in the answer, respond with:
- Claim: [the claim]
- Verdict: [Supported/Not Supported]
- Evidence: [quote from context if supported]

Finally, provide the overall faithfulness score (0-1).
"""

def llm_faithfulness(answer: str, context: str, llm) -> float:
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    response = llm.generate(prompt)  # llm is any client exposing generate()
    return parse_faithfulness_score(response)  # parser sketched below
```
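`parse_faithfulness_score` is not defined above. A simple heuristic sketch, which assumes the judge ends its response with the numeric score, as the prompt requests:

```python
import re

def parse_faithfulness_score(response: str) -> float:
    # Heuristic: take the last standalone number between 0 and 1 in the
    # judge's response (the prompt asks for the score at the very end).
    numbers = [float(n) for n in re.findall(r"\b\d(?:\.\d+)?\b", response)]
    in_range = [n for n in numbers if 0.0 <= n <= 1.0]
    return in_range[-1] if in_range else 0.0

# parse_relevance_score (used below) can reuse the same logic
parse_relevance_score = parse_faithfulness_score
```

In practice, asking the judge for structured (JSON) output and parsing that is more robust than regex heuristics, especially for prompts where reasoning follows the score.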
Answer Relevance Evaluation

```python
RELEVANCE_PROMPT = """
Given the question and answer, rate how relevant the answer is
to the question on a scale of 0-1.

Question: {question}
Answer: {answer}

Consider:
- Does the answer address the question directly?
- Is the answer complete?
- Is there irrelevant information?

Score (0-1):
Reasoning:
"""

def llm_relevance(question: str, answer: str, llm) -> float:
    prompt = RELEVANCE_PROMPT.format(question=question, answer=answer)
    response = llm.generate(prompt)
    return parse_relevance_score(response)
```

Evaluation Strategy: When to Measure What
Metrics by Development Stage
- During development: Faithfulness + Answer Relevance (quick feedback, no ground truth needed)
- Retrieval tuning: Context Recall (plus Context Precision to check for noise)
- Production: all metrics, with per-category analysis
Evaluation Set Design
```python
# Include diverse question types
eval_set = {
    "simple": [  # Single doc sufficient
        "Who is Tesla's CEO?",
        "When was OpenAI founded?",
    ],
    "multi_hop": [  # Multiple docs needed
        "What did Microsoft's CEO say when OpenAI's CEO was fired?",
    ],
    "temporal": [  # Time reasoning required
        "Who was CEO before Sam Altman returned?",
    ],
    "comparison": [  # Comparison questions
        "Which sold more in 2023, Tesla or BYD?",
    ],
    "unanswerable": [  # Cannot be answered
        "What are Tesla's 2025 sales figures?",
    ],
}
```
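For the `unanswerable` category, the key check is whether the system abstains instead of hallucinating an answer. A minimal sketch (the refusal phrases are illustrative, not exhaustive):

```python
REFUSAL_MARKERS = [
    "i don't know",
    "cannot answer",
    "not enough information",
    "no information available",
]

def abstained(answer: str) -> bool:
    # Credit the system for declining to answer rather than inventing a figure
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)
```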
Automated Evaluation Pipeline

```python
from datetime import datetime
from typing import Dict, List
import numpy as np

class RAGEvaluator:
    def __init__(self, rag_system, llm_judge):
        self.rag = rag_system
        self.judge = llm_judge
        self.metrics_history = []

    def evaluate(self, eval_set: Dict[str, List[str]]) -> Dict:
        results = {}
        for category, questions in eval_set.items():
            category_results = []
            for question in questions:
                answer, contexts = self.rag.query(question)
                metrics = {
                    # Reuse the metric functions defined earlier
                    "faithfulness": compute_faithfulness(answer, contexts),
                    "relevance": compute_answer_relevance(question, answer),
                    "context_precision": compute_context_precision(
                        question, answer, contexts
                    ),
                }
                category_results.append(metrics)
            results[category] = {
                "avg_faithfulness": np.mean([r["faithfulness"] for r in category_results]),
                "avg_relevance": np.mean([r["relevance"] for r in category_results]),
                "avg_precision": np.mean([r["context_precision"] for r in category_results]),
            }
        self.metrics_history.append({
            "timestamp": datetime.now(),
            "results": results,
        })
        return results

    def compare_versions(self, v1_results: Dict, v2_results: Dict) -> Dict:
        """Compare two versions of a RAG system, category by category."""
        comparison = {}
        for category in v1_results:
            comparison[category] = {
                metric: v2_results[category][metric] - v1_results[category][metric]
                for metric in v1_results[category]
            }
        return comparison
```
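Typical usage, assuming two variants of the system, `rag_v1` and `rag_v2` (hypothetical), an LLM judge client, and the `eval_set` defined above:

```python
evaluator = RAGEvaluator(rag_v1, llm_judge)
v1_results = evaluator.evaluate(eval_set)

evaluator.rag = rag_v2  # swap in the new retrieval/generation stack
v2_results = evaluator.evaluate(eval_set)

# Positive deltas mean v2 improved that metric for that category
print(evaluator.compare_versions(v1_results, v2_results))
```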
Common Mistakes and Solutions
1. Ground Truth Dependency
Problem: Ground truth is too hard to create, so no evaluation happens
Solution: Faithfulness and Relevance don't require ground truth
2. Average Trap
Problem: Average Faithfulness is 0.8 but 0.3 on specific question types
Solution: Evaluate separately by question type
3. Metric Gaming
Problem: Making answers overly conservative to increase Faithfulness
Solution: Evaluate with Relevance too (detects too-short or tangential answers)
Conclusion
RAG evaluation must separately measure retrieval quality and answer quality.
Core Metrics:
- Context Quality — Context Recall (Was necessary info retrieved?), Context Precision (Was retrieval noise-free?)
- Answer Quality — Faithfulness (No hallucination?), Answer Relevance (Did it answer the question?)
Practical Recommendations:
- During development: Faithfulness + Relevance (quick feedback)
- Retrieval tuning: Context Recall (retrieval quality)
- Production: All metrics + per-category analysis
These four metrics let you diagnose exactly where your RAG system is failing.