DeepEval Guide — Part 2: RAG Metrics

What is RAG?

RAG = Retrieval-Augmented Generation. Your system: 1. Takes a user question 2. Searches a database for relevant documents 3. Sends those documents + question to an LLM 4. LLM generates an answer based on those documents

RAG metrics test TWO things: - Did the system find the RIGHT documents? (retrieval quality) - Did the LLM use those documents CORRECTLY? (generation quality)

Metric 1: Answer Relevancy

What it checks: Is the answer about the same topic as the question?

How it works: The judge LLM reads the question and answer, then scores how well the answer addresses what was asked. It does NOT check if the answer is correct — only if it talks about the right topic. An answer can be relevant but wrong ("Paris is in Germany" is relevant to "Where is Paris?" but incorrect).

Required fields: input, actual_output

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    """Answer talks about the topic from the question."""
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris."
    )
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])

def test_answer_relevancy_fail():
    """Answer talks about something completely different."""
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The Eiffel Tower was built in 1889 for the World's Fair."
    )
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])

Metric 2: Faithfulness

What it checks: Does the answer only use facts from the retrieved documents?

This is your main protection against hallucination — when the LLM invents information that is not in the source documents.

How it works: The judge LLM breaks actual_output into separate facts (e.g. "employees get 20 days", "vacation is paid"), then checks whether each fact exists in retrieval_context. If the answer contains made-up facts, it fails.

Required fields: input, actual_output, retrieval_context

Not the same as HallucinationMetric (Part 3). Faithfulness uses retrieval_context — documents your RAG system found automatically. HallucinationMetric uses context — facts you provide manually as ground truth. Faithfulness uses a min threshold (high score = good, pass ≥), Hallucination uses a max threshold (low score = good, pass ≤).

from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def test_faithfulness_pass():
    """All facts in the answer come from retrieved docs."""
    test_case = LLMTestCase(
        input="What is the company vacation policy?",
        actual_output="Employees get 20 days of paid vacation per year.",
        retrieval_context=[
            "Company policy: All full-time employees receive "
            "20 days of paid vacation annually."
        ],
    )
    metric = FaithfulnessMetric(threshold=0.7)
    assert_test(test_case, [metric])

def test_faithfulness_fail():
    """Answer contains facts NOT in the retrieved documents."""
    test_case = LLMTestCase(
        input="What is the company vacation policy?",
        actual_output="Employees get 30 days of vacation and free lunch.",
        retrieval_context=[
            "Company policy: All full-time employees receive "
            "20 days of paid vacation annually."
        ],
    )
    metric = FaithfulnessMetric(threshold=0.7)
    assert_test(test_case, [metric])

Metric 3: Contextual Precision

What it checks: Are the useful documents ranked higher than useless ones?

Why ranking matters: The LLM reads retrieved documents from top to bottom. If irrelevant documents come first, the LLM may get confused or ignore the useful content that appears later. Good ranking = better answers.

How it works: The judge compares each document in retrieval_context to expected_output. Documents that help answer the question should appear first.

Required fields: input, actual_output, expected_output, retrieval_context

from deepeval import assert_test
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

def test_contextual_precision():
    """Useful document is ranked first."""
    test_case = LLMTestCase(
        input="What causes rain?",
        actual_output="Rain is caused by water evaporation and condensation.",
        expected_output="Rain happens when water evaporates, rises, "
        "condenses into clouds, and falls as droplets.",
        retrieval_context=[
            "Water evaporates from the surface, rises into the "
            "atmosphere, cools, and condenses into rain droplets.",
            "The history of weather forecasting dates back to "
            "ancient civilizations who observed cloud patterns.",
        ],
    )
    metric = ContextualPrecisionMetric(threshold=0.7)
    assert_test(test_case, [metric])

Metric 4: Contextual Recall

What it checks: Did the retriever find ALL the information needed?

How it works: The judge checks if every sentence in expected_output can be supported by at least one document in retrieval_context.

Required fields: input, actual_output, expected_output, retrieval_context

from deepeval import assert_test
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

def test_contextual_recall():
    """Retrieved docs cover all facts from expected answer."""
    test_case = LLMTestCase(
        input="What are symptoms of flu?",
        actual_output="Flu symptoms include fever, cough, and body aches.",
        expected_output="Flu symptoms: fever, cough, body aches, fatigue.",
        retrieval_context=[
            "Influenza symptoms include high fever, persistent "
            "cough, muscle and body aches, and extreme fatigue.",
        ],
    )
    metric = ContextualRecallMetric(threshold=0.7)
    assert_test(test_case, [metric])

Metric 5: Contextual Relevancy

What it checks: Are ALL retrieved documents relevant to the question?

How it works: The judge reads each document in retrieval_context and checks if it is relevant to the input. If you retrieved 3 documents but only 1 is useful, the score will be low.

Required fields: input, actual_output, retrieval_context

from deepeval import assert_test
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_contextual_relevancy():
    """All retrieved documents are relevant to the question."""
    test_case = LLMTestCase(
        input="How does photosynthesis work?",
        actual_output="Plants convert sunlight into energy using chlorophyll.",
        retrieval_context=[
            "Photosynthesis is the process where plants use sunlight, "
            "water, and CO2 to produce glucose and oxygen.",
            "Chlorophyll in plant leaves absorbs light energy "
            "to drive the photosynthesis reaction.",
        ],
    )
    metric = ContextualRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])

The RAG Triad (Referenceless Evaluation)

If you do NOT have expected_output (no ground truth), you can still evaluate RAG using three metrics that need only input, actual_output, and retrieval_context:

Answer Relevancy — is the answer relevant to the question?
Faithfulness — does the answer use only facts from the documents?
Contextual Relevancy — are the retrieved documents relevant?

from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How long can I stay after graduation on F-1 visa?",
    actual_output="You can stay up to 60 days after completing your degree.",
    retrieval_context=[
        "F-1 visa holders may stay for 60 days after degree completion."
    ],
)
evaluate(
    test_cases=[test_case],
    metrics=[
        AnswerRelevancyMetric(),
        FaithfulnessMetric(),
        ContextualRelevancyMetric(),
    ],
)

The evaluate() function runs all metrics on all test cases in bulk. Use it instead of assert_test when evaluating datasets (not pytest).

RAG Metrics Summary Table

Metric	Question It Answers	Required Fields
Answer Relevancy	Is the answer relevant to the question?	input, actual_output
Faithfulness	Does the answer only use facts from docs?	input, actual_output, retrieval_context
Contextual Precision	Are relevant docs ranked higher?	input, actual_output, expected_output, retrieval_context
Contextual Recall	Did we find ALL needed info?	input, actual_output, expected_output, retrieval_context
Contextual Relevancy	Are ALL retrieved docs useful?	input, actual_output, retrieval_context

RAGAS Metric (ragas library wrapper)

DeepEval also includes RagasMetric — a wrapper that averages all four RAG metrics from the ragas library. Install with pip install ragas. DeepEval's native RAG metrics above are recommended instead (debuggable, JSON-confineable, better ecosystem integration). See Part 18 for details.