DeepEval Guide — Part 15: RAG Diagnostic Testing
Based on the Improving Retrieval Quality series by Jeff Nyman (TesterStories, Feb 2026).
The Problem
You built a RAG system (Part 14), ran metrics, and got low scores. Now what? This part shows a systematic diagnostic methodology to find the root cause.
Phase 1: Establish a Baseline
Always start with a baseline configuration before changing anything:
BASELINE = {"chunk_size": 1000, "chunk_overlap": 200, "k": 3}
retriever, num_chunks = create_rag_system(**BASELINE)
results = run_rag_test(
    retriever=retriever,
    question="What energy source does the paper propose?",
    expected_output="Matter/antimatter annihilation requiring 10^28 kg.",
)
# Record: Contextual Precision = 0.33, Faithfulness = 0.67
Document your baseline scores — all future experiments compare to this.
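A small helper makes "compare to baseline" mechanical. This is a sketch, not part of the guide's Part 14 code, and the score values shown are illustrative placeholders:

```python
def delta_vs_baseline(baseline_scores: dict, experiment_scores: dict) -> dict:
    """Return per-metric deltas (positive = improvement over baseline)."""
    return {
        metric: round(experiment_scores[metric] - base, 2)
        for metric, base in baseline_scores.items()
    }

# Illustrative values; substitute your own recorded scores.
baseline = {"Contextual Precision": 0.33, "Faithfulness": 0.67}
experiment = {"Contextual Precision": 0.50, "Faithfulness": 0.67}

print(delta_vs_baseline(baseline, experiment))
# {'Contextual Precision': 0.17, 'Faithfulness': 0.0}
```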
Phase 2: Parameter Tuning Experiments
Run 4 experiments, changing ONE variable at a time:
Experiment 1: Smaller Chunks
retriever, _ = create_rag_system(chunk_size=500, chunk_overlap=100, k=3)
# Hypothesis: smaller chunks = more precise retrieval
Experiment 2: More Retrieval (Higher k)
retriever, _ = create_rag_system(chunk_size=1000, chunk_overlap=200, k=6)
# Hypothesis: more chunks = higher chance of finding relevant content
Experiment 3: Both Combined
retriever, _ = create_rag_system(chunk_size=500, chunk_overlap=100, k=6)
# Hypothesis: combining both improvements may compound benefits
Experiment 4: Semantic Chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings
semantic_splitter = SemanticChunker(
    OllamaEmbeddings(model="nomic-embed-text"),
    breakpoint_threshold_type="percentile",
)
# Hypothesis: semantic boundaries produce more coherent chunks
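The percentile-breakpoint idea behind SemanticChunker can be illustrated without an embedding model: score the similarity of adjacent sentences (here with a crude bag-of-words cosine standing in for embeddings) and split wherever similarity drops below a chosen percentile. A toy sketch, not SemanticChunker's actual implementation:

```python
import math
import re
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (toy stand-in for embedding similarity)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def toy_semantic_chunks(sentences: list[str], percentile: float = 25.0) -> list[list[str]]:
    """Split at sentence boundaries whose similarity falls below the percentile cutoff."""
    sims = [bow_cosine(sentences[i], sentences[i + 1]) for i in range(len(sentences) - 1)]
    cutoff = sorted(sims)[int(len(sims) * percentile / 100)]
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim <= cutoff:  # low similarity = semantic boundary
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks

sentences = [
    "The warp bubble forms around the ship.",
    "The bubble contracts space ahead of the ship.",
    "Energy requirements are enormous.",
    "The energy comes from antimatter annihilation.",
]
print(toy_semantic_chunks(sentences))  # splits into two chunks at the topic shift
```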
Recording Results
experiments = {
"baseline": {"chunk_size": 1000, "chunk_overlap": 200, "k": 3},
"small_chunks": {"chunk_size": 500, "chunk_overlap": 100, "k": 3},
"more_retrieval": {"chunk_size": 1000, "chunk_overlap": 200, "k": 6},
"both": {"chunk_size": 500, "chunk_overlap": 100, "k": 6},
}
for name, params in experiments.items():
    retriever, _ = create_rag_system(**params)
    results = run_rag_test(retriever, question, expected_output)
    scores = get_scores(results)
    print(f"{name}: CP={scores['Contextual Precision']}, "
          f"F={scores['Faithfulness']}")
Semantic chunking requires a different code path (see Experiment 4 above) because it replaces RecursiveCharacterTextSplitter entirely.
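To make the comparison step explicit, a helper can print every experiment side by side and flag the best configuration. The helper and score values below are illustrative, not results from the source:

```python
def print_comparison(all_scores: dict, metric: str = "Contextual Precision") -> str:
    """Print a score table and return the name of the best config for `metric`."""
    width = max(len(name) for name in all_scores)
    for name, scores in all_scores.items():
        row = ", ".join(f"{m}={v:.2f}" for m, v in scores.items())
        print(f"{name:<{width}}  {row}")
    return max(all_scores, key=lambda name: all_scores[name][metric])

# Placeholder scores for illustration only.
all_scores = {
    "baseline": {"Contextual Precision": 0.33, "Faithfulness": 0.67},
    "small_chunks": {"Contextual Precision": 0.33, "Faithfulness": 0.67},
    "more_retrieval": {"Contextual Precision": 0.50, "Faithfulness": 0.67},
    "both": {"Contextual Precision": 0.50, "Faithfulness": 0.67},
}
best = print_comparison(all_scores)  # "more_retrieval" (ties broken by dict order)
```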
If all four experiments fail to improve the scores, the problem is not the chunking strategy. Move to Phase 3.
Phase 3: Query-Type Analysis
Instead of fixing the system, question whether it's actually broken.
Insight: Different Queries Suit Different Content
Technical documents typically have four section types:
- Conceptual — intros, theory, literature reviews (many keywords)
- Methodological — procedures, equations, derivations (math-heavy)
- Results — data, calculations, specific numbers (number-heavy)
- Discussion — implications, interpretation (keywords again)
Semantic search works best on keyword-rich sections (conceptual, discussion) because embedding models can match those terms easily. Math-heavy and number-heavy sections have weak semantic signals, so factual queries about specific numbers often fail — even though the answer is there.
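This claim can be felt with a crude proxy: treat bag-of-words cosine similarity as a stand-in for embedding similarity and compare a conceptual query against keyword-rich text versus a factual query against number-heavy text. Purely illustrative; real embedding models behave more subtly, but the gap points the same way:

```python
import math
import re
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Bag-of-words cosine, a crude proxy for embedding similarity."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Keyword-rich (conceptual) chunk vs. number-heavy (factual) chunk.
conceptual_chunk = "Warp bubble formation relies on manipulating compactified extra dimensions."
factual_chunk = "E = mc^2 yields 9.0e44 J, requiring 1.0e28 kg of antimatter."

conceptual_sim = similarity(
    "How does manipulating extra dimensions create a warp bubble?", conceptual_chunk)
factual_sim = similarity(
    "What energy source does the paper propose?", factual_chunk)

print(conceptual_sim > factual_sim)  # True: the conceptual pairing has a far stronger signal
```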
Test Multiple Query Types
queries = [
    {
        "question": "How does manipulating extra dimensions create a warp bubble?",
        "expected": "Extra dimensions are compactified using Kaluza-Klein modes...",
        "type": "conceptual",
    },
    {
        "question": "What role do Kaluza-Klein modes play?",
        "expected": "Kaluza-Klein modes provide the theoretical basis...",
        "type": "conceptual",
    },
    {
        "question": "What energy source does the paper propose?",
        "expected": "Matter/antimatter annihilation requiring 10^28 kg.",
        "type": "factual",
    },
]
for q in queries:
    results = run_rag_test(retriever, q["question"], q["expected"])
    scores = get_scores(results)
    print(f"[{q['type']}] {q['question'][:50]}...")
    print(f"  CP={scores['Contextual Precision']}, F={scores['Faithfulness']}")
Expected pattern: Conceptual queries score near 1.0 while factual queries score much lower. This proves the system works — just not for all query types equally.
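Aggregating scores by query type makes this pattern easy to read off. The per-query scores below are hypothetical:

```python
from collections import defaultdict

def mean_by_type(results: list[dict]) -> dict:
    """Average Contextual Precision per query type."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["type"]].append(r["cp"])
    return {t: round(sum(v) / len(v), 2) for t, v in buckets.items()}

# Hypothetical per-query scores, keyed by the `type` field from the queries list.
results = [
    {"type": "conceptual", "cp": 1.0},
    {"type": "conceptual", "cp": 1.0},
    {"type": "factual", "cp": 0.33},
]

print(mean_by_type(results))  # {'conceptual': 1.0, 'factual': 0.33}
```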
The Diagnostic Cycle
Phase 1: Baseline → "How bad is it?"
Phase 2: Parameter Tuning → "Is it a configuration problem?"
Phase 3: Query-Type Tests → "Is it a query-type problem?"
Phase 4: Cross-Document Tests → "Is it a document-structure problem?"
| Phase | All fail | Some succeed | All succeed |
|---|---|---|---|
| 2 (params) | → Phase 3 | Use best config | Done ✅ |
| 3 (queries) | → Phase 4 | Query-type mismatch | System works ✅ |
| 4 (documents) | Fundamental issue | Structure-dependent | System works ✅ |
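If you automate the diagnostic loop, the decision table can be encoded directly. A sketch using the table's outcomes verbatim:

```python
def next_step(phase: int, outcome: str) -> str:
    """Map a phase's outcome to the next diagnostic action (from the table above)."""
    table = {
        (2, "all_fail"): "go to Phase 3",
        (2, "some_succeed"): "use best config",
        (2, "all_succeed"): "done",
        (3, "all_fail"): "go to Phase 4",
        (3, "some_succeed"): "query-type mismatch",
        (3, "all_succeed"): "system works",
        (4, "all_fail"): "fundamental issue",
        (4, "some_succeed"): "structure-dependent",
        (4, "all_succeed"): "system works",
    }
    return table[(phase, outcome)]

print(next_step(2, "all_fail"))  # go to Phase 3
```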
Phase 4: Cross-Document Testing
Test the same config and queries on documents with different structures:
documents = [
{"path": "paper_equations.pdf", "type": "equation-heavy"},
{"path": "paper_prose.pdf", "type": "prose-integrated"},
]
for doc in documents:
    retriever, _ = create_rag_system(doc["path"], **BASELINE)
    for q in queries:
        results = run_rag_test(retriever, q["question"], q["expected"])
        scores = get_scores(results)
        print(f"[{doc['type']}][{q['type']}] "
              f"CP={scores['Contextual Precision']}, F={scores['Faithfulness']}")
Key insight — document structure determines retrievability:
- Equation-heavy docs (formulas in dedicated sections) — factual queries fail; semantic search can't match sparse numeric content
- Prose-integrated docs (equations in narrative) — factual queries succeed; surrounding text provides rich semantic signal
Fixes for Retrieval Failures
- Hybrid search — combine semantic + keyword (BM25) retrieval
- Metadata tags — label sections as "theoretical" vs "calculation"
- Query routing — route factual vs conceptual queries differently
- Multiple retrievers — use different strategies per query type
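Hybrid search is usually implemented by fusing the two ranked lists; reciprocal rank fusion (RRF) is a common rule, and LangChain's EnsembleRetriever uses a weighted variant of it. A minimal RRF sketch with made-up chunk IDs:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Made-up chunk IDs: semantic search and BM25 disagree; fusion balances them.
semantic_ranking = ["chunk_theory", "chunk_energy", "chunk_intro"]
bm25_ranking = ["chunk_energy", "chunk_table", "chunk_intro"]

fused = rrf_fuse([semantic_ranking, bm25_ranking])
print(fused[0])  # chunk_energy: it ranks high in both lists
```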
Key Lessons
- Test multiple query types — conceptual, factual, and numerical
- Test multiple document types — one document hides failure modes
- Change one variable at a time — isolate each cause
- Low scores ≠ broken system — they reveal operational characteristics
- Multidimensional testing = configuration × query type × document type; single-dimension testing gives incomplete, misleading results