DeepEval Guide — Part 15: RAG Diagnostic Testing
Based on the Improving Retrieval Quality series by Jeff Nyman (TesterStories, Feb 2026).
The Problem
You built a RAG system (Part 14), ran metrics, and got low scores. Now what? This part shows a systematic diagnostic methodology to find the root cause.
Phase 1: Establish a Baseline
Always start with a baseline configuration before changing anything:
BASELINE = {"chunk_size": 1000, "chunk_overlap": 200, "k": 3}
retriever, num_chunks = create_rag_system(**BASELINE)
results = run_rag_test(
    retriever=retriever,
    question="What energy source does the paper propose?",
    expected_output="Matter/antimatter annihilation requiring 10^28 kg.",
)
# Record: Contextual Precision = 0.33, Faithfulness = 0.67
Document your baseline scores — all future experiments compare to this.
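A small helper makes "compare to baseline" mechanical. This is a sketch, not part of the guide's Part 14 code, and the score values shown are illustrative placeholders:

```python
def delta_vs_baseline(baseline_scores: dict, experiment_scores: dict) -> dict:
    """Return per-metric deltas (positive = improvement over baseline)."""
    return {
        metric: round(experiment_scores[metric] - base, 2)
        for metric, base in baseline_scores.items()
    }

# Illustrative values; substitute your own recorded scores.
baseline = {"Contextual Precision": 0.33, "Faithfulness": 0.67}
experiment = {"Contextual Precision": 0.50, "Faithfulness": 0.67}

print(delta_vs_baseline(baseline, experiment))
# {'Contextual Precision': 0.17, 'Faithfulness': 0.0}
```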
Phase 2: Parameter Tuning Experiments
Run 4 experiments, changing ONE variable at a time:
Experiment 1: Smaller Chunks
retriever, _ = create_rag_system(chunk_size=500, chunk_overlap=100, k=3)
# Hypothesis: smaller chunks = more precise retrieval
Experiment 2: More Retrieval (Higher k)
retriever, _ = create_rag_system(chunk_size=1000, chunk_overlap=200, k=6)
# Hypothesis: more chunks = higher chance of finding relevant content
Experiment 3: Both Combined
retriever, _ = create_rag_system(chunk_size=500, chunk_overlap=100, k=6)
# Hypothesis: combining both improvements may compound benefits
Experiment 4: Semantic Chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings
semantic_splitter = SemanticChunker(
    OllamaEmbeddings(model="nomic-embed-text"),
    breakpoint_threshold_type="percentile",
)
# Hypothesis: semantic boundaries produce more coherent chunks
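The percentile-breakpoint idea behind SemanticChunker can be illustrated without an embedding model: score the similarity of adjacent sentences (here with a crude bag-of-words cosine standing in for embeddings) and split wherever similarity drops below a chosen percentile. A toy sketch, not SemanticChunker's actual implementation:

```python
import math
import re
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (toy stand-in for embedding similarity)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def toy_semantic_chunks(sentences: list[str], percentile: float = 25.0) -> list[list[str]]:
    """Split at sentence boundaries whose similarity falls below the percentile cutoff."""
    sims = [bow_cosine(sentences[i], sentences[i + 1]) for i in range(len(sentences) - 1)]
    cutoff = sorted(sims)[int(len(sims) * percentile / 100)]
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim <= cutoff:  # low similarity = semantic boundary
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks

sentences = [
    "The warp bubble forms around the ship.",
    "The bubble contracts space ahead of the ship.",
    "Energy requirements are enormous.",
    "The energy comes from antimatter annihilation.",
]
print(toy_semantic_chunks(sentences))  # splits into two chunks at the topic shift
```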
Recording Results
experiments = {
"baseline": {"chunk_size": 1000, "chunk_overlap": 200, "k": 3},
"small_chunks": {"chunk_size": 500, "chunk_overlap": 100, "k": 3},
"more_retrieval": {"chunk_size": 1000, "chunk_overlap": 200, "k": 6},
"both": {"chunk_size": 500, "chunk_overlap": 100, "k": 6},
}
for name, params in experiments.items():
    retriever, _ = create_rag_system(**params)
    results = run_rag_test(retriever, question, expected_output)
    scores = get_scores(results)
    print(f"{name}: CP={scores['Contextual Precision']}, "
          f"F={scores['Faithfulness']}")
Semantic chunking requires a different code path (see Experiment 4 above) because it replaces RecursiveCharacterTextSplitter entirely.
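To make the comparison step explicit, a helper can print every experiment side by side and flag the best configuration. The helper and score values below are illustrative, not results from the source:

```python
def print_comparison(all_scores: dict, metric: str = "Contextual Precision") -> str:
    """Print a score table and return the name of the best config for `metric`."""
    width = max(len(name) for name in all_scores)
    for name, scores in all_scores.items():
        row = ", ".join(f"{m}={v:.2f}" for m, v in scores.items())
        print(f"{name:<{width}}  {row}")
    return max(all_scores, key=lambda name: all_scores[name][metric])

# Placeholder scores for illustration only.
all_scores = {
    "baseline": {"Contextual Precision": 0.33, "Faithfulness": 0.67},
    "small_chunks": {"Contextual Precision": 0.33, "Faithfulness": 0.67},
    "more_retrieval": {"Contextual Precision": 0.50, "Faithfulness": 0.67},
    "both": {"Contextual Precision": 0.50, "Faithfulness": 0.67},
}
best = print_comparison(all_scores)  # "more_retrieval" (ties broken by dict order)
```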
If all four experiments fail to improve the scores, the problem is not the chunking strategy. Move to Phase 3.
Phase 3: Query-Type Analysis
Instead of fixing the system, question whether it's actually broken.
Insight: Different Queries Suit Different Content
Technical documents typically have four section types:
- Conceptual — intros, theory, literature reviews (many keywords)
- Methodological — procedures, equations, derivations (math-heavy)
- Results — data, calculations, specific numbers (number-heavy)
- Discussion — implications, interpretation (keywords again)
Semantic search works best on keyword-rich sections (conceptual, discussion) because embedding models can match those terms easily. Math-heavy and number-heavy sections have weak semantic signals, so factual queries about specific numbers often fail — even though the answer is there.
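This claim can be felt with a crude proxy: treat bag-of-words cosine similarity as a stand-in for embedding similarity and compare a conceptual query against keyword-rich text versus a factual query against number-heavy text. Purely illustrative; real embedding models behave more subtly, but the gap points the same way:

```python
import math
import re
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Bag-of-words cosine, a crude proxy for embedding similarity."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Keyword-rich (conceptual) chunk vs. number-heavy (factual) chunk.
conceptual_chunk = "Warp bubble formation relies on manipulating compactified extra dimensions."
factual_chunk = "E = mc^2 yields 9.0e44 J, requiring 1.0e28 kg of antimatter."

conceptual_sim = similarity(
    "How does manipulating extra dimensions create a warp bubble?", conceptual_chunk)
factual_sim = similarity(
    "What energy source does the paper propose?", factual_chunk)

print(conceptual_sim > factual_sim)  # True: the conceptual pairing has a far stronger signal
```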
Test Multiple Query Types
queries = [
    {
        "question": "How does manipulating extra dimensions create a warp bubble?",
        "expected": "Extra dimensions are compactified using Kaluza-Klein modes...",
        "type": "conceptual",
    },
    {
        "question": "What role do Kaluza-Klein modes play?",
        "expected": "Kaluza-Klein modes provide the theoretical basis...",
        "type": "conceptual",
    },
    {
        "question": "What energy source does the paper propose?",
        "expected": "Matter/antimatter annihilation requiring 10^28 kg.",
        "type": "factual",
    },
]
for q in queries:
    results = run_rag_test(retriever, q["question"], q["expected"])
    scores = get_scores(results)
    print(f"[{q['type']}] {q['question'][:50]}...")
    print(f"  CP={scores['Contextual Precision']}, F={scores['Faithfulness']}")
Expected pattern: Conceptual queries score near 1.0 while factual queries score much lower. This proves the system works — just not for all query types equally.
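Aggregating scores by query type makes this pattern easy to read off. The per-query scores below are hypothetical:

```python
from collections import defaultdict

def mean_by_type(results: list[dict]) -> dict:
    """Average Contextual Precision per query type."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["type"]].append(r["cp"])
    return {t: round(sum(v) / len(v), 2) for t, v in buckets.items()}

# Hypothetical per-query scores, keyed by the `type` field from the queries list.
results = [
    {"type": "conceptual", "cp": 1.0},
    {"type": "conceptual", "cp": 1.0},
    {"type": "factual", "cp": 0.33},
]

print(mean_by_type(results))  # {'conceptual': 1.0, 'factual': 0.33}
```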
The Diagnostic Cycle
Phase 1: Baseline → "How bad is it?"
Phase 2: Parameter Tuning → "Is it a configuration problem?"
Phase 3: Query-Type Tests → "Is it a query-type problem?"
Phase 4: Cross-Document Tests → "Is it a document-structure problem?"
| Phase | All fail | Some succeed | All succeed |
|---|---|---|---|
| 2 (params) | → Phase 3 | Use best config | Done ✅ |
| 3 (queries) | → Phase 4 | Query-type mismatch | System works ✅ |
| 4 (documents) | Fundamental issue | Structure-dependent | System works ✅ |
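If you automate the diagnostic loop, the decision table can be encoded directly. A sketch using the table's outcomes verbatim:

```python
def next_step(phase: int, outcome: str) -> str:
    """Map a phase's outcome to the next diagnostic action (from the table above)."""
    table = {
        (2, "all_fail"): "go to Phase 3",
        (2, "some_succeed"): "use best config",
        (2, "all_succeed"): "done",
        (3, "all_fail"): "go to Phase 4",
        (3, "some_succeed"): "query-type mismatch",
        (3, "all_succeed"): "system works",
        (4, "all_fail"): "fundamental issue",
        (4, "some_succeed"): "structure-dependent",
        (4, "all_succeed"): "system works",
    }
    return table[(phase, outcome)]

print(next_step(2, "all_fail"))  # go to Phase 3
```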
Phase 4: Cross-Document Testing
Test the same config and queries on documents with different structures:
documents = [
{"path": "paper_equations.pdf", "type": "equation-heavy"},
{"path": "paper_prose.pdf", "type": "prose-integrated"},
]
for doc in documents:
    retriever, _ = create_rag_system(doc["path"], **BASELINE)
    for q in queries:
        results = run_rag_test(retriever, q["question"], q["expected"])
        scores = get_scores(results)
        print(f"[{doc['type']}][{q['type']}] "
              f"CP={scores['Contextual Precision']}, F={scores['Faithfulness']}")
Key insight — document structure determines retrievability:
- Equation-heavy docs (formulas in dedicated sections) — factual queries fail; semantic search can't match sparse numeric content
- Prose-integrated docs (equations in narrative) — factual queries succeed; surrounding text provides rich semantic signal
Fixes for Retrieval Failures
- Hybrid search — combine semantic + keyword (BM25) retrieval
- Metadata tags — label sections as "theoretical" vs "calculation"
- Query routing — route factual vs conceptual queries differently
- Multiple retrievers — use different strategies per query type
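Hybrid search is usually implemented by fusing the two ranked lists; reciprocal rank fusion (RRF) is a common rule, and LangChain's EnsembleRetriever uses a weighted variant of it. A minimal RRF sketch with made-up chunk IDs:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Made-up chunk IDs: semantic search and BM25 disagree; fusion balances them.
semantic_ranking = ["chunk_theory", "chunk_energy", "chunk_intro"]
bm25_ranking = ["chunk_energy", "chunk_table", "chunk_intro"]

fused = rrf_fuse([semantic_ranking, bm25_ranking])
print(fused[0])  # chunk_energy: it ranks high in both lists
```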
Key Lessons
- Test multiple query types — conceptual, factual, and numerical
- Test multiple document types — one document hides failure modes
- Change one variable at a time — isolate each cause
- Low scores ≠ broken system — they reveal operational characteristics
- Multidimensional testing = configuration × query type × document type; single-dimension testing gives incomplete, misleading results