DeepEval Guide — Part 14: Practical RAG Testing
Based on the AI and Testing series by Jeff Nyman (TesterStories, Feb 2026).
Why This Matters
Parts 2 and 12 describe RAG metrics and workflows in theory. This part shows how to build a real RAG pipeline, connect local LLMs via Ollama, and evaluate retrieval quality step by step.
Setup: Ollama as Both Execution and Judge Model
Instead of using OpenAI for the judge, you can run everything locally:
from langchain_ollama import OllamaEmbeddings, ChatOllama
# Execution model — generates answers
execution_model = ChatOllama(model="llama3.2")
For the judge model, use one of two approaches:
Option A: CLI shortcut (simplest, sets judge globally):
deepeval set-ollama llama3.2
Then all metrics automatically use Ollama as judge — no Python code needed.
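For example, once the CLI setting is in place, a metric can be constructed without passing any model. This is a minimal sketch; the metric and the toy test case are purely illustrative:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# No model argument: the metric picks up the Ollama judge configured via the CLI
metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(LLMTestCase(input="What is RAG?", actual_output="Retrieval-augmented generation."))
print(metric.score)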
Option B: Custom wrapper (fine-grained control per metric):
from deepeval.models import DeepEvalBaseLLM
from langchain_ollama import ChatOllama
class OllamaJudge(DeepEvalBaseLLM):
    """Custom DeepEval judge that routes metric prompts to a local Ollama model."""

    def __init__(self, model_name: str = "llama3.2"):
        self._model_name = model_name
        self._model = ChatOllama(model=model_name)

    def load_model(self):
        return self._model

    def generate(self, prompt: str) -> str:
        # Synchronous generation used by DeepEval metrics
        return self._model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        # Async generation used when metrics run concurrently
        res = await self._model.ainvoke(prompt)
        return res.content

    def get_model_name(self) -> str:
        return self._model_name
judge_model = OllamaJudge(model_name="llama3.2")
ChatOllama (from LangChain) is used for RAG answer generation.
The judge model (wrapper or CLI) is used by DeepEval metrics for evaluation.
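Before wiring the wrapper into any metrics, a quick smoke test confirms the local judge responds. This is a sketch only; it assumes llama3.2 is already pulled and Ollama is running:
# Hypothetical smoke test: the judge should return a short completion
print(judge_model.generate("Reply with the single word: ready"))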
Building a RAG System with LangChain
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
def create_rag_system(chunk_size=1000, chunk_overlap=200, k=3):
    """Create a RAG retriever with configurable parameters."""
    # Load the source PDF into LangChain documents
    loader = PyPDFLoader("./my_document.pdf")
    documents = loader.load()

    # Split into overlapping chunks for embedding
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    chunks = text_splitter.split_documents(documents)

    # Embed chunks locally and index them in Chroma
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    vectorstore = Chroma.from_documents(chunks, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": k})
    return retriever, len(chunks)
Parameters to tune later:
- chunk_size — how big each text chunk is (characters)
- chunk_overlap — overlap between chunks (context continuity)
- k — how many chunks to retrieve per query
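To get a feel for how these parameters interact, a small sweep over chunk sizes shows how the chunk count changes. This is a sketch only; the sizes are arbitrary, not recommendations:
# Hypothetical sweep: rebuild the retriever at a few chunk sizes and compare counts
for size in (500, 1000, 2000):
    retriever, n = create_rag_system(chunk_size=size, chunk_overlap=200, k=3)
    print(f"chunk_size={size}: {n} chunks")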
End-to-End Test Function
from deepeval.metrics import ContextualPrecisionMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate
def run_rag_test(retriever, question, expected_output):
    """Retrieve context, generate answer, evaluate with DeepEval."""
    # Step 1: Retrieve relevant chunks
    retrieved_docs = retriever.invoke(question)
    context = [doc.page_content for doc in retrieved_docs]

    # Step 2: Generate answer using execution model
    prompt = f"Based on this context: {context}\n\nQuestion: {question}"
    response = execution_model.invoke(prompt).content

    # Step 3: Create DeepEval test case
    test_case = LLMTestCase(
        input=question,
        actual_output=response,
        expected_output=expected_output,
        retrieval_context=context,
    )

    # Step 4: Evaluate with multiple metrics
    precision = ContextualPrecisionMetric(model=judge_model, verbose_mode=True)
    faithfulness = FaithfulnessMetric(model=judge_model, verbose_mode=True)
    results = evaluate(test_cases=[test_case], metrics=[precision, faithfulness])
    return results
Interpreting Combined Scores
| Contextual Precision | Faithfulness | Meaning |
|---|---|---|
| High | High | Retriever finds right docs, LLM uses them correctly |
| High | Low | Right docs found, but LLM invents extra facts |
| Low | High | Wrong docs found, but LLM stays faithful to them |
| Low | Low | Wrong docs found AND LLM hallucinates |
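One way to read the table programmatically is a small helper that maps a pair of scores onto the same four outcomes. This is a hypothetical sketch; the 0.7 threshold is an assumption, not something from the article:
def diagnose(precision_score: float, faithfulness_score: float, threshold: float = 0.7) -> str:
    """Map a (precision, faithfulness) score pair onto the four outcomes above."""
    retriever_ok = precision_score >= threshold
    generator_ok = faithfulness_score >= threshold
    if retriever_ok and generator_ok:
        return "Retriever finds right docs, LLM uses them correctly"
    if retriever_ok:
        return "Right docs found, but LLM invents extra facts"
    if generator_ok:
        return "Wrong docs found, but LLM stays faithful to them"
    return "Wrong docs found AND LLM hallucinates"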
Running the Full Pipeline
retriever, num_chunks = create_rag_system(
    chunk_size=1000, chunk_overlap=200, k=3,
)
print(f"Document split into {num_chunks} chunks")

results = run_rag_test(
    retriever=retriever,
    question="What energy source does the paper propose?",
    expected_output="The paper proposes matter/antimatter annihilation.",
)

# Access scores
metrics_data = results.test_results[0].metrics_data
for m in metrics_data:
    print(f"{m.name}: {m.score}")
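The same check can also live in a pytest file using DeepEval's assert_test, which fails the test when a metric falls below its threshold. This is a sketch under assumptions: the test name and question are illustrative, and retriever, execution_model, and judge_model are assumed to be set up as above:
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def test_energy_source_answer():
    question = "What energy source does the paper propose?"
    docs = retriever.invoke(question)
    context = [doc.page_content for doc in docs]
    answer = execution_model.invoke(
        f"Based on this context: {context}\n\nQuestion: {question}"
    ).content
    test_case = LLMTestCase(
        input=question,
        actual_output=answer,
        retrieval_context=context,
    )
    assert_test(test_case, [FaithfulnessMetric(model=judge_model)])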
Execution vs Judge — When to Use What
| Role | Class | From | Purpose |
|---|---|---|---|
| Execution | ChatOllama | langchain_ollama | Answer generation (your RAG) |
| Judge | OllamaJudge wrapper | custom (see above) | DeepEval metric evaluation |
| Judge (alt) | CLI deepeval set-ollama | built-in | Global judge (no code) |
You can use different models for each role:
# Small, fast model for generation
execution_model = ChatOllama(model="phi3")
# Larger, smarter model for evaluation
judge_model = OllamaJudge(model_name="llama3.2")
This execution-versus-judge separation is a best practice: the judge should be at least as capable as the model being tested.
Dependencies
uv add deepeval langchain-community langchain-ollama chromadb pypdf
Ollama must be running locally: ollama serve
Pre-pull models:
ollama pull llama3.2
ollama pull nomic-embed-text
Debugging: View Retrieved Chunks
Always check what your retriever actually returns. Scores alone do not tell the full story — you need to see the raw text:
retrieved_docs = retriever.invoke(question)
for i, doc in enumerate(retrieved_docs, 1):
    print(f"--- Chunk {i} ---")
    print(doc.page_content[:200])  # first 200 chars
    print()
If the chunks are about the wrong topic, no metric tuning will help. Fix the retrieval first.
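A crude but useful first pass is to check whether any retrieved chunk even mentions the term the expected answer depends on. This is a hypothetical sketch; the term is specific to the example question above:
# Hypothetical quick check: which chunks mention the expected term?
expected_term = "antimatter"
hits = [i for i, doc in enumerate(retrieved_docs, 1)
        if expected_term.lower() in doc.page_content.lower()]
print(f"Chunks mentioning '{expected_term}': {hits or 'none'}")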
Key Takeaway
This practical workflow shows how to go from document → RAG pipeline → DeepEval evaluation entirely locally with Ollama. No OpenAI API key needed. See Part 15 for how to iterate and diagnose retrieval failures.