DeepEval Guide — Part 12: Testing Workflows
Synthetic Data Generation (Synthesizer)
Manually creating test cases is slow. DeepEval's Synthesizer generates
them automatically from your documents as goldens: inputs (optionally with
expected outputs and context) that you later turn into test cases.
Generate from Documents
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/faq.txt", "docs/policy.pdf"],
    chunk_size=1024,
    chunk_overlap=0,
    max_contexts_per_document=3,
)
Steps: load docs → chunk → group by similarity → generate goldens → evolve.
Chunking parameters:
- chunk_size — size of each chunk in tokens (default 1024)
- chunk_overlap — overlapping tokens between chunks (default 0)
- max_contexts_per_document — max contexts generated per doc (default 3)
Max goldens per document = max_contexts_per_document × max_goldens_per_context.
Best practices:
- Align chunk_size with your retriever settings.
- Use a small overlap (50–100 tokens) for interconnected content.
- Respect natural document sections (chapters, headings).
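To make the chunking parameters concrete, here is a minimal sketch of overlapping chunking and of the per-document golden limit. It is not DeepEval's implementation: chunk_tokens and max_goldens_per_document are hypothetical helpers, whitespace-separated words stand in for real tokens, and the value 2 used for max_goldens_per_context is only an illustrative choice.

```python
def chunk_tokens(tokens, chunk_size=1024, chunk_overlap=0):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def max_goldens_per_document(max_contexts_per_document, max_goldens_per_context):
    """Upper bound on goldens for one document (hypothetical helper)."""
    return max_contexts_per_document * max_goldens_per_context


# Words stand in for tokens here; a real tokenizer would count differently.
words = "a b c d e f g h i j".split()
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
limit = max_goldens_per_document(3, 2)
```

With an overlap of 1, each chunk starts on the last token of the previous one, which is what keeps cross-references between neighbouring chunks intact.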
Generate from Contexts
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_contexts(
    contexts=[
        ["Earth revolves around the Sun.", "Planets are celestial bodies."],
        ["Water freezes at 0°C.", "Chemical formula for water is H2O."],
    ]
)
Evolutions (Increase Complexity)
from deepeval.synthesizer import Synthesizer, Evolution
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/knowledge.txt"],
    num_evolutions=3,
    evolutions={
        Evolution.REASONING: 0.2,
        Evolution.MULTICONTEXT: 0.2,
        Evolution.COMPARATIVE: 0.2,
        Evolution.HYPOTHETICAL: 0.2,
        Evolution.IN_BREADTH: 0.2,
    },
)
| Evolution | What It Does |
|---|---|
| Reasoning | Requires multi-step logical thinking |
| Multicontext | Uses all relevant context information |
| Concretizing | Makes abstract ideas concrete |
| Constrained | Adds conditions/restrictions |
| Comparative | Requires comparison between options |
| Hypothetical | Forces hypothetical scenario reasoning |
| In-breadth | Broadens to related/adjacent topics |
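The evolutions dict reads as a sampling distribution: each of the num_evolutions steps picks one evolution type according to its weight. The sketch below illustrates that idea with the standard library only. How DeepEval samples internally is an assumption here, sample_evolutions is a hypothetical helper, plain strings stand in for the Evolution enum members, and the check that weights sum to 1 is an assumed validation rule.

```python
import random


def sample_evolutions(weights, num_evolutions, rng=None):
    """Pick num_evolutions evolution names, each drawn according to its weight."""
    rng = rng or random.Random()
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        # Assumed validation rule: treat the weights as probabilities.
        raise ValueError(f"evolution weights must sum to 1, got {total}")
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_evolutions)


# Strings stand in for Evolution.REASONING, Evolution.MULTICONTEXT, and so on.
weights = {
    "REASONING": 0.2,
    "MULTICONTEXT": 0.2,
    "COMPARATIVE": 0.2,
    "HYPOTHETICAL": 0.2,
    "IN_BREADTH": 0.2,
}
steps = sample_evolutions(weights, num_evolutions=3, rng=random.Random(0))
```

Because sampling is with replacement, the same evolution type can be applied more than once to a single input.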
Qualifying Synthetic Goldens
Synthesizer auto-filters low-quality data at two stages:
1. Context Filtering: each context is scored on clarity, depth, structure, and relevance (0–1 scale, threshold ≥ 0.5, up to 3 retries). Chunks are grouped into a context when their cosine similarity is ≥ 0.5.
2. Synthetic Input Filtering: each generated input is scored on self-containment and clarity (same 0–1 scale, up to 3 retries).
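Both filtering stages follow the same score, compare, retry shape, and the grouping step relies on cosine similarity. The sketch below shows that shape with the standard library only; it is not DeepEval's code. generate_with_quality_gate and cosine_similarity are hypothetical helpers, "3 retries" is read as one attempt plus up to three more, and falling back to the best-scoring candidate when every attempt stays below the threshold is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def generate_with_quality_gate(generate, score, threshold=0.5, max_retries=3):
    """Call generate() until score(candidate) >= threshold or retries run out.

    Returns (candidate, score). If nothing passes, the best candidate seen
    is returned (assumed fallback behaviour).
    """
    best, best_score = None, -1.0
    for _ in range(1 + max_retries):
        candidate = generate()
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        if candidate_score >= threshold:
            break
    return best, best_score


# Made-up candidates and scores, purely for illustration.
candidates = iter(["vague", "clear question"])
scores = {"vague": 0.3, "clear question": 0.8}
kept, kept_score = generate_with_quality_gate(lambda: next(candidates), scores.get)
```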
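Quality scores and the applied evolutions are stored on each golden's
additional_metadata dict under the keys "context_quality",
"synthetic_input_quality", and "evolutions", for example
goldens[0].additional_metadata["context_quality"].
Use synthesizer.to_pandas() for a DataFrame view.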
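Because the scores live in additional_metadata, you can apply a stricter cut-off of your own after generation. A minimal sketch, assuming only the metadata keys named above: FakeGolden is a hypothetical stand-in for DeepEval's Golden, filter_by_quality is a hypothetical helper, and the scores are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeGolden:
    """Stand-in for a golden: just an input plus its metadata dict."""
    input: str
    additional_metadata: dict = field(default_factory=dict)


def filter_by_quality(goldens, key, minimum):
    """Keep goldens whose metadata score under `key` is at least `minimum`."""
    return [
        g for g in goldens
        if (g.additional_metadata or {}).get(key, 0.0) >= minimum
    ]


goldens = [
    FakeGolden("What is the refund window?",
               {"context_quality": 0.9, "synthetic_input_quality": 0.8}),
    FakeGolden("Tell me more.",
               {"context_quality": 0.6, "synthetic_input_quality": 0.4}),
]
strict = filter_by_quality(goldens, "synthetic_input_quality", 0.7)
```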
Pytest Integration Patterns
Shared Judge Model — conftest.py
In real test suites, create the judge LLM once per session in conftest.py
so it is reused across all test files:
"""tests/conftest.py"""
import pytest
from deepeval.models import GPTModel
from deepeval.models.base_model import DeepEvalBaseLLM
@pytest.fixture(scope="session")
def judge_model() -> DeepEvalBaseLLM:
    """LLM judge shared across all tests."""
    return GPTModel(model="gpt-4o-mini", temperature=0.01)
Inject it into any test by adding judge_model as a parameter:
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy(judge_model):
    metric = AnswerRelevancyMetric(model=judge_model, threshold=0.5, async_mode=False)
    ...
async_mode=False for Synchronous Tests
Always pass async_mode=False when running metrics inside pytest. async_mode
defaults to True, and without the override metrics try to run their own async
event loops, which can conflict with pytest's synchronous runner:
metric = FaithfulnessMetric(model=judge_model, threshold=0.5, async_mode=False)
metric.measure(test_case)
assert metric.success, f"score={metric.score}: {metric.reason}"
measure() vs assert_test()
Both approaches fail the test when a metric fails; choose based on what you need:
# Option A: assert_test — concise, recommended for simple cases
from deepeval import assert_test
assert_test(test_case, [metric])
# Option B: measure() — gives access to score and reason
metric.measure(test_case)
assert metric.success, f"{metric.__class__.__name__} score={metric.score}: {metric.reason}"
Use measure() when you need to collect all failures before asserting
(see Failure Aggregation Pattern in Part 20), log scores, or assert
not metric.success in negative/adversarial tests.
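Collecting failures before asserting can be sketched without any LLM calls. The block below only illustrates the control flow and may differ from the pattern in Part 20: StubMetric is a hypothetical stand-in that mimics the measure(), score, success, and reason members used above, collect_failures is a hypothetical helper, and the scores are made up.

```python
class StubMetric:
    """Stand-in for a metric: measure() sets score, success, and reason."""

    def __init__(self, name, fixed_score, threshold=0.5):
        self.name = name
        self.threshold = threshold
        self._fixed_score = fixed_score
        self.score = None
        self.success = None
        self.reason = None

    def measure(self, test_case):
        self.score = self._fixed_score
        self.success = self.score >= self.threshold
        self.reason = "stubbed result"
        return self.score


def collect_failures(test_case, metrics):
    """Run every metric, then report all failures together instead of stopping at the first."""
    failures = []
    for metric in metrics:
        metric.measure(test_case)
        if not metric.success:
            failures.append(f"{metric.name} score={metric.score}: {metric.reason}")
    return failures


metrics = [
    StubMetric("relevancy", 0.9),
    StubMetric("faithfulness", 0.3),
    StubMetric("precision", 0.2),
]
failures = collect_failures("placeholder test case", metrics)
```

In a real test the last line would be followed by something like `assert not failures, "\n".join(failures)`, so a single run reports every failing metric.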
Negative Tests — assert not metric.success
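Use this pattern for tests that verify a metric detects a problem. With
HallucinationMetric a higher score means more hallucination, so the metric
should fail (metric.success is False) on the output below: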
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_hallucination_detected(judge_model):
    """Metric must fail on hallucinated output."""
    test_case = LLMTestCase(
        input="Tell me about the Nile.",
        actual_output="The Nile is in South America and is 10000 km long.",
        context=["The Nile is the longest river in Africa at 6650 km."],
    )
    metric = HallucinationMetric(model=judge_model, threshold=0.5, async_mode=False)
    metric.measure(test_case)
    assert not metric.success, f"Expected failure but score={metric.score}"
Rate Limit Protection — inter-test delay
To avoid hitting OpenAI rate limits in large test suites, add an
autouse fixture that pauses between tests:
import time
import pytest
@pytest.fixture(autouse=True)
def _inter_test_delay():
    yield
    time.sleep(1)  # pause for 1 second after each test
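A fixed sleep also slows down tests that already took longer than the delay. One possible variation, sketched here as an assumption rather than a DeepEval feature, is to enforce a minimum interval between test starts and sleep only for whatever time remains. MinInterval is a hypothetical helper; its clock and sleep arguments are injectable so the logic can be checked without real waiting.

```python
import time


class MinInterval:
    """Ensure at least `seconds` pass between consecutive wait() calls."""

    def __init__(self, seconds, clock=time.monotonic, sleep=time.sleep):
        self.seconds = seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.seconds - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)  # only sleep for the time still owed
                now = self._clock()
        self._last = now


# Simulated clock so this example runs instantly.
fake_time = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_time[0] += seconds


throttle = MinInterval(1.0, clock=lambda: fake_time[0], sleep=fake_sleep)
throttle.wait()        # first call never sleeps
fake_time[0] += 0.25   # a fast test finished after 0.25 s
throttle.wait()        # sleeps for the remaining 0.75 s
fake_time[0] += 2.0    # a slow test took 2 s
throttle.wait()        # no sleep needed
```

In a test suite, a module-level MinInterval(1.0) could be called at the start of the autouse fixture in place of the unconditional time.sleep(1).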
Regression Testing in CI/CD
Test File
"""test_rag.py"""
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset(test_cases=[
    # FaithfulnessMetric also needs retrieval_context on each test case
    LLMTestCase(input="...", actual_output="...", retrieval_context=["..."]),
])

@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_rag(test_case: LLMTestCase):
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.7),
    ])
Run: deepeval test run test_rag.py
GitHub Actions Workflow
name: LLM Regression Test
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install deepeval
      - run: deepeval test run test_rag.py
        env: { OPENAI_API_KEY: "${{ secrets.OPENAI_API_KEY }}" }
Optimizing Hyperparameters
Pass hyperparameters to evaluate() to track configs across runs:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
for model in ["gpt-4o", "gpt-3.5-turbo"]:
    test_cases = build_test_cases(model)  # your own helper: runs the app with this model
    evaluate(
        test_cases=test_cases,
        metrics=[AnswerRelevancyMetric()],
        hyperparameters={"model": model, "prompt": "v2"},
    )
For CI/CD: @deepeval.log_hyperparameters(model="gpt-4o", prompt="v2").
LLM Tracing and Observability
@observe Decorator
Types: tool, llm, agent, or custom span.
from deepeval.tracing import observe
@observe(type="tool")
def search_flights(origin, dest, date):
    return [{"id": "FL123", "price": 450}]

@observe(type="agent")
def travel_agent(user_input):
    flights = search_flights("NYC", "Paris", "2026-03-15")
    return f"Found flights: {flights}"
evals_iterator — end-to-end agent evals
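from deepeval.metrics import TaskCompletionMetric
# dataset: an EvaluationDataset that already holds goldens
for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    travel_agent(golden.input)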
| Eval level | Where metrics are attached |
|---|---|
| End-to-end | evals_iterator(metrics=[...]) |
| Component | @observe(metrics=[...]) |
Datasets: Cloud Push/Pull
Store goldens in the cloud and reuse them across test runs:
from deepeval.dataset import EvaluationDataset, Golden
goldens = [Golden(input="What is RAG?", expected_output="...")]
dataset = EvaluationDataset(goldens=goldens)
dataset.push(alias="RAG QA Dataset") # upload to Confident AI
# Later, anywhere:
dataset = EvaluationDataset()
dataset.pull(alias="RAG QA Dataset") # download from cloud
Create test cases on-the-fly from goldens by calling your LLM:
from deepeval.test_case import LLMTestCase
test_cases = []
for golden in dataset.goldens:
    # my_agent is your own application object
    answer, docs = my_agent.answer(golden.input)
    test_cases.append(LLMTestCase(
        input=golden.input,
        actual_output=str(answer),
        expected_output=golden.expected_output,
        retrieval_context=docs,
    ))
assert_test with Golden + observed_callback
Combine golden input → LLM call → evaluation in one step for CI/CD:
"""test_agent.py"""
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.pull(alias="RAG QA Dataset")
@pytest.mark.parametrize("golden", dataset.goldens)
def test_agent(golden):
    assert_test(golden=golden, observed_callback=my_agent.answer)
assert_test passes golden.input to observed_callback, which calls your LLM
app; the output is then evaluated with the metrics attached via @observe decorators.
Production Tracing Patterns
update_current_span — attach data to spans
from deepeval.tracing import observe, update_current_span
from deepeval.metrics import ContextualRelevancyMetric

@observe(metrics=[ContextualRelevancyMetric()], name="Retriever")
def retrieve(query):
    # store is your own vector store
    context = [d.page_content for d in store.similarity_search(query, k=3)]
    update_current_span(input=query, retrieval_context=context)
    return context
update_current_trace — group turns into conversations
from deepeval.tracing import observe, update_current_trace
@observe(type="agent")
def chat(session_id, user_input):
    resp = agent.invoke({"input": user_input})
    update_current_trace(thread_id=session_id, input=user_input, output=resp["output"])
    return resp["output"]
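Conceptually, tagging every trace with the same thread_id lets the turns be reassembled into one conversation in order. The sketch below shows only that grouping idea with plain dicts; how Confident AI stores and groups traces is not covered here, group_turns_by_thread is a hypothetical helper, and the turn records are made up.

```python
from collections import defaultdict


def group_turns_by_thread(turns):
    """Group (input, output) pairs by thread_id, preserving arrival order."""
    threads = defaultdict(list)
    for turn in turns:
        threads[turn["thread_id"]].append((turn["input"], turn["output"]))
    return dict(threads)


# Made-up turns from two interleaved sessions.
turns = [
    {"thread_id": "session-abc", "input": "Hi", "output": "Hello!"},
    {"thread_id": "session-xyz", "input": "Refund?", "output": "Within 30 days."},
    {"thread_id": "session-abc", "input": "Book a flight", "output": "Where to?"},
]
conversations = group_turns_by_thread(turns)
```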
evaluate_thread — online eval by thread ID
from deepeval.tracing import evaluate_thread
evaluate_thread(thread_id="session-abc", metric_collection="Prod Metrics")
Confident AI Platform Setup
Optional cloud platform for centralizing evaluation results:
deepeval login --save=dotenv:.env
This adds CONFIDENT_API_KEY and API_KEY to .env. DeepEval auto-reads
.env — no code changes needed. Results appear in the Confident AI dashboard
(test runs, per-case drilldowns, issue highlighting).
Free tier: one default project and a limited number of test runs (runs can be deleted).
To stop pushing results, remove or comment out the API keys from .env.