DeepEval Guide — Part 16: Datasets and Goldens
Golden vs Test Case
A Golden is a template — it holds input data and expected results
but is missing the dynamic parts (actual_output, retrieval_context,
tools_called) that your LLM produces at evaluation time.
A Test Case is a fully-formed evaluation unit with all fields populated.
Golden (what you expect) + LLM output = Test Case (what you evaluate)
Goldens are the preferred way to build datasets because they let you re-evaluate the same inputs across different LLM versions.
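The relationship above can be sketched in plain Python. The dataclasses below are simplified stand-ins, not DeepEval's own Golden and LLMTestCase, and fake_llm is a hypothetical placeholder for your application. The sketch only illustrates how one set of goldens can produce fresh test cases for each LLM version.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchGolden:
    """Stand-in for a golden: static input and expectations only."""
    input: str
    expected_output: Optional[str] = None


@dataclass
class SketchTestCase:
    """Stand-in for a test case: a golden plus the output produced at run time."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None


def to_test_cases(
    goldens: List[SketchGolden], llm: Callable[[str], str]
) -> List[SketchTestCase]:
    """Run each golden's input through the LLM and attach the result."""
    return [
        SketchTestCase(
            input=g.input,
            actual_output=llm(g.input),
            expected_output=g.expected_output,
        )
        for g in goldens
    ]


def fake_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM call."""
    return f"answer to: {prompt}"


goldens = [SketchGolden(input="What is RAG?", expected_output="Retrieval-Augmented...")]

# The same goldens are reused for two different "LLM versions".
cases_v1 = to_test_cases(goldens, fake_llm)
cases_v2 = to_test_cases(goldens, lambda p: f"v2 answer to: {p}")
```

Because the goldens never store actual_output, nothing has to be cleaned up between runs: each LLM version simply yields its own list of test cases.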
Golden Data Model
from deepeval.dataset import Golden
from deepeval.test_case import ToolCall  # needed for expected_tools below
golden = Golden(
input="What is RAG?", # required
expected_output="Retrieval-Augmented...", # optional
context=["RAG combines retrieval..."], # optional
expected_tools=[ToolCall(...)], # optional
additional_metadata={"source": "faq"}, # optional
comments="Edge case — ambiguous query", # optional
custom_column_key_values={"priority": "high"}, # optional
)
For multi-turn conversations:
from deepeval.dataset import ConversationalGolden
golden = ConversationalGolden(
scenario="Frustrated user asking for a refund.", # required
expected_outcome="Redirected to a human agent.", # optional
user_description="Impatient, uses short sentences", # optional
context=["Refund policy: 30 days..."], # optional
)
Single-Turn vs Multi-Turn Datasets
Datasets are stateful — once set to single-turn or multi-turn, the type cannot be changed:
from deepeval.dataset import EvaluationDataset, Golden, ConversationalGolden
# Single-turn dataset
st_dataset = EvaluationDataset(goldens=[Golden(input="What is RAG?")])
print(st_dataset._multi_turn) # False
# Multi-turn dataset
mt_dataset = EvaluationDataset(
goldens=[ConversationalGolden(scenario="User asks for refund.")]
)
print(mt_dataset._multi_turn) # True
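To make the "stateful" behaviour concrete, here is a minimal sketch of a container that locks its type when the first golden arrives. SketchDataset is a hypothetical stand-in using plain dicts; DeepEval's actual implementation, and the exception it raises on a mismatch, may well differ.

```python
from typing import List, Optional


class SketchDataset:
    """Illustrative stand-in: locks to single- or multi-turn on the first golden."""

    def __init__(self) -> None:
        # None means "undecided" until the first golden is added.
        self.multi_turn: Optional[bool] = None
        self.goldens: List[dict] = []

    def add_golden(self, golden: dict) -> None:
        # In this sketch, a "scenario" key marks a multi-turn golden.
        is_multi_turn = "scenario" in golden
        if self.multi_turn is None:
            self.multi_turn = is_multi_turn
        elif self.multi_turn != is_multi_turn:
            raise TypeError("Cannot mix single-turn and multi-turn goldens.")
        self.goldens.append(golden)


dataset = SketchDataset()
dataset.add_golden({"input": "What is RAG?"})  # locks the dataset to single-turn

try:
    dataset.add_golden({"scenario": "User asks for refund."})
    mixed_allowed = True
except TypeError:
    mixed_allowed = False
```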
Adding Goldens and Test Cases
After initialization, use add_golden() and add_test_case():
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import LLMTestCase
dataset = EvaluationDataset(goldens=[Golden(input="What is RAG?")])
# Add more goldens
dataset.add_golden(Golden(input="Explain embeddings."))
# Add a test case (with actual LLM output)
dataset.add_test_case(LLMTestCase(
input="What is RAG?",
actual_output="RAG is Retrieval-Augmented Generation.",
))
Local Save/Load: JSON
from deepeval.dataset import EvaluationDataset, Golden
dataset = EvaluationDataset(goldens=[
Golden(input="What is RAG?", expected_output="..."),
])
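# Save (file_name is set explicitly so the load path below matches;
# without it, the file name is generated automatically)
dataset.save_as(file_type="json", directory="./datasets", file_name="dataset")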
# Load
dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
file_path="./datasets/dataset.json",
)
Local Save/Load: CSV
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
# Load goldens from CSV
dataset.add_goldens_from_csv_file(file_path="goldens.csv")
# Load test cases from CSV (map custom column names)
dataset.add_test_cases_from_csv_file(
file_path="test_data.csv",
input_col_name="query",
actual_output_col_name="actual_output",
expected_output_col_name="expected_output",
context_col_name="context",
context_col_delimiter=";",
)
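The context_col_delimiter argument exists because a CSV cell holds a single string, while context is a list of strings. The standard-library sketch below shows the kind of column mapping and splitting involved; the CSV contents are hypothetical, and this is not DeepEval's own parsing code.

```python
import csv
import io

# Hypothetical CSV contents using the column names from the call above.
csv_text = (
    "query,actual_output,expected_output,context\n"
    "What is RAG?,RAG is Retrieval-Augmented Generation.,Retrieval-Augmented...,"
    "RAG combines retrieval;RAG grounds answers in documents\n"
)

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    rows.append({
        # Map the custom "query" column onto the test case's input field.
        "input": row["query"],
        "actual_output": row["actual_output"],
        "expected_output": row["expected_output"],
        # Split the single cell into a list of context strings.
        "context": [c.strip() for c in row["context"].split(";") if c.strip()],
    })

print(rows[0]["context"])
```

Pick a delimiter that never appears inside a context passage, otherwise one passage will be split into several.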
Synthesizer: All 4 Generation Methods
Part 12 covers from_docs and from_contexts. Two more methods:
Generate from Scratch (no documents needed)
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_scratch(
subject="customer support for an e-commerce store",
num_goldens=20,
)
Useful when you have no knowledge base yet — just describe the domain.
Generate from Goldens (augment existing data)
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import Golden
seed_goldens = [
Golden(input="How do I return an item?"),
Golden(input="Where is my order?"),
]
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_goldens(
goldens=seed_goldens,
max_goldens_per_golden=25, # 2 seeds x 25 = up to 50 new goldens
)
Takes existing goldens and creates variations — different phrasings, edge cases, complexity levels.
All 4 Methods Summary
| Method | Input | Best For |
|---|---|---|
| from_docs() | Document files (PDF, TXT, DOCX) | RAG testing |
| from_contexts() | Pre-extracted context lists | Controlled evaluation |
| from_scratch() | Subject description string | No existing data |
| from_goldens() | Existing goldens list | Augmenting small datasets |
Dataset Curation Best Practices
- Diverse coverage: include easy, medium, and hard inputs, plus edge cases
- Focused scope: each test case targets one evaluation goal
- Clear objectives: align datasets with specific metrics
- Start simple: begin with prompts you already check manually
- Version datasets: use push(alias="v1"), then push(alias="v2")
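The "diverse coverage" point can be checked mechanically if each golden carries a difficulty tag. The sketch below uses plain dicts rather than DeepEval objects, and the "difficulty" key inside additional_metadata is an assumption made for illustration, not a DeepEval convention.

```python
from collections import Counter

# Hypothetical goldens as plain dicts; the "difficulty" metadata key is an
# assumption made for this sketch.
goldens = [
    {"input": "What is RAG?", "additional_metadata": {"difficulty": "easy"}},
    {"input": "Compare RAG with fine-tuning.", "additional_metadata": {"difficulty": "medium"}},
    {"input": "", "additional_metadata": {"difficulty": "edge"}},
    {"input": "Explain embeddings.", "additional_metadata": {}},
]

REQUIRED_LEVELS = ("easy", "medium", "hard", "edge")

# Count goldens per difficulty level; untagged goldens get their own bucket.
counts = Counter(
    g["additional_metadata"].get("difficulty", "untagged") for g in goldens
)
missing_levels = [level for level in REQUIRED_LEVELS if counts[level] == 0]

print("Counts per level:", dict(counts))
print("Levels with no goldens:", missing_levels)
```

A check like this could run before each push so that a new dataset version is not published with an entire difficulty level missing.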