DeepEval Testing Guide for QA Engineers
Part 1: Introduction
What is DeepEval?
DeepEval is a Python testing framework for LLM (Large Language Model) systems. It works like pytest but for AI. You write test cases, pick metrics, and DeepEval checks if your LLM output is good enough.
Version used in this guide: deepeval==3.8.4
How Does LLM Testing Work?
In traditional testing, you compare output to expected result: assert 2 + 2 == 4.
In LLM testing, outputs are text. You cannot use exact match because the same question can have many correct answers. Instead, DeepEval uses LLM-as-Judge: another LLM evaluates if the output is good.
LLM-as-Judge Pattern
Your LLM (generates answer) --> Judge LLM (evaluates quality) --> Score (0.0 to 1.0)
- Your LLM produces the
actual_output - Judge LLM (like GPT-4) reads the output and scores it
- Score is between 0.0 (bad) and 1.0 (perfect)
- Threshold is the minimum score to pass (default: 0.5)
Key Concepts
| Concept | What It Means |
|---|---|
input |
The question or prompt sent to your LLM |
actual_output |
The answer your LLM returned |
expected_output |
The correct answer (ground truth) you wrote |
context |
Facts your LLM should know (provided by you) |
retrieval_context |
Documents your RAG system found |
tools_called |
Tools your AI agent used |
threshold |
Minimum score to pass the test (0.0-1.0) |
Test Case Types
DeepEval has two test case types:
- LLMTestCase — for single question-answer tests
- ConversationalTestCase — for multi-turn chat conversations
Setup
uv init my-llm-tests
cd my-llm-tests
uv add deepeval pytest
Set your OpenAI API key (DeepEval uses it for the judge LLM):
export OPENAI_API_KEY="sk-your-key-here"
How to Run Tests
# Run with pytest
uv run pytest tests/ -v
# Run with deepeval CLI (adds dashboard reporting)
uv run deepeval test run tests/ -v
Basic Test Example
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
def test_basic_llm_output():
"""Check: does the answer match the question?"""
test_case = LLMTestCase(
input="What is Python?",
actual_output="Python is a programming language."
)
metric = AnswerRelevancyMetric(threshold=0.5)
assert_test(test_case, [metric])
What happens:
1. DeepEval sends input + actual_output to the judge LLM
2. Judge LLM scores how relevant the answer is (0.0 to 1.0)
3. If score >= 0.5 (threshold), test passes
4. If score < 0.5, test fails with a reason
All 47 Metrics by Category (as of v3.8)
Note: DeepEval continues to add metrics. The official docs list "50+ metrics" — the count below reflects the metrics covered in this guide.
| Category | Metrics |
|---|---|
| RAG (5) | AnswerRelevancy, Faithfulness, ContextualPrecision, ContextualRecall, ContextualRelevancy |
| Quality (5) | Hallucination, Toxicity, Bias, Summarization, GEval |
| Red Team (4) | PIILeakage, Misuse, NonAdvice, RoleViolation |
| Agent (8) | ToolCorrectness, TaskCompletion, GoalAccuracy, ToolUse, PlanQuality, PlanAdherence, StepEfficiency, ArgumentCorrectness |
| Chatbot (4) | KnowledgeRetention, ConversationCompleteness, RoleAdherence, TopicAdherence |
| MCP (3) | MCPUse, MCPTaskCompletion, MultiTurnMCPUse |
| Deterministic (3) | ExactMatch, JsonCorrectness, PatternMatch |
| Advanced (5) | PromptAlignment, ConversationalGEval, ArenaGEval, DAGMetric, ConversationalDAGMetric |
| Multimodal (5) | ImageCoherence, ImageEditing, ImageHelpfulness, ImageReference, TextToImage |
| Turn-Level (5) | TurnRelevancy, TurnFaithfulness, TurnContextualPrecision, TurnContextualRecall, TurnContextualRelevancy |
Guide Structure
| Part | Topic |
|---|---|
| 1 | Introduction (this file) |
| 2 | RAG Metrics + RAG Triad |
| 3 | LLM Quality + Safety Metrics |
| 4 | AI Agent Metrics |
| 5 | Chatbot Metrics + ConversationSimulator |
| 6 | MCP Metrics |
| 7 | Extra Metrics (Deterministic, DAG, Arena, Turn-Level) |
| 8 | Multimodal (Image) Metrics |
| 9 | Red Teaming (RedTeamer, scanning, vulnerabilities) |
| 10 | Custom LLMs and Embedding Models |
| 11 | Building Custom Metrics + Answer Correctness |
| 12 | Workflows (CI/CD, Synthesizer, Datasets, Tracing, Prod Evals) |
| 13 | LLM Benchmarks (MMLU, HellaSwag, DROP, HumanEval, etc.) |
| 14 | Practical RAG Testing (LangChain + Ollama + DeepEval) |
| 15 | RAG Diagnostic Testing (parameter tuning, query-type analysis) |
| 16 | Datasets and Goldens (data models, local save/load, curation) |
| 17 | Prompts & Prompt Optimization (GEPA, MIPROv2, COPRO) |
| 18 | Evaluation Configs, Flags & Reference (RAGAS, env vars) |
| 19 | Reference Appendix (DeepTeam vulns, Arena, troubleshooting) |
| 20 | End-to-End Complex Tests (conversational + RAG + tools in one test) |