DeepEval — LLM Testing Guide
A complete reference for testing LLM systems with DeepEval (v3.8+).
Covers all metric categories, custom models and metrics, automated workflows,
benchmarks, and production evaluation patterns.
Contents
| File | Part | Topics |
|---|---|---|
| Introduction | 1 | What is DeepEval, LLM-as-Judge, key concepts, setup, all 47+ metrics by category |
Metrics
| File | Part | Topics |
|---|---|---|
| RAG Metrics | 2 | AnswerRelevancy, Faithfulness, ContextualPrecision, ContextualRecall, ContextualRelevancy, RAG Triad |
| LLM Quality Metrics | 3 | Hallucination, Toxicity, Bias, Summarization, GEval, PII Leakage, Misuse, NonAdvice, RoleViolation |
| AI Agent Metrics | 4 | ToolCorrectness, TaskCompletion, GoalAccuracy, ToolUse, PlanQuality, PlanAdherence, StepEfficiency, ArgumentCorrectness |
| Chatbot Metrics | 5 | KnowledgeRetention, ConversationCompleteness, RoleAdherence, TopicAdherence, ConversationSimulator |
| MCP Metrics | 6 | MCPUse, MCPTaskCompletion, MultiTurnMCPUse, MCP test case fields |
| Extra Metrics | 7 | ExactMatch, JsonCorrectness, PatternMatch, PromptAlignment, ConversationalGEval, ArenaGEval, DAGMetric, Turn-Level RAG |
| Multimodal Metrics | 8 | ImageCoherence, ImageEditing, ImageHelpfulness, ImageReference, TextToImage, MLLMImage |
Testing & Customisation
| File | Part | Topics |
|---|---|---|
| Red Teaming | 9 | RedTeamer, vulnerability scanning, attack enhancements, DeepTeam |
| Custom LLMs | 10 | DeepEvalBaseLLM, Azure/Ollama/local models, JSON confinement, custom embeddings |
| Custom Metrics | 11 | BaseMetric interface, ROUGE metric, composite metrics, Answer Correctness, threshold calibration |
| Testing Workflows | 12 | Synthesizer, CI/CD regression, hyperparameter tracking, LLM tracing, Confident AI |
| LLM Benchmarks | 13 | MMLU, HellaSwag, BIG-Bench Hard, HumanEval, TruthfulQA, GSM8K and 11 more |
| Conversational RAG Evaluation | 14b | History management, controlled vs live RAG, failure mode design, metric divergence |
Practical Guides
| File | Part | Topics |
|---|---|---|
| Practical RAG Testing | 14 | LangChain + Ollama + Chroma pipeline, execution vs judge separation, full E2E walkthrough |
| RAG Diagnostic Testing | 15 | Baseline → parameter tuning → query-type analysis → cross-document testing |
| Datasets & Goldens | 16 | Golden vs TestCase, CSV/JSON save-load, all 4 Synthesizer methods, curation best practices |
| Prompts & Optimization | 17 | Prompt class, GEPA, MIPROv2, COPRO algorithms, prompt versioning |
| Configs, Flags & Reference | 18 | AsyncConfig, DisplayConfig, ErrorConfig, CacheConfig, CLI flags, env vars, RAGAS |
| Reference Appendix | 19 | DeepTeam migration, vulnerability classes, Arena test cases, troubleshooting |
| E2E Complex Tests | 20 | ConversationalTestCase with RAG + tools + multiple metrics, scenario checklist |
Quick reference
- QA Methodology — QA fundamentals, test design, defect management, Agile QA
- Test Design Techniques — black-box, white-box, experience-based methods
- Testing Pyramid — unit, integration, E2E strategy in practice
- Test Automation Framework — framework design, config, CI integration
- CI/CD Approaches — quality gates, pipeline testing strategy
See also