DeepEval — LLM Testing Guide

A complete reference for testing LLM systems with DeepEval (v3.8+). Covers all metric categories, custom models and metrics, automated workflows, benchmarks, and production evaluation patterns.


Contents

| File | Part | Topics |
|---|---|---|
| Introduction | 1 | What is DeepEval, LLM-as-Judge, key concepts, setup, all 47+ metrics by category |

Metrics

| File | Part | Topics |
|---|---|---|
| RAG Metrics | 2 | AnswerRelevancy, Faithfulness, ContextualPrecision, ContextualRecall, ContextualRelevancy, RAG Triad |
| LLM Quality Metrics | 3 | Hallucination, Toxicity, Bias, Summarization, GEval, PII Leakage, Misuse, NonAdvice, RoleViolation |
| AI Agent Metrics | 4 | ToolCorrectness, TaskCompletion, GoalAccuracy, ToolUse, PlanQuality, PlanAdherence, StepEfficiency, ArgumentCorrectness |
| Chatbot Metrics | 5 | KnowledgeRetention, ConversationCompleteness, RoleAdherence, TopicAdherence, ConversationSimulator |
| MCP Metrics | 6 | MCPUse, MCPTaskCompletion, MultiTurnMCPUse, MCP test case fields |
| Extra Metrics | 7 | ExactMatch, JsonCorrectness, PatternMatch, PromptAlignment, ConversationalGEval, ArenaGEval, DAGMetric, Turn-Level RAG |
| Multimodal Metrics | 8 | ImageCoherence, ImageEditing, ImageHelpfulness, ImageReference, TextToImage, MLLMImage |

Testing & Customisation

| File | Part | Topics |
|---|---|---|
| Red Teaming | 9 | RedTeamer, vulnerability scanning, attack enhancements, DeepTeam |
| Custom LLMs | 10 | DeepEvalBaseLLM, Azure/Ollama/local models, JSON confinement, custom embeddings |
| Custom Metrics | 11 | BaseMetric interface, ROUGE metric, composite metrics, Answer Correctness, threshold calibration |
| Testing Workflows | 12 | Synthesizer, CI/CD regression, hyperparameter tracking, LLM tracing, Confident AI |
| LLM Benchmarks | 13 | MMLU, HellaSwag, BIG-Bench Hard, HumanEval, TruthfulQA, GSM8K and 11 more |
| Conversational RAG Evaluation | 14b | History management, controlled vs live RAG, failure mode design, metric divergence |

Practical Guides

| File | Part | Topics |
|---|---|---|
| Practical RAG Testing | 14 | LangChain + Ollama + Chroma pipeline, execution vs judge separation, full E2E walkthrough |
| RAG Diagnostic Testing | 15 | Baseline → parameter tuning → query-type analysis → cross-document testing |
| Datasets & Goldens | 16 | Golden vs TestCase, CSV/JSON save-load, all 4 Synthesizer methods, curation best practices |
| Prompts & Optimization | 17 | Prompt class, GEPA, MIPROv2, COPRO algorithms, prompt versioning |
| Configs, Flags & Reference | 18 | AsyncConfig, DisplayConfig, ErrorConfig, CacheConfig, CLI flags, env vars, RAGAS |
| Reference Appendix | 19 | DeepTeam migration, vulnerability classes, Arena test cases, troubleshooting |
| E2E Complex Tests | 20 | ConversationalTestCase with RAG + tools + multiple metrics, scenario checklist |

Quick reference

| Task | Where to start |
|---|---|
| Test a RAG pipeline | Part 2, Part 14 |
| Test an AI agent | Part 4, Part 20 |
| Test a chatbot | Part 5, Part 20 |
| Run a security scan | Part 9 |
| Use a local/custom LLM | Part 10, Part 14 |
| Build a custom metric | Part 11 |
| Set up CI/CD | Part 12 |
| Benchmark an LLM | Part 13 |
| Fix low RAG scores | Part 15 |
| Evaluate multi-turn RAG | Part 5, Part 14b |


See also