Digital Garden

DeepEval — LLM Testing Guide

serhiismetanskyi/digital-garden

A complete reference for testing LLM systems with DeepEval (v3.8+). It covers all metric categories, custom models and metrics, automated workflows, benchmarks, and production evaluation patterns.

Contents

File	Part	Topics
Introduction	1	What is DeepEval, LLM-as-Judge, key concepts, setup, all 47+ metrics by category

Metrics

File	Part	Topics
RAG Metrics	2	AnswerRelevancy, Faithfulness, ContextualPrecision, ContextualRecall, ContextualRelevancy, RAG Triad
LLM Quality Metrics	3	Hallucination, Toxicity, Bias, Summarization, GEval, PII Leakage, Misuse, NonAdvice, RoleViolation
AI Agent Metrics	4	ToolCorrectness, TaskCompletion, GoalAccuracy, ToolUse, PlanQuality, PlanAdherence, StepEfficiency, ArgumentCorrectness
Chatbot Metrics	5	KnowledgeRetention, ConversationCompleteness, RoleAdherence, TopicAdherence, ConversationSimulator
MCP Metrics	6	MCPUse, MCPTaskCompletion, MultiTurnMCPUse, MCP test case fields
Extra Metrics	7	ExactMatch, JsonCorrectness, PatternMatch, PromptAlignment, ConversationalGEval, ArenaGEval, DAGMetric, Turn-Level RAG
Multimodal Metrics	8	ImageCoherence, ImageEditing, ImageHelpfulness, ImageReference, TextToImage, MLLMImage

Testing & Customisation

File	Part	Topics
Red Teaming	9	RedTeamer, vulnerability scanning, attack enhancements, DeepTeam
Custom LLMs	10	DeepEvalBaseLLM, Azure/Ollama/local models, JSON confinement, custom embeddings
Custom Metrics	11	BaseMetric interface, ROUGE metric, composite metrics, Answer Correctness, threshold calibration
Testing Workflows	12	Synthesizer, CI/CD regression, hyperparameter tracking, LLM tracing, Confident AI
LLM Benchmarks	13	MMLU, HellaSwag, BIG-Bench Hard, HumanEval, TruthfulQA, GSM8K and 11 more
Conversational RAG Evaluation	14b	History management, controlled vs live RAG, failure mode design, metric divergence

Practical Guides

File	Part	Topics
Practical RAG Testing	14	LangChain + Ollama + Chroma pipeline, execution vs judge separation, full E2E walkthrough
RAG Diagnostic Testing	15	Baseline → parameter tuning → query-type analysis → cross-document testing
Datasets & Goldens	16	Golden vs TestCase, CSV/JSON save-load, all 4 Synthesizer methods, curation best practices
Prompts & Optimization	17	Prompt class, GEPA, MIPROv2, COPRO algorithms, prompt versioning
Configs, Flags & Reference	18	AsyncConfig, DisplayConfig, ErrorConfig, CacheConfig, CLI flags, env vars, RAGAS
Reference Appendix	19	DeepTeam migration, vulnerability classes, Arena test cases, troubleshooting
E2E Complex Tests	20	ConversationalTestCase with RAG + tools + multiple metrics, scenario checklist

Quick reference

Task	Where to start
Test a RAG pipeline	Part 2 → Part 14
Test an AI agent	Part 4 → Part 20
Test a chatbot	Part 5 → Part 20
Run security scan	Part 9
Use a local/custom LLM	Part 10 → Part 14
Build a custom metric	Part 11
Set up CI/CD	Part 12
Benchmark an LLM	Part 13
Fix low RAG scores	Part 15
Evaluate multi-turn RAG	Part 5 → Part 14b

See also

QA Methodology: QA fundamentals, test design, defect management, Agile QA
Test Design Techniques: black-box, white-box, experience-based methods
Testing Pyramid: unit, integration, E2E strategy in practice
Test Automation Framework: framework design, config, CI integration
CI/CD Approaches: quality gates, pipeline testing strategy