DeepEval Guide — Part 18: Evaluation Configs, Flags & Reference
Configs for evaluate()
All configs are optional parameters passed to evaluate() or
PromptOptimizer. They control concurrency, display, error handling,
and caching.
AsyncConfig
from deepeval.evaluate import AsyncConfig
from deepeval import evaluate
evaluate(
    test_cases=[...],
    metrics=[...],
    async_config=AsyncConfig(
        run_async=True,     # concurrent evaluation (default True)
        max_concurrent=20,  # max parallel test cases (default 20)
        throttle_value=0,   # seconds between test cases (default 0)
    ),
)
Lower max_concurrent and increase throttle_value to avoid rate limits.
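As a rough illustration of how a throttle_value might be chosen, the sketch below derives a spacing between test cases from a provider's requests-per-minute limit. The helper throttle_for_rate_limit and both of its parameters are hypothetical (they are not part of DeepEval), and the estimate assumes every test case issues about the same number of judge calls.

```python
import math


def throttle_for_rate_limit(requests_per_minute: int, calls_per_test_case: int) -> int:
    """Estimate whole seconds to wait between test cases (hypothetical helper).

    Assumes each test case makes roughly `calls_per_test_case` LLM calls,
    so the budget allows requests_per_minute / calls_per_test_case cases per minute.
    """
    if requests_per_minute <= 0 or calls_per_test_case <= 0:
        raise ValueError("both arguments must be positive")
    cases_per_minute = requests_per_minute / calls_per_test_case
    # Round up so the estimate errs on the side of staying under the limit.
    return math.ceil(60 / cases_per_minute)


# 120 requests/minute with about 6 judge calls per test case
print(throttle_for_rate_limit(120, 6))  # prints 3
```

The result can be passed as throttle_value, together with a lower max_concurrent, and then tuned against the rate-limit errors actually observed.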
DisplayConfig
from deepeval.evaluate import DisplayConfig
evaluate(..., display_config=DisplayConfig(
    verbose_mode=None,    # override per-metric verbose_mode
    display="all",        # "all" | "failing" | "passing"
    show_indicator=True,  # progress indicator
    print_results=True,   # print results to console
))
ErrorConfig
from deepeval.evaluate import ErrorConfig
evaluate(..., error_config=ErrorConfig(
    skip_on_missing_params=False,  # skip metrics with missing params
    ignore_errors=False,           # ignore metric execution errors
))
CacheConfig
from deepeval.evaluate import CacheConfig
evaluate(..., cache_config=CacheConfig(
    use_cache=False,   # read from cached results
    write_cache=True,  # write results to disk
))
CLI Flags for deepeval test run
| Flag | Description |
|---|---|
| -n 4 | Parallel processes (evaluate test cases in parallel) |
| -c | Use cached results (skip previously evaluated cases) |
| -i | Ignore errors during metric execution |
| -v | Enable verbose mode for all metrics |
| -s | Skip metrics with missing test case params |
| -id "name" | Name the test run (identifier for Confident AI) |
| -d "failing" | Display mode: "all", "passing", or "failing" |
| -r 2 | Repeat each test case N times |
Combine flags:
deepeval test run test_rag.py -n 4 -c -i -v
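When the command is launched from a script or CI job, it can help to assemble the flags programmatically. The sketch below is one possible approach: build_test_run_command and its keyword names are hypothetical, and it only emits the flags listed in the table above.

```python
import shlex


def build_test_run_command(
    test_file,
    processes=None,
    use_cache=False,
    ignore_errors=False,
    verbose=False,
    skip_missing=False,
    identifier=None,
    display=None,
    repeat=None,
):
    """Return an argv list for `deepeval test run` (hypothetical helper)."""
    cmd = ["deepeval", "test", "run", test_file]
    if processes is not None:
        cmd += ["-n", str(processes)]  # parallel processes
    if use_cache:
        cmd.append("-c")               # use cached results
    if ignore_errors:
        cmd.append("-i")               # ignore metric execution errors
    if verbose:
        cmd.append("-v")               # verbose mode for all metrics
    if skip_missing:
        cmd.append("-s")               # skip metrics with missing params
    if identifier is not None:
        cmd += ["-id", identifier]     # name the test run
    if display is not None:
        cmd += ["-d", display]         # "all", "passing", or "failing"
    if repeat is not None:
        cmd += ["-r", str(repeat)]     # repeat each test case N times
    return cmd


command = build_test_run_command(
    "test_rag.py", processes=4, use_cache=True, ignore_errors=True, verbose=True
)
print(shlex.join(command))  # prints: deepeval test run test_rag.py -n 4 -c -i -v
```

The returned list can be handed to subprocess.run(command, check=True), which avoids shell quoting problems when the run identifier contains spaces.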
Hooks
Run custom code after each test run:
import deepeval
@deepeval.on_test_run_end
def after_test_run():
    print("Test run completed!")
End-to-End vs Component-Level Evaluation
DeepEval supports two evaluation approaches:
End-to-End (E2E)
Treats your LLM app as a black box. Evaluates overall input/output.
from deepeval import evaluate
evaluate(
    test_cases=test_cases,  # LLMTestCase or ConversationalTestCase
    metrics=[metric],
    hyperparameters={"model": "gpt-4o"},
)
Best for: Simple architectures (RAG QA, summarization, PDF extraction).
Component-Level
Evaluates individual components using @observe + update_current_span.
Different metrics apply to different components.
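from deepeval.tracing import observe, update_current_span
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
@observe(metrics=[AnswerRelevancyMetric()])
def generator(query, chunks):
    # llm is your own model client, defined elsewhere
    res = llm.invoke(query + "\n".join(chunks))
    # attach a test case to this span so the metric above can score it
    update_current_span(test_case=LLMTestCase(input=query, actual_output=res))
    return res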
Then use evals_iterator() to run:
for golden in dataset.evals_iterator():
    your_llm_app(golden.input)
Best for: Complex architectures (agents, multi-step pipelines).
RAGAS Metric
DeepEval includes a wrapper around the ragas package. RagasMetric is a
combined metric that averages the ragas versions of AnswerRelevancy,
Faithfulness, ContextualPrecision, and ContextualRecall.
from deepeval.metrics.ragas import RagasMetric
from deepeval.test_case import LLMTestCase
metric = RagasMetric(threshold=0.5, model="gpt-4o-mini")
test_case = LLMTestCase(
    input="What is the return policy?",
    actual_output="30-day full refund.",
    expected_output="30 day full refund at no extra cost.",
    retrieval_context=["All customers eligible for 30 day full refund."],
)
metric.measure(test_case)
Note: DeepEval's native RAG metrics (Part 2) are recommended over RAGAS.
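To make the averaging concrete, here is a small worked example in plain Python. It assumes an unweighted mean of the four component scores, compared against the threshold. The component scores are made-up values for illustration, and combined_score is a hypothetical helper, not a DeepEval or ragas function.

```python
from statistics import fmean


def combined_score(component_scores: dict) -> float:
    """Unweighted mean of the component scores (assumed aggregation)."""
    return fmean(component_scores.values())


# Made-up component scores for a single test case
scores = {
    "answer_relevancy": 1.0,
    "faithfulness": 0.75,
    "contextual_precision": 0.5,
    "contextual_recall": 0.25,
}

threshold = 0.5
score = combined_score(scores)
print(score, score >= threshold)  # prints: 0.625 True
```

Note that a single weak component (0.25 here) can be masked by the average, which is one reason to inspect the individual metrics as well.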
Key Environment Variables
| Variable | Effect |
|---|---|
| OPENAI_API_KEY | OpenAI API key for default judge |
| CONFIDENT_API_KEY | Confident AI login for cloud features |
| DEEPEVAL_TELEMETRY_OPT_OUT=1 | Disable telemetry |
| DEEPEVAL_RESULTS_FOLDER | Export test run JSON to directory |
| DEEPEVAL_VERBOSE_MODE=1 | Global verbose mode |
| DEEPEVAL_DISABLE_TIMEOUTS=1 | Disable enforced timeouts |
| DEEPEVAL_RETRY_MAX_ATTEMPTS | Total retry attempts (default 2) |
| DEEPEVAL_FILE_SYSTEM=READ_ONLY | Restrict file writes |
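These variables are usually exported in the shell or a .env file, but they can also be set from Python. The sketch below shows one way to apply defaults without overriding what the shell already defines. The helper apply_eval_env, the ./eval_results directory, and the retry value of 3 are illustrative assumptions. Running it before deepeval is imported is the safe assumption, since libraries often read their environment at import time.

```python
import os

# Variable names come from the table above; the values are illustrative.
EVAL_ENV = {
    "DEEPEVAL_TELEMETRY_OPT_OUT": "1",
    "DEEPEVAL_RESULTS_FOLDER": "./eval_results",  # hypothetical directory
    "DEEPEVAL_RETRY_MAX_ATTEMPTS": "3",           # illustrative value
}


def apply_eval_env(defaults: dict) -> dict:
    """Set each variable unless it is already defined; return the effective values."""
    for name, value in defaults.items():
        # setdefault keeps any value exported by the shell or CI system
        os.environ.setdefault(name, value)
    return {name: os.environ[name] for name in defaults}


effective = apply_eval_env(EVAL_ENV)
for name, value in effective.items():
    print(f"{name}={value}")
```

Secrets such as OPENAI_API_KEY and CONFIDENT_API_KEY are deliberately left out of the dictionary; they are better supplied by the shell or a secrets manager than hard-coded in source.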
Provider selection via CLI (recommended)
deepeval set-openai --model=gpt-4o
deepeval set-anthropic -m claude-3-7-sonnet-latest
deepeval set-ollama --model=llama3
deepeval set-gemini --model=gemini-2.0-flash
deepeval set-local-model --model=my-model --base-url=http://localhost:8000/v1/
Full reference: deepeval.com/docs/environment-variables