DeepEval Guide — Part 18: Evaluation Configs, Flags & Reference

Configs for evaluate()

All configs are optional parameters passed to evaluate() or PromptOptimizer. They control concurrency, display, error handling, and caching.

AsyncConfig

from deepeval.evaluate import AsyncConfig
from deepeval import evaluate

evaluate(
    test_cases=[...], metrics=[...],
    async_config=AsyncConfig(
        run_async=True,       # concurrent evaluation (default True)
        max_concurrent=20,    # max parallel test cases (default 20)
        throttle_value=0,     # seconds between test cases (default 0)
    ),
)

Lower max_concurrent and increase throttle_value to avoid rate limits.
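
For example, a rate-limit-friendly setup might look like the following sketch (the values are illustrative; tune them to your provider's quota):

evaluate(
    test_cases=[...], metrics=[...],
    async_config=AsyncConfig(
        max_concurrent=5,     # fewer parallel test cases
        throttle_value=2,     # wait 2 seconds between test cases
    ),
)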

DisplayConfig

from deepeval.evaluate import DisplayConfig

evaluate(..., display_config=DisplayConfig(
    verbose_mode=None,     # override per-metric verbose_mode
    display="all",         # "all" | "failing" | "passing"
    show_indicator=True,   # progress indicator
    print_results=True,    # print results to console
))

ErrorConfig

from deepeval.evaluate import ErrorConfig

evaluate(..., error_config=ErrorConfig(
    skip_on_missing_params=False,  # skip metrics with missing params
    ignore_errors=False,           # ignore metric execution errors
))

CacheConfig

from deepeval.evaluate import CacheConfig

evaluate(..., cache_config=CacheConfig(
    use_cache=False,    # read from cached results
    write_cache=True,   # write results to disk
))

CLI Flags for deepeval test run

Flag            Description
-n 4            Evaluate test cases in N parallel processes
-c              Use cached results (skip previously evaluated cases)
-i              Ignore errors during metric execution
-v              Enable verbose mode for all metrics
-s              Skip metrics with missing test case params
-id "name"      Name the test run (identifier for Confident AI)
-d "failing"    Display mode: "all", "passing", or "failing"
-r 2            Repeat each test case N times

Combine flags:

deepeval test run test_rag.py -n 4 -c -i -v

Hooks

Run custom code after each test run:

import deepeval

@deepeval.on_test_run_end
def after_test_run():
    print("Test run completed!")

End-to-End vs Component-Level Evaluation

DeepEval supports two evaluation approaches:

End-to-End (E2E)

Treats your LLM app as a black box. Evaluates overall input/output.

from deepeval import evaluate

evaluate(
    test_cases=test_cases,      # LLMTestCase or ConversationalTestCase
    metrics=[metric],
    hyperparameters={"model": "gpt-4o"},
)

Best for: Simple architectures (RAG QA, summarization, PDF extraction).
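
In practice, an E2E run builds one LLMTestCase per golden by calling your app as a black box. A minimal sketch, where query_my_rag_app is a hypothetical stand-in for your own application:

from deepeval import evaluate
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset(goldens=[Golden(input="What is the return policy?")])

test_cases = []
for golden in dataset.goldens:
    answer = query_my_rag_app(golden.input)   # hypothetical: your own LLM app
    test_cases.append(LLMTestCase(input=golden.input, actual_output=answer))

evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={"model": "gpt-4o"},
)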

Component-Level

Evaluates individual components using @observe + update_current_span. Different metrics apply to different components.

from deepeval.tracing import observe, update_current_span
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

@observe(metrics=[AnswerRelevancyMetric()])
def generator(query, chunks):
    # `llm` is your own LLM client; `chunks` are retrieved passages
    res = llm.invoke(query + "\n" + "\n".join(chunks))
    update_current_span(test_case=LLMTestCase(input=query, actual_output=res))
    return res

Then run your app over a dataset of goldens with evals_iterator():

for golden in dataset.evals_iterator():
    your_llm_app(golden.input)

Best for: Complex architectures (agents, multi-step pipelines).
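
Putting it together, a component-level run might look like the sketch below. search_index is a hypothetical retrieval helper, generator is the component defined above, and each component carries its own metric:

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(metrics=[FaithfulnessMetric()])
def retriever(query):
    chunks = search_index(query)              # hypothetical vector search
    update_current_span(test_case=LLMTestCase(
        input=query,
        actual_output="\n".join(chunks),
        retrieval_context=chunks,
    ))
    return chunks

@observe()
def your_llm_app(query):
    return generator(query, retriever(query))  # generator from the snippet above

dataset = EvaluationDataset(goldens=[Golden(input="What is the return policy?")])
for golden in dataset.evals_iterator():
    your_llm_app(golden.input)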


RAGAS Metric

DeepEval includes RagasMetric, a wrapper around the ragas library: it averages the AnswerRelevancy, Faithfulness, ContextualPrecision, and ContextualRecall scores from the ragas package into a single combined metric.

from deepeval.metrics.ragas import RagasMetric
from deepeval.test_case import LLMTestCase

metric = RagasMetric(threshold=0.5, model="gpt-4o-mini")
test_case = LLMTestCase(
    input="What is the return policy?",
    actual_output="30-day full refund.",
    expected_output="30 day full refund at no extra cost.",
    retrieval_context=["All customers eligible for 30 day full refund."],
)
metric.measure(test_case)

Note: DeepEval's native RAG metrics (Part 2) are recommended over RAGAS.


Key Environment Variables

Variable                          Effect
OPENAI_API_KEY                    OpenAI API key for the default judge model
CONFIDENT_API_KEY                 Confident AI login for cloud features
DEEPEVAL_TELEMETRY_OPT_OUT=1      Disable telemetry
DEEPEVAL_RESULTS_FOLDER           Export test run JSON to this directory
DEEPEVAL_VERBOSE_MODE=1           Enable verbose mode globally
DEEPEVAL_DISABLE_TIMEOUTS=1       Disable enforced timeouts
DEEPEVAL_RETRY_MAX_ATTEMPTS       Total retry attempts (default 2)
DEEPEVAL_FILE_SYSTEM=READ_ONLY    Restrict file writes

Set the default judge model via the CLI:

deepeval set-openai --model=gpt-4o
deepeval set-anthropic -m claude-3-7-sonnet-latest
deepeval set-ollama --model=llama3
deepeval set-gemini --model=gemini-2.0-flash
deepeval set-local-model --model=my-model --base-url=http://localhost:8000/v1/

Full reference: deepeval.com/docs/environment-variables
