DeepEval Guide — Part 19: Reference Appendix
DeepTeam Migration (Red Teaming)
Red teaming has moved to a dedicated package, DeepTeam (pip install deepteam).
DeepTeam works within deepeval's ecosystem (custom models, metrics).
from deepteam import red_team
from deepteam.vulnerabilities import Bias
from deepteam.attacks import PromptInjection
from deepteam.attacks.multi_turn import LinearJailbreaking
def model_callback(input: str) -> str:
    # Replace this with a call to your LLM application
    return f"I'm sorry but I can't answer this: {input}"

risk = red_team(
    model_callback=model_callback,
    vulnerabilities=[Bias(types=["race"])],
    attacks=[PromptInjection(), LinearJailbreaking()],
)
Red Teaming Vulnerability Classes
All classes live in deepteam.vulnerabilities. Each accepts a types list.
| Class | Purpose | Example Types |
|---|---|---|
| Bias | Identify/mitigate biases | "race", "gender", "religion" |
| Toxicity | Resist harmful/offensive content | "profanity", "insults" |
| Misinformation | Avoid false/misleading content | "factual error" |
| PIILeakage | Resist disclosing personal info | "direct pii disclosure" |
| Robustness | Resist malicious inputs/hijacking | "hijacking" |
| UnauthorizedAccess | Resist security exploits | "rbac" |
| PersonalSafety | Avoid jeopardizing individual safety | "bullying", "self-harm" |
| ExcessiveAgency | Stay within intended scope | "functionality" |
| IllegalActivity | Resist promoting unlawful actions | "violent crime" |
| IntellectualProperty | Avoid IP infringement | "copyright violations" |
| PromptLeakage | Resist revealing system prompt | "secrets and credentials" |
| Competition | Avoid disclosing competitive info | "competitor mention" |
| GraphicContent | Resist explicit material | "sexual content" |
Arena Test Cases (Full Reference)
An ArenaTestCase compares N contestants to pick the best output.
It works with the ArenaGEval metric only.
from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant
c1 = Contestant(
    name="GPT-4",
    hyperparameters={"model": "gpt-4"},
    test_case=LLMTestCase(input="Capital of France?", actual_output="Paris"),
)
c2 = Contestant(
    name="Claude-4",
    hyperparameters={"model": "claude-4"},
    test_case=LLMTestCase(
        input="Capital of France?",
        actual_output="Paris is the capital of France.",
    ),
)
test_case = ArenaTestCase(contestants=[c1, c2])
Rules: input and expected_output must be identical across all contestants.
Contestant params: name (str), test_case (LLMTestCase), hyperparameters (dict, optional).
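The matching rule above can be checked before building an arena test case. The following is a minimal sketch using only the standard library: StubCase is a hypothetical stand-in for LLMTestCase (it is not part of deepeval), and mismatched_contestants is an illustrative helper name, not a library function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubCase:
    """Hypothetical stand-in for LLMTestCase (only the fields compared here)."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None


def mismatched_contestants(cases: dict[str, StubCase]) -> list[str]:
    """Return names whose input/expected_output differ from the first contestant."""
    if not cases:
        return []
    names = list(cases)
    ref = cases[names[0]]
    return [
        name
        for name in names[1:]
        if (cases[name].input, cases[name].expected_output)
        != (ref.input, ref.expected_output)
    ]


cases = {
    "GPT-4": StubCase("Capital of France?", "Paris"),
    "Claude-4": StubCase("Capital of France?", "Paris is the capital of France."),
    "Other": StubCase("Capital of Spain?", "Madrid"),  # different input on purpose
}
print(mismatched_contestants(cases))
```

Only actual_output is allowed to differ, so the helper ignores it and compares the remaining two fields against the first contestant.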
Images in Arena
from deepeval.test_case import Contestant, LLMTestCase, MLLMImage

shoes = MLLMImage(url="./shoes.png", local=True)
contestant = Contestant(
    name="GPT-4V",
    test_case=LLMTestCase(input=f"What's here? {shoes}", actual_output="Red shoes"),
)
Running Arena Evaluation
from deepeval import compare
from deepeval.metrics import ArenaGEval
from deepeval.test_case import LLMTestCaseParams
arena = ArenaGEval(
    name="Friendly",
    criteria="Choose the friendlier contestant",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
# results is a dict[str, int] of win counts, e.g. {"GPT-4": 1, "Claude-4": 0}
results = compare(test_cases=[test_case], metric=arena)
gpt4_wins = results.get("GPT-4", 0)
claude_wins = results.get("Claude-4", 0)
Arena masks contestant names and randomizes their order. There is no pass/fail, only win counts.
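Because the result is only a tally of wins, any rate or ranking has to be derived from it. A minimal sketch, assuming the dict[str, int] shape described above; win_rates is an illustrative helper name and the counts are made-up sample data, not real evaluation results.

```python
def win_rates(results: dict[str, int]) -> dict[str, float]:
    """Convert per-contestant win counts into fractions of all decided rounds."""
    total = sum(results.values())
    if total == 0:
        # No rounds decided: report zero for everyone instead of dividing by zero
        return {name: 0.0 for name in results}
    return {name: wins / total for name, wins in results.items()}


sample = {"GPT-4": 1, "Claude-4": 3}  # made-up counts for illustration
rates = win_rates(sample)
best = max(rates, key=rates.get)
print(rates, best)
```

With a small number of test cases the rates are noisy, so they are better read as a rough ordering than as a precise score.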
Data Privacy & Telemetry
Telemetry opt-out: DeepEval tracks basic usage (eval count, metric type) via Sentry. No PII is collected. To disable:
export DEEPEVAL_TELEMETRY_OPT_OUT=1
Error reporting (opt-in): off by default. To enable:
export ERROR_REPORTING=1
Confident AI storage: data is stored in a private AWS cloud, and only your organization has access. A VIP plan is available for enhanced compliance: confident-ai.com/pricing.
Troubleshooting
TLS / SSL Errors
If uploads fail with SSLCertVerificationError:
# 1. Check connectivity and the certificate chain with curl
curl -v https://api.confident-ai.com/
# 2. Clear any CA overrides, then check from Python
unset REQUESTS_CA_BUNDLE SSL_CERT_FILE SSL_CERT_DIR
python -c "import requests; print(requests.get('https://api.confident-ai.com').status_code)"
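Before unsetting the three CA variables, it can help to see which of them are set and whether they point at paths that exist, since a stale path is one plausible cause of verification failures. A minimal standard-library sketch; ca_overrides is an illustrative helper name, not part of deepeval or requests.

```python
import os
from typing import Mapping

CA_VARS = ("REQUESTS_CA_BUNDLE", "SSL_CERT_FILE", "SSL_CERT_DIR")


def ca_overrides(env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Map each CA override variable that is set to whether its path exists."""
    return {
        name: os.path.exists(env[name])
        for name in CA_VARS
        if env.get(name)
    }


# Report the overrides active in the current process environment
for name, exists in ca_overrides().items():
    status = "exists" if exists else "MISSING"
    print(f"{name}={os.environ[name]} ({status})")
```

If nothing is printed, no override is set and the unset step changes nothing, which points the investigation toward the network or proxy instead.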
Logging
import logging
logging.basicConfig(level=logging.DEBUG)
| Variable | Purpose |
|---|---|
| LOG_LEVEL | Global log level (DEBUG, INFO) |
| DEEPEVAL_VERBOSE_MODE | Extra warnings/diagnostics |
| DEEPEVAL_LOG_STACK_TRACES | Stack traces in retry logs |
Timeout Tuning
export DEEPEVAL_PER_TASK_TIMEOUT_SECONDS_OVERRIDE=600
export DEEPEVAL_PER_ATTEMPT_TIMEOUT_SECONDS_OVERRIDE=120
export DEEPEVAL_RETRY_MAX_ATTEMPTS=2
See also: DEEPEVAL_TASK_GATHER_BUFFER_SECONDS_OVERRIDE,
DEEPEVAL_RETRY_INITIAL_SECONDS, DEEPEVAL_RETRY_EXP_BASE.
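These settings interact: the per-task timeout should leave room for every attempt plus the waits between them. The sketch below estimates that worst case under a conventional exponential backoff model, which is an assumption; DeepEval's actual schedule (jitter, caps, whether the first attempt counts toward the maximum) may differ. The backoff values 1.0 and 2.0 are illustrative, not documented defaults.

```python
def worst_case_seconds(per_attempt: float, max_attempts: int,
                       initial_backoff: float, exp_base: float) -> float:
    """Upper bound: every attempt times out, with exponential waits in between.

    Assumed model: the wait before retry i (0-based) is initial_backoff * exp_base**i,
    and max_attempts includes the first attempt.
    """
    waits = sum(initial_backoff * exp_base ** i for i in range(max_attempts - 1))
    return per_attempt * max_attempts + waits


# Values from the exports above, plus illustrative backoff settings
budget = worst_case_seconds(per_attempt=120, max_attempts=2,
                            initial_backoff=1.0, exp_base=2.0)
fits = budget <= 600  # compare against the per-task timeout
print(budget, fits)
```

Under these assumptions the worst case stays well below the 600-second task timeout, so a retry is never cut off by the outer limit.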
Dotenv Loading
DeepEval loads dotenv files at import time. Precedence, lowest to highest: .env → .env.{APP_ENV} → .env.local.
DEEPEVAL_DISABLE_DOTENV=1 pytest -q # skip dotenv
ENV_DIR_PATH=/path/to/project pytest -q # custom dir
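The layering can be pictured as successive dictionary updates. This sketch reads the priority chain above as lowest to highest precedence and additionally assumes that variables already set in the process environment win over every file; both readings are assumptions about DeepEval's loader. APP_NAME and all values are hypothetical, and merge_layers is an illustrative helper name.

```python
def merge_layers(*layers: dict[str, str]) -> dict[str, str]:
    """Merge layers in order; a key in a later layer overrides earlier ones."""
    merged: dict[str, str] = {}
    for layer in layers:
        merged.update(layer)
    return merged


base = {"LOG_LEVEL": "INFO", "APP_NAME": "demo"}   # .env
per_env = {"LOG_LEVEL": "DEBUG"}                   # .env.{APP_ENV}
local = {"APP_NAME": "demo-local"}                 # .env.local
process_env = {"LOG_LEVEL": "WARNING"}             # already exported in the shell

effective = merge_layers(base, per_env, local, process_env)
print(effective)
```

A value that seems to be ignored is often being overridden by a later layer, so checking the layers from the last one backwards is usually the quickest way to find it.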
Reset / Persist Settings
from deepeval.config.settings import reset_settings, get_settings

reset_settings(reload_dotenv=True)  # rebuild settings, re-reading dotenv files
settings = get_settings()
with settings.edit(save="dotenv"):  # persist the change to a dotenv file
    settings.DEEPEVAL_VERBOSE_MODE = True
Miscellaneous
Opt in to version update warnings:
export DEEPEVAL_UPDATE_WARNING_OPT_IN=1