
DeepEval Guide — Part 13: LLM Benchmarks

What Are LLM Benchmarks?

Benchmarks are standardized tests to measure LLM performance on various skills: reasoning, knowledge, code generation, truthfulness, bias.

A benchmark consists of:
- Tasks: evaluation datasets with target labels (expected_output)
- Scorer: determines whether predictions are correct (usually exact match)
- Prompting techniques: few-shot learning and/or Chain-of-Thought (CoT)
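To make the scorer idea concrete, here is a minimal sketch of exact-match scoring in plain Python. The function names `exact_match` and `accuracy` are hypothetical and written for illustration; DeepEval's own scorer may normalize answers differently.

```python
from typing import List


def exact_match(prediction: str, expected_output: str) -> int:
    """Return 1 when both strings are equal after trimming whitespace, else 0."""
    return int(prediction.strip() == expected_output.strip())


def accuracy(predictions: List[str], expected_outputs: List[str]) -> float:
    """Fraction of predictions that exactly match their target label."""
    if len(predictions) != len(expected_outputs):
        raise ValueError("predictions and expected_outputs must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, e) for p, e in zip(predictions, expected_outputs))
    return hits / len(predictions)


# Illustrative data only: two of the four predictions match their labels.
preds = ["A", "C", " B ", "The answer is D"]
labels = ["A", "B", "B", "D"]
print(accuracy(preds, labels))  # 0.5
```

Note that the last prediction contains the right letter but still scores 0, which is why output format matters so much under exact match.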

To benchmark any LLM, wrap it in DeepEvalBaseLLM (see Part 10).


Quick Start

from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE, MMLUTask.ASTRONOMY],
    n_shots=3,
)
benchmark.evaluate(model=my_custom_llm, batch_size=5)

print("Overall:", benchmark.overall_score)   # 0.0 — 1.0
print("Tasks:\n", benchmark.task_scores)     # pandas DataFrame
print("Details:\n", benchmark.predictions)   # per-question DataFrame

batch_size sets how many prompts are generated per batch. It only takes effect if your custom LLM implements batch_generate(). It is available for all benchmarks except HumanEval and GSM8K.
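The sketch below illustrates what batching means in practice: prompts are split into chunks of batch_size and each chunk is passed to batch_generate() in one call. `EchoLLM`, `chunked`, and `run_in_batches` are hypothetical names used only for this illustration; this is not DeepEval's internal code, and a real custom LLM would subclass DeepEvalBaseLLM as described in Part 10.

```python
from typing import Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class EchoLLM:
    """Hypothetical stand-in model: upper-cases prompts instead of calling an LLM."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()

    def batch_generate(self, prompts: List[str]) -> List[str]:
        # A real model would answer all prompts in one batched inference call here.
        return [self.generate(p) for p in prompts]


def run_in_batches(model: EchoLLM, prompts: List[str], batch_size: int) -> List[str]:
    """Collect outputs batch by batch, preserving prompt order."""
    outputs: List[str] = []
    for batch in chunked(prompts, batch_size):
        outputs.extend(model.batch_generate(batch))
    return outputs


print(run_in_batches(EchoLLM(), ["q1", "q2", "q3"], batch_size=2))
```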


All 16 Benchmarks

| # | Benchmark | What It Tests | Size | Scorer | Config |
|---|-----------|---------------|------|--------|--------|
| 1 | MMLU | Academic knowledge (57 subjects) | 14K MCQ | Exact match | tasks, n_shots (≤5) |
| 2 | HellaSwag | Commonsense reasoning / sentence completion | 10K | Exact match | tasks, n_shots (≤15) |
| 3 | BIG-Bench Hard | 23 challenging multi-step reasoning tasks | 6.5K | Exact match | tasks, n_shots (≤3), enable_cot |
| 4 | DROP | Numerical reasoning over paragraphs | 9.5K | Exact match | tasks, n_shots (≤5) |
| 5 | TruthfulQA | Truthfulness (common misconceptions) | 817 | Exact/truth ID | tasks, mode (MC1/MC2) |
| 6 | HumanEval | Code generation (Python) | 164 | pass@k | tasks, n |
| 7 | GSM8K | Grade-school math word problems | 1.3K | Exact match | n_problems, n_shots (≤3), enable_cot |
| 8 | IFEval | Instruction-following compliance | | Custom | n_problems |
| 9 | SQuAD | Reading comprehension (Wikipedia QA) | 10K | LLM-as-judge | tasks, n_shots (≤5), evaluation_model |
| 10 | MathQA | Multi-step math reasoning (GRE/GMAT level) | 37K MCQ | Exact match | tasks, n_shots (≤5) |
| 11 | LogiQA | Logical/deductive reasoning | 8.7K MCQ | Exact match | tasks, n_shots (≤5) |
| 12 | BoolQ | Yes/No reading comprehension | 3.3K | Exact match | n_problems, n_shots (≤5) |
| 13 | ARC | Science reasoning (grades 3–9) | 8K MCQ | Exact match | n_problems, n_shots (≤5), mode (EASY/CHALLENGE) |
| 14 | BBQ | Social bias detection in QA | 58K | Exact match | tasks, n_shots (≤5) |
| 15 | LAMBADA | Context comprehension (predict last word) | 5.2K | Exact match | n_problems, n_shots (≤5) |
| 16 | Winogrande | Commonsense reasoning (binary choice) | 1.3K | Exact match | n_problems, n_shots (≤5) |

Configuration Options

Tasks — select benchmark subsets

from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

benchmark = MMLU(tasks=[MMLUTask.MACHINE_LEARNING, MMLUTask.ASTRONOMY])

By default, all tasks are used. Each benchmark has its own Task enum.

Few-Shot Learning (n_shots)

from deepeval.benchmarks import HellaSwag

benchmark = HellaSwag(n_shots=5)

More shots generally improve format compliance, which in turn tends to raise scores under exact-match scoring.
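As a rough picture of what n_shots controls, the sketch below prepends n solved examples to the question being asked. The template and the function `build_few_shot_prompt` are assumptions made for illustration; each DeepEval benchmark uses its own prompt template.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]], question: str, n_shots: int
) -> str:
    """Prefix the target question with up to n_shots solved (question, answer) pairs."""
    if n_shots < 0:
        raise ValueError("n_shots must be non-negative")
    shots = examples[:n_shots]
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    # The final block leaves the answer empty for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Illustrative examples only, not taken from any benchmark dataset.
examples = [
    ("2 + 2 = ?  A) 3  B) 4", "B"),
    ("Capital of France?  A) Paris  B) Rome", "A"),
    ("Largest planet?  A) Mars  B) Jupiter", "B"),
]
prompt = build_few_shot_prompt(examples, "5 * 3 = ?  A) 15  B) 8", n_shots=2)
print(prompt)
```

The solved examples show the model that a bare letter is the expected answer format, which is the mechanism behind the format-compliance gain.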

Chain-of-Thought (enable_cot)

Only BIG-Bench Hard and GSM8K support CoT:

from deepeval.benchmarks import BigBenchHard

benchmark = BigBenchHard(n_shots=3, enable_cot=True)

Code Examples by Benchmark

MMLU — 57 academic subjects, MCQ format

from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE],
    n_shots=5,
)
benchmark.evaluate(model=my_llm)

TruthfulQA — MC1 (single correct) or MC2 (multi-true)

from deepeval.benchmarks import TruthfulQA
from deepeval.benchmarks.tasks import TruthfulQATask
from deepeval.benchmarks.modes import TruthfulQAMode

benchmark = TruthfulQA(
    tasks=[TruthfulQATask.HEALTH, TruthfulQATask.FINANCE],
    mode=TruthfulQAMode.MC2,
)
benchmark.evaluate(model=my_llm)

HumanEval — code generation with pass@k

from deepeval.benchmarks import HumanEval
from deepeval.benchmarks.tasks import HumanEvalTask

benchmark = HumanEval(
    tasks=[HumanEvalTask.HAS_CLOSE_ELEMENTS, HumanEvalTask.SORT_NUMBERS],
    n=100,
)
benchmark.evaluate(model=my_llm, k=10)

Requires generate_samples(prompt, n, temperature) method on the model.
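For intuition about the pass@k metric, the sketch below implements the commonly used unbiased estimator from the original HumanEval paper: with n samples per problem of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` is hypothetical, and DeepEval's internal computation may differ in detail.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples drawn from n passes.

    n: total samples generated for the problem
    c: number of those samples that passed the tests
    k: number of samples considered per attempt
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb returns 0 when fewer than k failing samples exist, giving pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=2, c=1, k=1))    # 0.5
print(pass_at_k(n=100, c=0, k=10)) # 0.0
```

This is why n must be at least k: in the example above, n=100 samples are generated so that pass@10 can be estimated.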

SQuAD — reading comprehension with LLM-as-judge

from deepeval.benchmarks import SQuAD
from deepeval.benchmarks.tasks import SQuADTask

benchmark = SQuAD(
    tasks=[SQuADTask.PHARMACY, SQuADTask.NORMANS],
    n_shots=3,
    evaluation_model="gpt-4.1",
)
benchmark.evaluate(model=my_llm)

GSM8K / ARC — numeric and science reasoning

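from deepeval.benchmarks import GSM8K, ARC
from deepeval.benchmarks.modes import ARCMode

gsm = GSM8K(n_problems=100, n_shots=3, enable_cot=True)
arc = ARC(n_problems=100, n_shots=3, mode=ARCMode.CHALLENGE)

# Run each benchmark against the wrapped model, as in the examples above
gsm.evaluate(model=my_llm)
arc.evaluate(model=my_llm)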

Reading Results

After evaluate() runs, every benchmark exposes:
- benchmark.overall_score: a float from 0.0 to 1.0
- benchmark.task_scores: a pandas DataFrame with one row per task
- benchmark.predictions: a pandas DataFrame with one row per question


Important Notes

  1. Output format matters — most benchmarks require MCQ letter answers (A/B/C/D). If your LLM produces full sentences, scores will be near 0. Use JSON confinement (see Part 10) to ensure correct output format.

  2. Few-shot improves format — always use maximum allowed n_shots for best format compliance.

  3. batch_generate() — implement this optional method in your custom LLM to speed up benchmarking significantly.

  4. SQuAD is unique — it uses LLM-as-judge scoring instead of exact match, making it more flexible but requiring an evaluation_model.

  5. HumanEval is unique — uses pass@k metric and requires generate_samples() method (not generate()).
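To illustrate note 1, here is a minimal sketch of normalizing a verbose model response to a single MCQ letter before returning it from generate(). The helper `extract_choice` and its patterns are assumptions for illustration, not part of DeepEval; JSON confinement as described in Part 10 is the more robust approach.

```python
import re
from typing import Optional

# A bare letter, optionally wrapped in parentheses or followed by "." or ":".
_BARE = re.compile(r"^\(?([A-D])\)?[.:]?$")
# Phrases such as "The answer is B" or "Answer: (C)"; only the word "answer" is case-insensitive.
_PHRASE = re.compile(r"(?i:answer)\s*(?:is|:)\s*\(?([A-D])\)?\b")


def extract_choice(raw_output: str) -> Optional[str]:
    """Return the chosen letter A-D, or None when no clear choice is found."""
    text = raw_output.strip()
    bare = _BARE.match(text)
    if bare:
        return bare.group(1)
    phrase = _PHRASE.search(text)
    return phrase.group(1) if phrase else None


print(extract_choice("(C)"))               # C
print(extract_choice("The answer is D."))  # D
print(extract_choice("Paris is the capital of France"))  # None
```

Returning None rather than guessing keeps ambiguous outputs visible as failures instead of silently scoring a random letter.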
