Skip to content

Reliability & Flakiness

What Is a Flaky Test

A test that sometimes passes and sometimes fails with no code change. Flakiness destroys trust. Teams that see random failures start ignoring failures. A red build that nobody investigates is worse than no CI at all.


Root Causes

Cause Description
Async timing Test proceeds before async operation completes
Shared mutable state One test modifies data another test depends on
External dependencies Third-party API times out or returns unexpected data
Race conditions Parallel tests compete for the same resource
Hardcoded timestamps Test expects exact time that differs across runs
Non-deterministic ordering Test passes only when run after another test
OS / network variability CI machine slower than dev; timeouts too tight

Solutions by Cause

Async Timing → Explicit Waits

# Bad — guessing with sleep
import time
time.sleep(2)
assert api_client.orders.get(order_id)["status"] == "PROCESSED"

# Good — poll until condition or timeout
import time


def wait_for_order_status(
    api_client,
    order_id: str,
    expected: str,
    timeout: float = 10.0,
    interval: float = 0.5,
) -> None:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = api_client.orders.get(order_id)["status"]
        if status == expected:
            return
        time.sleep(interval)
    raise TimeoutError(
        f"Order {order_id} status never reached '{expected}' within {timeout}s"
    )

Shared State → Test Isolation

Every test creates its own data. No test reads data written by another test. Use builders with UUID-based defaults (see 03-test-data).

External Dependencies → Mocking

Isolate tests from unstable third parties using mocks or contract tests. Do not call real payment gateways or email services in automated tests.

Hardcoded Timestamps → Frozen Time

from freezegun import freeze_time


@freeze_time("2026-01-15 12:00:00")
def test_token_expires_after_24_hours(token_service):
    token = token_service.create(user_id="user-123")
    assert token_service.is_expired(token) is False

    # Advance time 25 hours
    with freeze_time("2026-01-16 13:00:00"):
        assert token_service.is_expired(token) is True

Quarantine Pattern

When a test is identified as flaky but cannot be fixed immediately:

@pytest.mark.xfail(
    reason="Flaky: race condition in order processing, tracked in ISSUE-4521",
    strict=False,
)
def test_order_processed_within_5s(api_client, verified_user):
    ...

strict=False — test is quarantined: pass or fail, CI stays green. strict=True — test must fail; use to track tests expected to fail (known bugs).

Quarantined tests must be tracked in a ticket. Not a permanent state.


Flakiness Risk Register

ID Test Cause Status
FL-001 test_order_processed_within_5s Async race condition Quarantined, ISSUE-4521
FL-002 test_email_sent_after_signup External SMTP timing Mocked, resolved
FL-003 test_checkout_redirect Slow CI animation Explicit wait added

Track flakiness systematically. A flaky test costs more in wasted investigation than the time it would take to fix.


Flakiness Detection in CI

# GitHub Actions — rerun on failure to detect flakiness
- name: Run tests with flakiness detection
  run: |
    uv run pytest -n auto --reruns=2 --reruns-delay=1 \
      --report-log=test-results/log.json

A test that passes on rerun was flaky. Mark it, investigate, fix it.


Stability Metrics

Track these per week:

Metric Formula Target
Flakiness rate flaky_failures / total_runs < 1%
Mean time to detect time from merge to failure detection < 10 min
False positive rate reruns that pass / total reruns < 0.5%