Reliability & Flakiness
What Is a Flaky Test
A test that sometimes passes and sometimes fails with no code change. Flakiness destroys trust. Teams that see random failures start ignoring failures. A red build that nobody investigates is worse than no CI at all.
Root Causes
| Cause | Description |
|---|---|
| Async timing | Test proceeds before async operation completes |
| Shared mutable state | One test modifies data another test depends on |
| External dependencies | Third-party API times out or returns unexpected data |
| Race conditions | Parallel tests compete for the same resource |
| Hardcoded timestamps | Test expects exact time that differs across runs |
| Non-deterministic ordering | Test passes only when run after another test |
| OS / network variability | CI machine slower than dev; timeouts too tight |
Solutions by Cause
Async Timing → Explicit Waits
# Bad — guessing with sleep
import time
time.sleep(2)
assert api_client.orders.get(order_id)["status"] == "PROCESSED"
# Good — poll until condition or timeout
import time
def wait_for_order_status(
api_client,
order_id: str,
expected: str,
timeout: float = 10.0,
interval: float = 0.5,
) -> None:
deadline = time.monotonic() + timeout
while time.monotonic() < deadline:
status = api_client.orders.get(order_id)["status"]
if status == expected:
return
time.sleep(interval)
raise TimeoutError(
f"Order {order_id} status never reached '{expected}' within {timeout}s"
)
Shared State → Test Isolation
Every test creates its own data. No test reads data written by another test.
Use builders with UUID-based defaults (see 03-test-data).
External Dependencies → Mocking
Isolate tests from unstable third parties using mocks or contract tests. Do not call real payment gateways or email services in automated tests.
Hardcoded Timestamps → Frozen Time
from freezegun import freeze_time
@freeze_time("2026-01-15 12:00:00")
def test_token_expires_after_24_hours(token_service):
token = token_service.create(user_id="user-123")
assert token_service.is_expired(token) is False
# Advance time 25 hours
with freeze_time("2026-01-16 13:00:00"):
assert token_service.is_expired(token) is True
Quarantine Pattern
When a test is identified as flaky but cannot be fixed immediately:
@pytest.mark.xfail(
reason="Flaky: race condition in order processing, tracked in ISSUE-4521",
strict=False,
)
def test_order_processed_within_5s(api_client, verified_user):
...
strict=False — test is quarantined: pass or fail, CI stays green.
strict=True — test must fail; use to track tests expected to fail (known bugs).
Quarantined tests must be tracked in a ticket. Not a permanent state.
Flakiness Risk Register
| ID | Test | Cause | Status |
|---|---|---|---|
| FL-001 | test_order_processed_within_5s |
Async race condition | Quarantined, ISSUE-4521 |
| FL-002 | test_email_sent_after_signup |
External SMTP timing | Mocked, resolved |
| FL-003 | test_checkout_redirect |
Slow CI animation | Explicit wait added |
Track flakiness systematically. A flaky test costs more in wasted investigation than the time it would take to fix.
Flakiness Detection in CI
# GitHub Actions — rerun on failure to detect flakiness
- name: Run tests with flakiness detection
run: |
uv run pytest -n auto --reruns=2 --reruns-delay=1 \
--report-log=test-results/log.json
A test that passes on rerun was flaky. Mark it, investigate, fix it.
Stability Metrics
Track these per week:
| Metric | Formula | Target |
|---|---|---|
| Flakiness rate | flaky_failures / total_runs | < 1% |
| Mean time to detect | time from merge to failure detection | < 10 min |
| False positive rate | reruns that pass / total reruns | < 0.5% |