Reliability

Reliability means the system keeps working correctly when parts of it fail. The question is not if failures happen, but when.

12.1 Retry

When a request fails, retry after a wait. Rules:

- Only retry idempotent operations (GET, PUT, DELETE). For POST, use idempotency keys.
- Use exponential backoff: 100ms → 200ms → 400ms → 800ms.
- Add jitter (a random 0-50% added to each backoff interval). This prevents synchronized retry storms.
- Set max retries (3-5). Do not retry forever.
- Respect the Retry-After header from the server.

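import random
import time


class TransientError(Exception):
    """Network or server errors worth retrying (e.g. 503, connection reset)."""


def retry_with_backoff(fn, max_attempts=4, base_delay=0.1):
    """Call fn(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: 100ms, 200ms, 400ms, ...
            backoff = base_delay * (2 ** attempt)
            # Jitter: add a random 0-50% of the interval to desynchronize clients.
            time.sleep(backoff * (1 + random.uniform(0, 0.5)))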
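
The loop above does not look at the Retry-After header. A minimal sketch of how the wait could be chosen when the server supplies one, assuming the header value arrives as a number of seconds (the HTTP-date form is not handled here). The function name `next_wait` and the `cap` parameter are illustrative assumptions, not part of the original rules.

```python
import random


def next_wait(attempt, retry_after=None, base=0.1, cap=30.0):
    """Return the seconds to sleep before the next retry (attempt is 0-based)."""
    if retry_after is not None:
        # The server asked for a specific delay: honor it, bounded by the cap.
        return min(float(retry_after), cap)
    # Exponential backoff, bounded by the cap (an assumed upper limit).
    backoff = min(base * (2 ** attempt), cap)
    # Jitter: a random 0-50% added on top of the backoff interval.
    return backoff * (1 + random.uniform(0, 0.5))
```

A retry loop would call `time.sleep(next_wait(attempt, retry_after))`, passing whatever value it extracted from the failed response, or `None` when the header is absent.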

12.2 Circuit Breaker

A circuit breaker stops calling a failing service. This prevents cascading failures.

States

Closed (normal): requests pass through. Track error rate.
Open (failing): all requests fail immediately without calling the service. Wait for cool-down.
Half-open (testing): let a small number of probe requests through. If they succeed → Closed. If any fails → Open again.

Configuration

Failure threshold: an error rate of at least 50% over 20 or more requests → Open.
Cool-down period: 30-60 seconds in Open state before testing.
Half-open probes: 1-5 test requests.

For critical dependencies (auth, payments): use stricter thresholds (20-30% failure rate). For non-critical (recommendations, analytics): use looser thresholds (40-50%).
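
The states and configuration above can be sketched as a small class. This is a simplified illustration under stated assumptions, not a production breaker: it counts errors cumulatively instead of over a sliding window, uses a single probe in the half-open state, and is not thread-safe. The names `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_rate=0.5, min_requests=20, cool_down=30.0,
                 clock=time.monotonic):
        self.failure_rate = failure_rate  # e.g. 0.2-0.3 for critical dependencies
        self.min_requests = min_requests  # do not trip on a tiny sample
        self.cool_down = cool_down        # seconds to stay Open before probing
        self.clock = clock                # injectable so tests can fake time
        self.state = "closed"
        self.opened_at = 0.0
        self.successes = 0
        self.failures = 0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cool_down:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half_open"  # cool-down elapsed: allow one probe
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half_open":
            self._reset()  # probe succeeded: back to Closed
        else:
            self.successes += 1

    def _on_failure(self):
        if self.state == "half_open":
            self._trip()  # probe failed: Open again
            return
        self.failures += 1
        total = self.successes + self.failures
        if total >= self.min_requests and self.failures / total >= self.failure_rate:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()

    def _reset(self):
        self.state = "closed"
        self.successes = 0
        self.failures = 0
```

Callers wrap each outbound request as `breaker.call(lambda: client.get(...))` (where `client` stands for whatever HTTP client is in use) and treat `CircuitOpenError` as an immediate failure.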

12.3 Timeout

Set a timeout on every external call. Never use an infinite timeout.

Recommended defaults:

- Internal service calls: 500ms - 1000ms.
- Database queries: 200ms - 500ms.
- External APIs: 3000ms - 5000ms.

When a timeout fires, abandon the call and return a clear error. Do not keep waiting: a hanging request holds a thread, a connection, and memory.
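
One way to enforce this in asynchronous Python code is `asyncio.wait_for`, which cancels the awaited call when the deadline passes. The sketch below is illustrative: `call_with_timeout`, `DependencyTimeout`, and the `DEFAULT_TIMEOUTS_S` table (the upper ends of the ranges above) are assumed names, not part of any library.

```python
import asyncio

# Upper ends of the recommended ranges, in seconds (assumed starting points).
DEFAULT_TIMEOUTS_S = {"internal": 1.0, "database": 0.5, "external_api": 5.0}


class DependencyTimeout(Exception):
    """Clear, typed error for a call that exceeded its time budget."""


async def call_with_timeout(make_call, timeout_s):
    """Await make_call() but give up after timeout_s seconds."""
    try:
        # wait_for cancels the underlying coroutine when the timeout expires.
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise DependencyTimeout(f"call exceeded {timeout_s}s") from None
```

Synchronous clients usually expose an equivalent per-call timeout argument. The point is the same: every call gets a finite budget, and the caller receives a typed error it can handle.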

12.4 Graceful Degradation

When a dependency is down, degrade gracefully rather than fail completely.

Examples:

- Recommendations service down → show popular items (cached fallback).
- Search service slow → serve cached last results.
- Payment service unavailable → queue the transaction for retry.

Classify dependencies:

- Critical (auth, core data): return error if unavailable.
- Non-critical (enrichment, recommendations): return default/cached/empty.
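
The classification can be expressed as a thin wrapper around each dependency call. A minimal sketch, assuming failures surface as a single exception type. `DependencyUnavailable` and `call_dependency` are hypothetical names chosen for illustration.

```python
class DependencyUnavailable(Exception):
    """The dependency is down, timed out, or its circuit is open."""


def call_dependency(fn, critical, fallback=None):
    """Call fn(); on failure, re-raise if critical, otherwise return fallback."""
    try:
        return fn()
    except DependencyUnavailable:
        if critical:
            # Critical (auth, core data): the caller must see the error.
            raise
        # Non-critical (enrichment, recommendations): degrade to a default.
        return fallback
```

For example, a recommendations lookup could pass `critical=False` with a cached list of popular items as `fallback`, while an auth check would pass `critical=True`.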

12.5 Graceful Shutdown

On SIGTERM (deployment, scale-down):

1. Stop accepting new requests (close listen socket).
2. Finish in-flight requests (with a max wait of 15-30s).
3. Close database connections and message consumers.
4. Exit.

This prevents dropped requests during deployments.
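
The bookkeeping behind steps 1 and 2 can be sketched with a small helper that refuses new work once SIGTERM arrives and waits, up to a limit, for in-flight requests. `ShutdownCoordinator` and its method names are assumptions made for illustration. A real server framework usually provides its own drain hooks.

```python
import signal
import threading
import time


class ShutdownCoordinator:
    """Tracks in-flight requests so the process can drain before exiting."""

    def __init__(self, max_wait=20.0):
        self.max_wait = max_wait  # seconds; the steps above suggest 15-30s
        self.stopping = False
        self._in_flight = 0
        self._cond = threading.Condition()

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Keep the handler trivial: only set a flag.
        self.stopping = True

    def begin_request(self):
        """Return False once shutdown has started, so the caller refuses work."""
        with self._cond:
            if self.stopping:
                return False
            self._in_flight += 1
            return True

    def end_request(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self):
        """Wait for in-flight requests; return True if all finished in time."""
        deadline = time.monotonic() + self.max_wait
        with self._cond:
            while self._in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(timeout=remaining)
        return True
```

Once `stopping` is set, the main loop would close the listen socket, call `drain()`, close database connections and message consumers, and then exit, whether or not the drain completed within the limit.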