Cross-Cutting: Performance, Scalability and Reliability

Performance

Target SLOs: p50 < 100ms, p95 < 500ms, p99 < 1000ms for most APIs. Adjust per use case. Internal services should be faster than public APIs.
Sources of latency: network round-trip, serialization/deserialization, database queries, external service calls, garbage collection pauses.
Optimization strategies:
Connection pooling: reuse database and HTTP connections instead of creating new ones
Query optimization: add indexes, analyze query plans, avoid N+1 queries
Response compression: gzip or brotli for responses larger than 1KB
Payload minimization: return only needed fields, use sparse fieldsets
Async processing: offload heavy work to background queues (Celery, SQS, Kafka)
Caching: cache at multiple levels (CDN, application, database query cache)

Measurement: requests per second (RPS) the system handles at acceptable latency. Measure under realistic load, not just peak.
Bottlenecks: CPU (computation, serialization), Memory (caching, buffers), I/O (database, network), Connection limits (pool exhaustion).
Load testing tools: k6 (modern, scriptable), Locust (Python-based), wrk (raw HTTP benchmarking), Apache Benchmark (simple tests). Run load tests before every major release.

CPU: serialization, TLS handshakes, compression, business logic. Monitor per-request CPU time. Profile hot paths.
Memory: connection state, caches, request buffers. Watch for memory leaks in long-running services. Set memory limits.
Connections: database connection pools, HTTP keep-alive connections. Set pool size limits. Monitor active vs idle connections.

Stateless services: easy to scale. Add more instances behind load balancer. Store session data externally (Redis, database).
Stateful services (WebSocket, gRPC streaming): need sticky sessions or shared state store (Redis Pub/Sub for WebSocket fan-out).
Database scaling:
Read replicas: route read queries to replicas, writes to primary
Sharding: partition data across multiple databases by key
Connection pooling: PgBouncer, ProxySQL to manage connection limits
Auto-scaling: scale based on CPU utilization, memory usage, request count, or queue depth. Set minimum and maximum instance counts.

Algorithms:
Round Robin: simple, even distribution across healthy instances
Least Connections: route to instance with fewest active requests
IP Hash: sticky sessions based on client IP
Weighted: assign different weights to servers with different capacity
Health checks: load balancer calls /health endpoint periodically. Remove unhealthy instances from rotation. Restore when healthy again.
gRPC consideration: HTTP/2 connections are long-lived and multiplexed. L4 (TCP) load balancing sends all requests on one connection to one server. Need L7 (application-layer) load balancing to distribute individual RPCs.

Exponential backoff: wait 1s, 2s, 4s, 8s between retries. Prevents thundering herd when many clients retry at the same time.
Jitter: add random delay to backoff (e.g., 1s + random(0, 1s)). Prevents synchronized retries from many clients hitting at once.
Max retries: set limit (3-5 retries). Don't retry forever. Log each retry for debugging.
Idempotency requirement: only retry idempotent operations safely (GET, PUT, DELETE). For POST, use idempotency keys so server detects duplicate requests.
Retry-After header: server tells client when to retry (seconds or date). Clients must respect this header. Common with 429 and 503 responses.

Circuit breaker: if service fails N times, stop calling it for M seconds. After timeout, try one test request. If success, close circuit. If fail, keep open. States: CLOSED (normal) -> OPEN (failing) -> HALF-OPEN (testing).
Timeout: set timeout on every external call. Default 5 seconds. Adjust per service based on expected response time. Never use infinite timeout.
Bulkhead: isolate failures. One slow dependency shouldn't block all requests. Use separate thread pools or connection pools per external service.
Fallback: if service is down, return cached data or default values. Degraded but functional. Better than full failure.
Health checks: /health (liveness) and /ready (readiness) endpoints. Health = process is alive. Ready = can serve traffic (database connected, etc.). Kubernetes uses both to manage pod lifecycle.

Feature flags: disable non-critical features under high load. Example: disable recommendations but keep core search working.
Load shedding: reject low-priority requests when overloaded. Return 503 Service Unavailable with Retry-After header.
Priority queues: process critical requests first. Payment processing before analytics events.
Graceful shutdown: on SIGTERM, stop accepting new requests, finish in-flight requests (with timeout), then exit. Prevents dropped connections during deployments.