Observability

Observability means you can understand what the system is doing from its outputs. Three pillars: logs, metrics, traces.

14.1 Logging

Structured logging

JSON format, not plain text. Include key fields in every log entry: - timestamp, level, service, trace_id, request_id - method, path, status, duration_ms - user_id (where relevant)

Example:

{"ts":"2026-03-25T12:00:00Z","level":"info","service":"user-api","trace_id":"abc123",
 "request_id":"req-456","method":"POST","path":"/users","status":201,"duration_ms":45}

Log levels

DEBUG: detailed dev info (disabled in production).
INFO: normal operations.
WARNING: unexpected but handled (e.g. retried request).
ERROR: failure needing attention.
CRITICAL: system-level failure.

Sensitive data

Never log passwords, tokens, credit cards, PII. Mask: email: u***@example.com.

14.2 Metrics (RED method)

Track per service: - Rate: requests per second. - Error rate: % of requests returning 5xx. - Duration: p50, p95, p99 latency.

Also track: - Active connections (WebSocket servers, DB pool usage). - Queue depth (backpressure signal). - Business metrics: orders/min, signups/hour.

Tools: Prometheus + Grafana (open-source), Datadog (SaaS), CloudWatch (AWS).

14.3 Distributed Tracing

One user request may go through 5 services. A trace connects all of them.

Each request gets a unique trace_id at the edge.
Each service adds a span (operation name, start time, duration, status).
Spans form a tree. Parent span = full request. Child spans = individual operations.
Propagate trace context via HTTP headers (traceparent, W3C standard).

Tools: Jaeger, Zipkin, Tempo (open-source), Datadog APM (SaaS). Standard SDK: OpenTelemetry (vendor-neutral, use for all new projects).

14.4 Edge Observability

Observe the edge layer specifically:

API Gateway metrics

Request count per route.
Error rate per route.
Authentication failure rate.
Rate limit triggers.

Load Balancer metrics

Healthy vs. unhealthy backend count.
Connections per backend.
Request distribution evenness.
Backend response time (not just LB response time).