Microservice Production Readiness Checklist

Use before deploying any new service to production. Each section is a gate.

Observability

Logs written to STDOUT/STDERR only — no file writes, no sidecar needed
Structured JSON format with consistent field names across all services
Every log line has: timestamp, level, service, trace_id, span_id, message
Correlation ID (trace ID) propagated via headers and injected into every log
Log levels defined and documented: ERROR, WARN, INFO, DEBUG
DEBUG/TRACE disabled by default in production; switchable without redeploy
No PII in logs (emails, passwords, tokens, card numbers, SSN)
No secrets logged — even masked partials leak signal to attackers
Logs shipped to centralised aggregation (ELK, Loki, CloudWatch, Google Cloud Logging, formerly Stackdriver)
Log retention policy defined (e.g. 30 days hot, 1 year cold)
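
The logging items above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed implementation: the service name `orders-api`, the helper `build_logger`, and the sample IDs are assumptions made for the example.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the six required fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # trace_id / span_id arrive via `extra=`; None marks a line logged outside a trace.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Hypothetical helper: JSON to STDOUT, INFO and above, no file handlers."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)  # DEBUG stays off unless the level is changed at runtime
    logger.propagate = False
    return logger


log = build_logger("orders-api")  # hypothetical service name
log.info(
    "order created",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"},
)
```

Calling `log.setLevel(logging.DEBUG)` from an admin hook or config watcher is one way to make DEBUG switchable without a redeploy.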

RED metrics exposed for every endpoint: Rate (req/s), Errors (%), Duration (latency)
USE metrics for every resource: Utilisation, Saturation, Errors (CPU, memory, connections)
Latency measured in percentiles — p50, p95, p99 — not averages
Business metrics tracked: orders created, payments processed, emails sent
Metrics endpoint exposed in Prometheus format (/metrics) or pushed to aggregator
Metric names follow a consistent convention: service_domain_action_unit
Dashboards exist in Grafana / Datadog for all RED + business metrics
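
Why percentiles rather than averages: a small sketch using the nearest-rank definition of a percentile. The latency values are invented for illustration, and a real service would normally use histogram metrics from its metrics library instead of raw sample lists.

```python
import math
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20.0] * 98 + [2000.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
print(f"mean={mean_ms:.1f} p50={p50} p95={p95} p99={p99}")
```

With this data the mean is 59.6 ms, a latency that no request actually experienced, while p99 reports the 2-second tail directly.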

OpenTelemetry SDK integrated (unified standard for traces, metrics, logs)
Trace ID propagated via traceparent header (W3C standard) to all downstream calls
Spans created for: HTTP calls, DB queries, queue publish/consume, cache reads
Traces exported to backend: Jaeger, Tempo, Datadog, New Relic, X-Ray
Sampling strategy defined — 100% in staging, adaptive/tail in production
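
To make the traceparent item concrete, the sketch below parses and re-emits a W3C `traceparent` header (`version-traceid-parentid-flags`) by hand. It is for illustration only and handles version `00` alone; in practice the OpenTelemetry SDK's propagators do this work. The function names and the choice to sample newly started traces are assumptions.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "00", 32-hex trace ID, 16-hex parent (span) ID, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: Optional[str]) -> Optional[TraceContext]:
    """Return the trace context, or None if the header is missing or malformed."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def outgoing_traceparent(incoming: Optional[str]) -> str:
    """Continue the caller's trace with a fresh span ID, or start a new trace."""
    ctx = parse_traceparent(incoming)
    trace_id = ctx.trace_id if ctx else secrets.token_hex(16)
    sampled = ctx.sampled if ctx else True  # assumption: new traces are sampled
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what lets the backend stitch spans into one trace.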

Liveness probe: returns 200 if process is alive, 500 if stuck (restart me)
Readiness probe: returns 200 only if service can handle traffic (remove from LB if not)
Startup probe configured for slow-starting services (avoid premature kill)
Health checks do NOT call downstream services — they only check local state
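
A minimal sketch of probe endpoints that inspect local state only. The paths `/livez` and `/readyz`, the heartbeat mechanism, the 30-second stall threshold, and the 503 readiness status are assumptions for the example; any status outside the 2xx/3xx range counts as a probe failure in Kubernetes.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ServiceState:
    """Local state only; nothing here calls a downstream service."""

    def __init__(self, max_stall_seconds: float = 30.0) -> None:
        self.started = threading.Event()   # set once initialisation has finished
        self.draining = threading.Event()  # set when shutdown begins
        self.max_stall_seconds = max_stall_seconds
        self.last_heartbeat = time.monotonic()

    def beat(self) -> None:
        """Called by the main work loop to show it is still making progress."""
        self.last_heartbeat = time.monotonic()


def probe_status(path: str, state: ServiceState) -> int:
    if path == "/livez":
        stalled = time.monotonic() - state.last_heartbeat > state.max_stall_seconds
        return 500 if stalled else 200  # 500 asks the orchestrator to restart the process
    if path == "/readyz":
        ready = state.started.is_set() and not state.draining.is_set()
        return 200 if ready else 503    # not ready: remove from the load balancer
    return 404


def make_handler(state: ServiceState):
    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            self.send_response(probe_status(self.path, state))
            self.end_headers()

        def log_message(self, *args) -> None:
            pass  # keep probe traffic out of the application logs

    return ProbeHandler


def serve_probes(state: ServiceState, port: int = 8080) -> None:
    """Blocking; typically run in a background thread next to the real server."""
    ThreadingHTTPServer(("0.0.0.0", port), make_handler(state)).serve_forever()
```

Setting `draining` at the start of shutdown makes readiness fail first, so the load balancer stops routing traffic while in-flight work completes.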

App handles SIGTERM: stops accepting new requests, finishes in-flight, then exits
Shutdown timeout < pod termination grace period in Kubernetes (terminationGracePeriodSeconds; preStop hook time counts against it)
Queue consumers drain the current message before stopping
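
One way to sketch the SIGTERM and queue-drain behaviour: the signal handler only sets a flag, and the consumer loop checks that flag between messages, so the current message always finishes. The `fetch` and `process` callables are hypothetical stand-ins for a real queue client.

```python
import signal
import threading
from typing import Callable, Optional

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame) -> None:
    # Only flip a flag here; the work loop finishes the in-flight message itself.
    shutdown_requested.set()


def install_handler() -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_consumer(
    fetch: Callable[[], Optional[str]],
    process: Callable[[str], None],
    poll_interval: float = 0.1,
) -> int:
    """Process messages until shutdown is requested; return the number handled."""
    handled = 0
    while not shutdown_requested.is_set():
        message = fetch()
        if message is None:
            shutdown_requested.wait(poll_interval)  # idle, but wake early on SIGTERM
            continue
        process(message)  # never interrupted: the flag is checked between messages
        handled += 1
    return handled
```

Messages not yet fetched stay on the queue for another replica to pick up.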

Timeout set on every outbound HTTP/gRPC call — no call without a deadline
Timeout value = 99th-percentile latency of the healthy dependency + 10–20% buffer
Retry only on idempotent operations; max 3 retries with exponential backoff + jitter
Circuit breaker opens after N consecutive failures; half-open after cool-down
Fallback defined for every external dependency: can the service partially succeed?
Rate limiting and backpressure implemented if service receives fan-in traffic
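
The retry and circuit-breaker items can be sketched as follows. This is a simplified, single-threaded illustration: it catches every `Exception`, whereas real code would retry only transient errors, and the default thresholds and delays are assumptions. Production services usually rely on a resilience library or service mesh rather than hand-rolled code.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 5.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retry(op: Callable[[], T], max_retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run an idempotent operation, retrying up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(backoff_delay(attempt))
    raise RuntimeError("unreachable")


class CircuitOpenError(RuntimeError):
    """Raised instead of calling a dependency whose circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cool_down: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cool_down = cool_down
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cool_down:
            return "half-open"  # cool-down elapsed: let a trial call through
        return "open"

    def call(self, op: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("dependency circuit is open")
        try:
            result = op()
        except Exception:
            self.failures += 1
            # A failed half-open trial, or N consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0      # any success resets the consecutive-failure count
        self.opened_at = None  # and closes the circuit
        return result
```

A caller can catch `CircuitOpenError` and serve the fallback (cached data, a default, or a partial response) instead of waiting on a dependency that is known to be failing.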

HPA (Kubernetes) or equivalent configured — scales on CPU, RPS, or custom metric
Minimum 2 replicas in production; PodDisruptionBudget prevents all replicas being evicted at once during voluntary disruptions (node drains, upgrades)
Resource requests and limits set for CPU and memory on every container
Load tested at expected peak + 2× headroom before going live
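
For intuition on how an HPA arrives at a replica count, the sketch below applies the scaling rule from the Kubernetes documentation: desired = ceil(current replicas × current metric / target metric), with no change while the ratio stays within a tolerance of 1.0. It is a simplification that ignores stabilisation windows, missing metrics, and unready pods; the default bounds are assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 2,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Replica count suggested by the HPA ratio rule, clamped to [min, max]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: do not scale
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical example: 4 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```

The `min_replicas=2` floor mirrors the two-replica minimum above: even at near-zero load the service keeps a second instance for availability.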

Service not exposed to public internet unless required; VPN or internal DNS otherwise
Authentication in place for all external-facing endpoints (JWT / OAuth2 / mTLS)
HTTPS only for external traffic; TLS between services where required
Secrets stored in secret manager (Vault, AWS Secrets Manager) — not in env files
Container runs as non-root user
Dependency vulnerability scan in CI (Snyk, Trivy, Dependabot)
GDPR: no PII stored without legal basis; data retention policy documented
Bot configured to auto-update dependencies (Renovate / Dependabot)

Unit tests cover domain logic; test coverage ≥ 70% (coverage number is a floor, not a goal)
Integration tests cover DB, queue, and cache interactions
Contract tests (Pact) for every service-to-service HTTP dependency
Smoke test runs after every deployment to verify basic availability
Load test baseline established; result documented in README

Service owns its own database — no shared DB across service boundaries
DB connection pool sized for expected concurrency; connection timeout set
Migrations automated (Flyway, Liquibase, Alembic); never manual SQL in production
Migrations run as separate pre-deploy step, not inside the app on startup
DB backup configured; restore tested at least once
Dead letter queue configured for every consumer; poison messages don't block the queue
Encryption at rest enabled; encryption in transit enforced
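
The dead letter queue item can be illustrated with an in-memory sketch: a message that keeps failing is moved aside after a fixed number of attempts so the messages behind it still get processed. Real brokers (SQS, RabbitMQ, Kafka consumers) provide their own redelivery and DLQ mechanisms; the `Message` type and the limit of 3 attempts are assumptions for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque


@dataclass
class Message:
    body: str
    attempts: int = 0


def consume(
    queue: Deque[Message],
    dead_letters: Deque[Message],
    handler: Callable[[str], None],
    max_attempts: int = 3,
) -> int:
    """Drain `queue`; messages that fail max_attempts times go to `dead_letters`."""
    processed = 0
    while queue:
        msg = queue.popleft()
        try:
            handler(msg.body)
            processed += 1
        except Exception:
            msg.attempts += 1
            if msg.attempts >= max_attempts:
                dead_letters.append(msg)  # park it for inspection; stop retrying
            else:
                queue.append(msg)         # requeue at the back so others proceed
    return processed
```

Alerting on dead-letter queue depth keeps parked messages from going unnoticed.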

Define before launch, review in post-mortems.

Question	If NO → action
Can you find any request in logs within 5 s?	Add correlation ID + structured logging
Can you see p99 latency on a dashboard right now?	Add RED metrics + Grafana dashboard
Do you get paged before users notice an outage?	Add alerting on error rate + availability
Does the service recover on its own after a dependency blip?	Add retry + circuit breaker
Can you deploy at 2 AM without fear?	Add smoke tests + rollback runbook