Testing in Production
Why Test in Production
Staging is not production. No matter how well staging mirrors production, the real environment has real data volumes, real traffic patterns, and real edge cases that staging never sees.
Testing in production means running safe, non-destructive verification against the live system after every deployment.
Smoke Tests
Minimal set of tests run immediately after every deployment. Must complete in under 2 minutes. Must test the critical path only.
# tests/smoke/test_critical_path.py
import pytest
import requests
BASE_URL = os.environ["SMOKE_BASE_URL"]
@pytest.mark.smoke
def test_health_endpoint():
response = requests.get(f"{BASE_URL}/health", timeout=5)
assert response.status_code == 200
assert response.json()["status"] == "ok"
@pytest.mark.smoke
def test_auth_endpoint_reachable():
response = requests.post(
f"{BASE_URL}/auth/token",
json={"username": "smoke@example.com", "password": "wrong"},
timeout=5,
)
assert response.status_code == 401 # reachable, returns expected error
@pytest.mark.smoke
def test_api_returns_json():
response = requests.get(f"{BASE_URL}/products?limit=1", timeout=5)
assert response.status_code == 200
assert isinstance(response.json(), (list, dict))
# In deploy pipeline
- name: Smoke tests
run: |
uv run pytest -m smoke --timeout=30 -v
env:
SMOKE_BASE_URL: ${{ vars.PRODUCTION_URL }}
Smoke test failure → automatic rollback trigger.
Synthetic Monitoring
Continuously run test scenarios against production on a schedule. Not after deploys only — 24/7, every 1–5 minutes.
# GitHub Actions — scheduled synthetic monitor
name: Synthetic Monitoring
on:
schedule:
- cron: "*/5 * * * *" # every 5 minutes
jobs:
monitor:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v3
- name: Run synthetic checks
run: |
uv run pytest tests/synthetic/ \
--timeout=15 -q
env:
MONITOR_URL: ${{ vars.PRODUCTION_URL }}
MONITOR_TOKEN: ${{ secrets.MONITOR_TOKEN }}
- name: Alert on failure
if: failure()
uses: slackapi/slack-github-action@v1
with:
payload: '{"text":":rotating_light: Synthetic monitor failed: ${{ github.run_url }}"}'
env:
SLACK_WEBHOOK_URL: ${{ secrets.ONCALL_WEBHOOK }}
Synthetic monitoring catches: - External dependencies that went down - DNS or certificate issues - Configuration drift in production - Gradual performance degradation between deploys
Canary Validation
During canary deployment, validate that the canary version meets SLOs before routing more traffic to it.
# Argo Rollouts analysis template
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: success-rate
spec:
metrics:
- name: success-rate
interval: 2m
successCondition: result[0] >= 0.99
failureLimit: 3
provider:
prometheus:
address: http://prometheus:9090
query: |
sum(rate(http_requests_total{
service="api",status=~"2.."
}[2m])) /
sum(rate(http_requests_total{service="api"}[2m]))
If the success rate drops below 99% three times in a row → canary aborted, rollback.
Observability After Deploy
The 10 minutes after a production deploy are the highest-risk window. Watch these in real time:
Deploy complete
↓
T+0m: Error rate baseline check (should be 0%)
T+2m: p95 latency check (should be within SLO)
T+5m: Memory trend (should be stable)
T+10m: DB connection pool (should not be growing)
T+30m: No memory creep → deployment healthy
Testing in Production Rules
| Rule | Reason |
|---|---|
| Tests must be read-only | Never write to production DB in tests |
| Use dedicated test accounts | Isolate test traffic from real users |
| Rate limit test requests | Prevent smoke tests from adding load |
| Never test payment flows for real | Use sandbox mode or skip |
| Log all synthetic requests | Distinguish from real traffic in metrics |
Mark synthetic traffic so it can be filtered from user-facing metrics:
headers = {
"X-Synthetic-Request": "true",
"User-Agent": "SyntheticMonitor/1.0",
}