Rollback Strategies
Why Rollback Must Be Instant
Every deployment can fail; what matters is how quickly you recover. A rollback that was planned and rehearsed in advance can take about 30 seconds. One improvised during an incident can easily take 30 minutes.
Rollback target: previous known-good version deployed within 2 minutes.
Instant Rollback — Blue-Green
The fastest rollback method: keep the previous environment running and switch traffic back to it.
Production traffic → [Green v2] (current, broken)
↓ flip load balancer
Production traffic → [Blue v1] (previous, healthy)
Switch time: seconds. Zero downtime. No redeployment needed.
# Kubernetes: patch the Service to point to the previous deployment
kubectl patch service api-service \
  -p '{"spec":{"selector":{"version":"v1"}}}'
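The selector flip can also be scripted so it runs the same way in every incident. The following is a minimal Python sketch, assuming the Service selects Pods by a version label as in the command above. The helper names build_selector_patch and switch_traffic are hypothetical, and the run parameter exists only so the command can be dry-run without a cluster.

```python
import json
import subprocess


def build_selector_patch(version):
    """Return the JSON patch body that points a Service at one version label."""
    return json.dumps({"spec": {"selector": {"version": version}}})


def switch_traffic(service, version, run=subprocess.run):
    """Patch the Service selector so traffic goes to Pods labelled `version`.

    `run` defaults to subprocess.run; pass a stub to dry-run the command.
    Returns the command that was executed, which is handy for incident logs.
    """
    command = [
        "kubectl", "patch", "service", service,
        "-p", build_selector_patch(version),
    ]
    run(command, check=True)  # raises CalledProcessError if kubectl fails
    return command
```

Calling switch_traffic("api-service", "v1") issues the same patch as the kubectl command above.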
Version Rollback — Kubernetes
Kubernetes keeps a revision history for each Deployment (10 revisions by default, controlled by spec.revisionHistoryLimit). Roll back to any revision still in that history:
# View history
kubectl rollout history deployment/api-service
# Roll back to previous
kubectl rollout undo deployment/api-service
# Roll back to specific revision
kubectl rollout undo deployment/api-service --to-revision=3
# Monitor rollback
kubectl rollout status deployment/api-service
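To hold a rollback to the 2-minute target, the undo and the status check can be wrapped together with a bounded wait. This is a sketch under assumptions: rollback_commands and roll_back are hypothetical helper names, and the 120-second default simply mirrors the target stated at the top of this document. It relies on the --timeout flag of kubectl rollout status.

```python
import subprocess


def rollback_commands(deployment, revision=None, timeout_seconds=120):
    """Build the kubectl commands for an undo followed by a bounded status wait."""
    undo = ["kubectl", "rollout", "undo", f"deployment/{deployment}"]
    if revision is not None:
        # Without a revision, kubectl rolls back to the previous one.
        undo.append(f"--to-revision={revision}")
    status = [
        "kubectl", "rollout", "status", f"deployment/{deployment}",
        f"--timeout={timeout_seconds}s",
    ]
    return [undo, status]


def roll_back(deployment, revision=None, timeout_seconds=120, run=subprocess.run):
    """Run the undo, then wait for it to finish.

    Raises subprocess.CalledProcessError if either step fails or the
    status wait times out, so a caller can escalate instead of assuming success.
    """
    for command in rollback_commands(deployment, revision, timeout_seconds):
        run(command, check=True)
```

For example, roll_back("api-service", revision=3) corresponds to the --to-revision=3 command above followed by a status check that gives up after two minutes.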
Automated Rollback on Health Failure
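# Kubernetes with Argo Rollouts: auto-rollback on metric breach
# Rollout (fragment): run the analysis while the canary is live
strategy:
  canary:
    analysis:
      templates:
      - templateName: success-rate
      args:
      - name: service-name
        value: api-service
# AnalysisTemplate "success-rate" (fragment): pass/fail criteria live on the metric
metrics:
- name: success-rate
  successCondition: result[0] >= 0.99
  failureLimit: 3
# once failed measurements exceed failureLimit, the rollout aborts and traffic returns to the stable version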
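The interaction between the success condition and the failure limit is easy to misread, so here is a simplified model of it in Python. This is not Argo Rollouts' implementation. It assumes the analysis is marked failed once the number of failed measurements exceeds the limit, and analysis_outcome is a hypothetical name used only for illustration.

```python
def analysis_outcome(measurements, threshold=0.99, failure_limit=3):
    """Classify a series of success-rate measurements.

    A measurement fails when it is below `threshold`. The analysis is
    "failed" as soon as the failure count exceeds `failure_limit`
    (assumed semantics), otherwise it is "successful".
    """
    failures = 0
    for value in measurements:
        if value < threshold:
            failures += 1
            if failures > failure_limit:
                return "failed"
    return "successful"
```

Under this model, three bad measurements are tolerated with a limit of 3 and the fourth triggers the rollback, so the limit should be chosen with the measurement interval in mind.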
Traffic Switching — Canary Rollback
During a canary deployment, abort and route all traffic back to v1:
# Argo Rollouts — abort canary
kubectl argo rollouts abort api-service
# Then revert the Rollout spec to the previous revision so the aborted version is not retried
kubectl argo rollouts undo api-service
# GitHub Actions: rollback job that runs when the deploy-canary job fails
rollback:
  if: failure()
  needs: deploy-canary
  runs-on: ubuntu-latest  # assumed runner with kubectl and cluster credentials configured
  steps:
    - run: kubectl argo rollouts abort ${{ env.SERVICE }}
    - name: Notify
      run: |
        curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
          -H 'Content-Type: application/json' \
          -d '{"text":"Rollback triggered for ${{ env.SERVICE }}"}'
Database Migration Rollback
Code rollback is easy. Database migration rollback is hard.
Strategy: Backwards-Compatible Migrations
Never write migrations that break the previous version of the code. Deploy in two phases:
Phase 1 — Expand (before code deploy):
-- Add new column, keep old one
ALTER TABLE users ADD COLUMN email_v2 VARCHAR(255);
Phase 2 — Contract (after new code is stable):
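-- Remove the old column only when the new code is fully deployed and stable
-- and you no longer need to roll back to the previous version
ALTER TABLE users DROP COLUMN email;
-- Do not rename email_v2 back to email here: the new code reads email_v2,
-- so the rename would break it. A rename needs its own expand/contract cycle.
Between Phase 1 and Phase 2, both old and new code versions work, provided the new code keeps the old column populated (for example, by writing to both columns).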
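The window between the two phases can be demonstrated end to end. The sketch below uses an in-memory SQLite database as a stand-in for the real one. The backfill UPDATE and the dual-write in new_code_write are assumptions added for illustration; they are not steps the text above spells out, and the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email VARCHAR(255))")
conn.execute("INSERT INTO users (id, email) VALUES (1, 'old@example.com')")

# Phase 1, expand: add the new column, keep the old one.
conn.execute("ALTER TABLE users ADD COLUMN email_v2 VARCHAR(255)")
# Assumed backfill so existing rows are readable through the new column.
conn.execute("UPDATE users SET email_v2 = email WHERE email_v2 IS NULL")


def old_code_read(user_id):
    """Previous release: only knows about the `email` column."""
    row = conn.execute("SELECT email FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]


def new_code_write(user_id, email):
    """New release: writes both columns so a rollback still sees fresh data."""
    conn.execute(
        "INSERT INTO users (id, email, email_v2) VALUES (?, ?, ?)",
        (user_id, email, email),
    )


def new_code_read(user_id):
    """New release: reads the new column."""
    row = conn.execute("SELECT email_v2 FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]


# A row created by the new release is still visible to the old release.
new_code_write(2, "new@example.com")
```

Because both readers return the same value for both rows, rolling the code back during this window loses no data. Only the contract step removes that safety net.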
Never Write Destructive Migrations
-- Never in a single deploy:
ALTER TABLE users DROP COLUMN email; -- breaks old code instantly
ALTER TABLE orders RENAME COLUMN total TO amount; -- breaks old code instantly
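A rule like this is easier to keep when a check enforces it. The following is a rough Python sketch of a migration lint that flags drop and rename statements for review. The pattern list and function names are hypothetical, and the parsing is deliberately naive: it splits on semicolons and strips -- comments, so it would misread string literals containing either.

```python
import re

# Statements that remove or rename something the previous release may still use.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\bDROP\s+(COLUMN|TABLE)\b", re.IGNORECASE),
    re.compile(r"\bRENAME\s+(COLUMN|TO)\b", re.IGNORECASE),
]


def strip_sql_comments(sql):
    """Remove `-- ...` line comments so commented-out statements are ignored."""
    return "\n".join(line.split("--", 1)[0] for line in sql.splitlines())


def find_destructive_statements(sql):
    """Return the statements in `sql` that match a destructive pattern."""
    statements = [s.strip() for s in strip_sql_comments(sql).split(";")]
    return [
        s for s in statements
        if s and any(p.search(s) for p in DESTRUCTIVE_PATTERNS)
    ]
```

A CI step could fail the build when find_destructive_statements returns anything for a migration that ships together with a code change, and allow it only for a dedicated contract-phase migration.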
Rollback Decision Matrix
| Situation | Action | Target time |
|---|---|---|
| Error rate spike after deploy | Roll back immediately | < 2 min |
| Latency exceeds SLO | Roll back if not fixable in 15 min | < 2 min |
| One non-critical bug | Fix forward with hotfix | < 30 min |
| Data corruption | Roll back + DB restore | 30–60 min |
| Canary showing errors | Abort canary | < 1 min |
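Encoding the matrix as data lets alerting or chat tooling answer "what do we do now?" without anyone searching for the table. This is a minimal Python sketch. The situation keys are hypothetical labels for the rows above, and the "Escalate to on-call" fallback is an assumption for situations the matrix does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str
    target: str


# Keys are hypothetical labels for the situations in the decision matrix.
DECISION_MATRIX = {
    "error_rate_spike": Decision("Roll back immediately", "< 2 min"),
    "latency_over_slo": Decision("Roll back if not fixable in 15 min", "< 2 min"),
    "non_critical_bug": Decision("Fix forward with hotfix", "< 30 min"),
    "data_corruption": Decision("Roll back + DB restore", "30-60 min"),
    "canary_errors": Decision("Abort canary", "< 1 min"),
}


def decide(situation):
    """Look up the agreed action; unknown situations go to a human (assumed fallback)."""
    return DECISION_MATRIX.get(situation, Decision("Escalate to on-call", "n/a"))
```

For example, decide("canary_errors").action yields "Abort canary", which maps directly to the abort command in the canary section.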
Rollback Runbook
Keep this documented and accessible during incidents:
## Rollback Runbook — API Service
### Kubernetes rollback (30 seconds)
kubectl rollout undo deployment/api-service -n production
kubectl rollout status deployment/api-service -n production
### Verify health after rollback
curl -fsS https://api.example.com/health   # -f: non-zero exit on HTTP 4xx/5xx
### Announce
Post in #incidents: "Rolled back api-service to previous version. Investigating."
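A single successful health request can be misleading while Pods are still restarting, so the verification step can be made stricter by requiring several consecutive successes. The sketch below is one way to do that in Python. The attempt counts, delay, and function names are assumptions, and the fetch and sleep parameters exist so the loop can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code for `url`, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, and similar


def wait_until_healthy(url, attempts=10, delay_seconds=3, required_successes=3,
                       fetch=http_status, sleep=time.sleep):
    """Poll `url` until it returns 200 `required_successes` times in a row.

    Returns True once the streak is reached, False if `attempts` run out.
    """
    streak = 0
    for attempt in range(attempts):
        streak = streak + 1 if fetch(url) == 200 else 0
        if streak >= required_successes:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Running wait_until_healthy("https://api.example.com/health") after the rollback gives a clear yes or no to include in the #incidents announcement.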