Traffic Management
4.1 Routing
Path-based routing
Route requests based on URL path:
/api/v1/users → user service
/api/v1/orders → order service
/static/* → CDN or file service
Each path prefix maps to exactly one upstream target. When more than one prefix matches a request, the gateway picks the longest one.
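Longest-prefix matching can be sketched as follows. This is a minimal illustration, not any particular gateway's implementation: the upstream names and the `resolve_upstream` helper are hypothetical, and prefixes are matched on whole path segments so that `/api/v1/usersfoo` does not match `/api/v1/users`.

```python
# Hypothetical route table; upstream names are illustrative only.
ROUTES = {
    "/api/v1/users": "user-service",
    "/api/v1/orders": "order-service",
    "/static/": "cdn",
}


def resolve_upstream(path, routes=ROUTES):
    """Return the upstream for the longest matching prefix, or None."""
    matches = []
    for prefix in routes:
        base = prefix.rstrip("/")
        # Match the prefix itself or anything below it, on a segment boundary.
        if path == prefix or path.startswith(base + "/"):
            matches.append(prefix)
    if not matches:
        return None
    # The longest matching prefix is the most specific one.
    return routes[max(matches, key=len)]
```

With this table, `resolve_upstream("/api/v1/users/42")` returns `"user-service"` and an unmatched path such as `/health` returns `None`, leaving the fallback behaviour to the caller.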
Host-based routing
Route based on the Host HTTP header:
api.example.com → API cluster
admin.example.com → admin cluster
ws.example.com → WebSocket cluster
Useful when multiple products or tenants share the same IP address. The gateway reads the hostname and forwards to the correct cluster before any application code runs.
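A minimal sketch of the host lookup, assuming the cluster names below (they are hypothetical labels, not real endpoints). It normalises case and drops an optional port, since a Host header may arrive as `API.example.com:443`. IPv6 literals are not handled here.

```python
# Hypothetical mapping from hostname to cluster name.
HOST_ROUTES = {
    "api.example.com": "api-cluster",
    "admin.example.com": "admin-cluster",
    "ws.example.com": "websocket-cluster",
}


def cluster_for_host(host_header, routes=HOST_ROUTES):
    """Map a Host header value to a cluster name, or None if unknown."""
    # Hostnames are case-insensitive and may carry a ":port" suffix.
    hostname = host_header.strip().lower().split(":", 1)[0]
    return routes.get(hostname)
```

Returning `None` for an unknown host lets the gateway decide whether to reject the request or send it to a default cluster.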
4.2 Traffic Shaping
Rate limiting
Limit the number of requests a client can make in a time window. The client is identified by API key, user ID, or IP address.
When the limit is exceeded, return 429 Too Many Requests with a Retry-After header.
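The reject-with-429 behaviour can be sketched with a simple fixed-window counter. The class name, the in-memory dictionary, and the explicit `now` argument (seconds, passed in so the logic is easy to test) are assumptions made for illustration. A production gateway would typically keep counters in shared storage.

```python
import math


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each fixed window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # client_id -> (window_start, count)

    def check(self, client_id, now):
        """Return (status_code, headers) for a request arriving at `now`."""
        window_start = now - (now % self.window)
        start, count = self.counters.get(client_id, (window_start, 0))
        if start != window_start:
            # A new window has begun, so the count resets.
            start, count = window_start, 0
        if count >= self.limit:
            # Tell the client how many seconds remain in this window.
            retry_after = max(1, math.ceil(start + self.window - now))
            return 429, {"Retry-After": str(retry_after)}
        self.counters[client_id] = (start, count + 1)
        return 200, {}
```

The `client_id` can be whichever identifier the gateway uses: API key, user ID, or IP address.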
Common algorithms:
| Algorithm | How it works | Best for |
|---|---|---|
| Fixed window | Count resets every N seconds | Simple cases |
| Sliding window | Count over a rolling time range | Smoother enforcement |
| Token bucket | Add tokens at a fixed rate; each request consumes one | Burst tolerance |
| Leaky bucket | Queue requests and drain at fixed rate | Strict output rate |
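As one concrete instance from the table, a token bucket can be sketched as below. This is a minimal single-client version under stated assumptions: the class name and parameters are illustrative, and time is passed in explicitly as `now` (seconds) rather than read from a clock.

```python
class TokenBucket:
    """Refill tokens at a fixed rate; each allowed request consumes one."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate                # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)   # start with a full bucket
        self.last = now

    def allow(self, now):
        """Return True if a request arriving at `now` may proceed."""
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The `capacity` parameter is what gives the algorithm its burst tolerance: a client that has been idle can send up to `capacity` requests at once, after which it is held to `rate` requests per second.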
Apply rate limiting at the API gateway, before requests reach backend services. This protects the services from overload and abuse without requiring code changes in them.
Throttling
Throttling is softer than rate limiting. Instead of rejecting excess requests, it slows them down.
- Excess requests enter a queue and are processed with a delay.
- The client waits but eventually gets a response.
- Use when short bursts are acceptable but sustained overload is not.
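The delay-instead-of-reject behaviour described above can be sketched as a pacing function. The `Throttle` class and its `max_delay` cutoff are assumptions for illustration. The cutoff models the point where sustained overload is no longer acceptable and the request is refused rather than queued.

```python
class Throttle:
    """Space requests at least 1/rate seconds apart by delaying them."""

    def __init__(self, rate, max_delay=None):
        self.interval = 1.0 / rate   # minimum spacing between requests
        self.max_delay = max_delay   # hypothetical cap on queueing time
        self.next_free = 0.0         # earliest time the next request may run

    def delay_for(self, now):
        """Return seconds to wait for a request arriving at `now`.

        Returns None if the wait would exceed max_delay (request refused).
        """
        start = max(now, self.next_free)
        delay = start - now
        if self.max_delay is not None and delay > self.max_delay:
            return None
        self.next_free = start + self.interval
        return delay
```

A caller would sleep for the returned delay before forwarding the request, so the client sees a slower response instead of an error during a short burst.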
When to choose throttling vs rate limiting: Use rate limiting when you want hard limits (pay-per-use APIs, abuse prevention). Use throttling when you want smooth flow control with no client errors during mild spikes.
4.3 Canary Releases
Gradually roll out a new version by routing a small percentage of traffic to it.
All traffic → Load Balancer
95% → Service v1 (stable)
5% → Service v2 (canary)
Process:
1. Deploy v2 alongside v1. Route 5% of traffic to v2.
2. Monitor error rate, latency, and business metrics on v2.
3. If healthy: increase the percentage in steps (5% → 25% → 50% → 100%).
4. If errors spike: route all traffic back to v1 immediately.
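The traffic split and the promote-or-roll-back decision can be sketched as below. Hashing the client ID to pick a version is an assumption of this sketch (it keeps each client on the same version between requests). The source only specifies the percentages. The function names and the `healthy` flag are likewise hypothetical.

```python
import hashlib

# Rollout percentages taken from the process above.
ROLLOUT_STEPS = [5, 25, 50, 100]


def pick_version(client_id, canary_percent):
    """Send a stable slice of clients to the canary (v2)."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # 0..99
    return "v2" if bucket < canary_percent else "v1"


def next_canary_percent(current, healthy):
    """Advance to the next rollout step, or drop to 0 on bad metrics."""
    if not healthy:
        return 0   # route all traffic back to v1
    for step in ROLLOUT_STEPS:
        if step > current:
            return step
    return 100
```

Because the bucket for a client never changes, a client that saw v2 at 5% continues to see v2 at 25% and beyond, which avoids flapping between versions during the rollout.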
Implementation options:
- NGINX upstream with weight directive
- Envoy weighted clusters
- AWS ALB weighted target groups
- Istio VirtualService with traffic weights
Canary is the preferred strategy when risk reduction matters more than speed of rollout.
4.4 Blue-Green Deployment
Two identical production environments: Blue (current) and Green (new version).
Flow:
1. Green is deployed and tested fully while Blue serves all traffic.
2. Switch: the load balancer routes all traffic from Blue to Green. Instant cutover.
3. Blue stays idle as a rollback target.
4. If issues appear, switch back to Blue in seconds.
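The cutover amounts to flipping a single pointer at the load balancer, which is why both the switch and the rollback are fast. A minimal sketch, with a hypothetical `BlueGreenRouter` class standing in for the load balancer configuration:

```python
class BlueGreenRouter:
    """Track which of two identical environments receives all traffic."""

    def __init__(self):
        self.active = "blue"   # Blue serves traffic initially

    @property
    def idle(self):
        """The environment not currently serving traffic."""
        return "green" if self.active == "blue" else "blue"

    def route(self):
        """Every request goes to the active environment."""
        return self.active

    def switch(self):
        """Cut all traffic over to the idle environment."""
        self.active = self.idle
        return self.active
```

Calling `switch()` once moves traffic to Green. Calling it again is the rollback to Blue.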
Zero downtime. No gradual rollout — it is all-or-nothing.
Blue-Green vs Canary:
| Aspect | Blue-Green | Canary |
|---|---|---|
| Rollout speed | Instant | Gradual |
| Risk exposure | All users at once | Small group first |
| Rollback speed | Instant | Instant |
| Infrastructure cost | 2x during switch | Low |
Use Blue-Green when you need fast rollback and can afford 2x infrastructure briefly. Use Canary when you want to validate with real users before full rollout.
4.5 Failover
Active-passive
- Primary server handles all traffic.
- Standby server is idle, ready to take over.
- On primary failure: load balancer detects via health check and routes all traffic to standby.
- Standby needs 10–60 seconds to become fully active.
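The health-check-driven switch can be sketched as follows. The `failure_threshold` of consecutive failed checks and the absence of automatic failback are assumptions of this sketch, not something the text above prescribes.

```python
class ActivePassive:
    """Route to the primary until repeated health checks fail."""

    def __init__(self, primary, standby, failure_threshold=3):
        self.primary = primary
        self.standby = standby
        self.failure_threshold = failure_threshold  # hypothetical setting
        self.consecutive_failures = 0
        self.active = primary

    def record_health_check(self, primary_ok):
        """Record one health check result and return the active server."""
        if primary_ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if (self.active == self.primary
                and self.consecutive_failures >= self.failure_threshold):
            # Fail over; this sketch does not fail back automatically.
            self.active = self.standby
        return self.active
```

The time to detect the failure (check interval multiplied by the threshold) adds to the time the standby needs to become fully active.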
Trade-offs: Lower cost (standby does nothing), but brief downtime during failover. Use when cost matters more than zero-downtime availability.
Active-active
- Multiple servers handle traffic simultaneously.
- On one server failure: remaining servers absorb the load automatically.
- No failover delay — traffic is already distributed.
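A minimal sketch of why no failover step is needed: requests rotate over whichever servers are currently healthy, so a failed server is simply skipped. Round-robin selection and the `ActiveActivePool` class are assumptions for illustration.

```python
class ActiveActivePool:
    """Round-robin over all servers, skipping unhealthy ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._cursor = 0

    def mark(self, server, ok):
        """Update a server's health status."""
        if ok:
            self.healthy.add(server)
        else:
            self.healthy.discard(server)

    def next_server(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self.servers)):
            server = self.servers[self._cursor % len(self.servers)]
            self._cursor += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")
```

The remaining servers must have enough spare capacity to absorb the extra load, which is part of the higher cost of this setup.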
Trade-offs: Higher cost (all servers active), but no downtime. Use when your SLA requires continuous availability and cannot tolerate even brief outages.
Key decision factor: Does your system tolerate 10–60 seconds of downtime? If yes, active-passive. If no, active-active.
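One way to make this decision concrete is to compare the failover window against a yearly downtime budget. The sketch below assumes a 365-day year. The availability targets used with it (such as 99.9% or 99.99%) are illustrative figures, not requirements stated in this document.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumes a 365-day year


def downtime_budget_seconds(availability_percent):
    """Seconds of downtime per year allowed by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def failovers_within_budget(availability_percent, failover_seconds):
    """How many failovers of a given length fit in the yearly budget."""
    budget = downtime_budget_seconds(availability_percent)
    return int(budget // failover_seconds)
```

Under these assumptions, a 99.99% target leaves roughly 52 minutes of downtime per year, which 60-second failovers would use up after about 52 events, before counting any other outage.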