Scalability
10.1 Horizontal Scaling
Add more instances behind a load balancer. This is the primary scaling strategy for stateless services.
Stateless means no session data is stored on the server instance: all state lives in the database, a cache, or the client.
User → Load Balancer → [Instance 1]
                     → [Instance 2]
                     → [Instance 3]
Auto-scaling: scale based on CPU, memory, or request queue depth. Kubernetes HPA (Horizontal Pod Autoscaler) does this automatically.
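As a sketch, a Kubernetes HPA manifest that scales on average CPU utilization (the Deployment name `api`, replica bounds, and the 70% target are illustrative, not prescriptive):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Scaling on queue depth or request rate instead requires an external/custom metric rather than the built-in `Resource` type.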
10.2 Vertical Scaling
Give a single instance more CPU or RAM. Simple, but it has a hard ceiling: once you reach the largest available machine, you must scale horizontally anyway. Do not rely on vertical scaling alone for production systems.
10.3 Bottlenecks
Common bottlenecks under load:
| Bottleneck | Symptom | Solution |
|---|---|---|
| Database queries | High DB CPU, slow queries | Add indexes, query optimization, replicas |
| Connection pool exhausted | "too many connections" error | Increase pool size, use PgBouncer |
| N+1 queries | DB latency grows with list size | DataLoader / batch queries |
| CPU-bound processing | High CPU, slow responses | Cache results, offload to queue |
| Network | High latency to DB / upstream | Move services closer, CDN, pooling |
| Memory leaks | Memory grows over time | Profile, fix, add memory limits |
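The pool-exhaustion row can be made concrete. A minimal fixed-size pool sketch, using sqlite3 in-memory connections as stand-ins for Postgres connections (PgBouncer does the same kind of limiting at the protocol level; all names here are illustrative):

```python
import queue
import sqlite3


class ConnectionPool:
    """Fixed-size pool: hand out at most max_size connections."""

    def __init__(self, max_size: int):
        self._pool = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            # Stand-in for a real Postgres connection.
            self._pool.put(sqlite3.connect(":memory:"))

    def acquire(self):
        try:
            return self._pool.get(block=False)
        except queue.Empty:
            # The real-world symptom: "too many connections".
            raise RuntimeError("pool exhausted") from None

    def release(self, conn):
        self._pool.put(conn)


pool = ConnectionPool(max_size=2)
c1 = pool.acquire()
c2 = pool.acquire()
# A third acquire would fail here; releasing a connection frees a slot.
pool.release(c1)
c3 = pool.acquire()
```

A production pool would also block with a timeout instead of failing immediately, and validate connections before reuse.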
N+1 query pattern — what it looks like
GET /posts → SELECT * FROM posts (1 query)
For each post → SELECT * FROM users WHERE id = ? (N queries)
Fix: one joined query or DataLoader batch.
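A sketch of the joined-query fix with sqlite3 (table and column names are illustrative): the N per-post author lookups collapse into one query whose cost no longer grows with the list size.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'first'), (2, 2, 'second'), (3, 1, 'third');
""")


# N+1: one query for the posts, then one query per post for its author.
def authors_n_plus_one():
    posts = db.execute("SELECT id, user_id FROM posts ORDER BY id").fetchall()
    return [
        db.execute("SELECT name FROM users WHERE id = ?", (uid,)).fetchone()[0]
        for _, uid in posts
    ]


# Fix: a single joined query, regardless of list size.
def authors_joined():
    rows = db.execute(
        "SELECT u.name FROM posts p JOIN users u ON u.id = p.user_id ORDER BY p.id"
    )
    return [name for (name,) in rows]


assert authors_n_plus_one() == authors_joined() == ["ada", "bob", "ada"]
```

A DataLoader takes the other route: it keeps the per-item calls but coalesces them into one `WHERE id IN (...)` batch behind the scenes.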
10.4 WebSocket Scaling
WebSocket connections are stateful: each connection stays pinned to one server for its lifetime, so individual messages cannot be freely load-balanced across instances the way stateless HTTP requests can.
Sticky sessions
LB routes the same client to the same server using IP hash or session cookie.
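A sketch of IP-hash routing (the server list and function names are illustrative): hash the client IP to a stable server index, so the same client always lands on the same server.

```python
import hashlib

SERVERS = ["server-1", "server-2", "server-3"]  # illustrative


def route(client_ip: str) -> str:
    # Use a stable hash: Python's built-in hash() is randomized per process,
    # which would break stickiness across LB restarts.
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SERVERS)
    return SERVERS[index]


# The same client IP always maps to the same server.
assert route("203.0.113.7") == route("203.0.113.7")
```

Note the modulo has a second weakness beyond hot spots: adding or removing a server remaps most clients, which is why real load balancers often use consistent hashing instead.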
Problem: if one server has many heavy users, it gets overloaded while others are idle.
Pub/Sub for cross-server messaging
Client A (Server 1) sends message
→ Server 1 publishes to Redis Pub/Sub
→ Server 2 and Server 3 receive from Redis
→ They forward to their connected clients
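The flow above can be sketched with an in-process broker standing in for Redis Pub/Sub (with redis-py you would call `publish` and `pubsub().subscribe` against a real Redis instead; all class and channel names here are illustrative):

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for Redis Pub/Sub: fan out to all subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self._subscribers[channel]:
            callback(message)


class WsServer:
    """One WebSocket server; it only holds its own clients' connections."""

    def __init__(self, name, broker):
        self.name = name
        self.delivered = []  # stand-in for "send over each local socket"
        broker.subscribe("chat", self.on_broker_message)

    def on_broker_message(self, message):
        self.delivered.append(message)

    def client_sends(self, broker, message):
        # Publish via the broker so every server can forward to its clients.
        broker.publish("chat", message)


broker = Broker()
s1, s2, s3 = (WsServer(f"server-{i}", broker) for i in (1, 2, 3))
s1.client_sends(broker, "hello")  # Client A is connected to server 1
assert s2.delivered == s3.delivered == ["hello"]
```

The design point: no server needs to know where a recipient is connected; each server forwards broker messages to whichever clients it happens to hold.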
| Broker | Best for |
|---|---|
| Redis Pub/Sub | Simple, fast, no persistence |
| Kafka | Persistent, ordered, high volume |
| NATS | Very low latency, lightweight |
10.5 Multi-region
Serve users from the nearest region to minimize latency.
Geo routing
DNS or load balancer routes requests to the nearest region.
- Cloudflare: anycast routing.
- AWS Route 53: latency-based routing.
Data replication
- Read replicas per region (PostgreSQL streaming replication).
- Writes go to the primary region — cross-region write latency: 100–300 ms.
- For global writes: use CRDTs or eventual consistency patterns.
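A sketch of the CRDT idea using a grow-only counter (G-counter), one of the simplest CRDTs: each region increments only its own slot, and merge takes element-wise maxima, so replicas converge to the same value regardless of merge order. Region names are illustrative.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per region/replica."""

    def __init__(self, region: str):
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1):
        # Each replica only ever writes its own slot: no write conflicts.
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other: "GCounter"):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge no matter when or how often they sync.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


us, eu = GCounter("us-east"), GCounter("eu-west")
us.increment(3)   # writes accepted locally, no cross-region round trip
eu.increment(2)
us.merge(eu)      # replicas exchange state asynchronously
eu.merge(us)
assert us.value() == eu.value() == 5
```

This trades consistency semantics for local write latency: both regions accept writes immediately and agree after the next sync, which is exactly the eventual-consistency pattern referenced above.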
Rule: decide which data must be globally consistent vs. eventually consistent per domain before choosing a replication strategy.
Key Rules
- Design services as stateless from the start. Retrofitting is expensive.
- Identify DB bottlenecks before adding instances — extra instances do not fix slow queries.
- Use PgBouncer or equivalent connection pooling in front of every database.
- For WebSocket scale-out, always use a Pub/Sub broker; never rely on sticky sessions alone.
- Measure before scaling: use APM (Datadog, Grafana) to find the real bottleneck.