SLA / SLO
Definitions
SLA — Service Level Agreement
An external contract with customers or partners. Defines the minimum acceptable performance. Breach has legal or financial consequences.
"We guarantee 99.9% uptime and p95 response time < 500ms"
SLO — Service Level Objective
An internal target set by engineering. It must be stricter than the SLA because it is the team's goal, not the floor.
SLA commitment: p95 < 500ms
SLO target: p95 < 300ms ← engineers aim here
If the SLO is breached, the team acts before the SLA is violated.
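The relationship between the two thresholds can be sketched in a few lines. This is only an illustration using the 300ms and 500ms figures from the example above; the `Status` enum and `classify_p95` function are hypothetical names, not part of any library.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    SLO_BREACH = "slo_breach"  # internal target missed, SLA still met
    SLA_BREACH = "sla_breach"  # external commitment missed


# Thresholds taken from the example above.
SLO_P95_MS = 300.0
SLA_P95_MS = 500.0


def classify_p95(p95_ms: float) -> Status:
    """Classify a measured p95 latency against the SLO and the SLA."""
    if p95_ms >= SLA_P95_MS:
        return Status.SLA_BREACH
    if p95_ms >= SLO_P95_MS:
        return Status.SLO_BREACH
    return Status.OK
```

The band between 300ms and 500ms is where the team has time to react: the SLO is already breached, but the contract is not.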
SLI — Service Level Indicator
The measurement used to evaluate SLO compliance.
| SLI | What It Measures |
|---|---|
| Request latency (p95) | Speed |
| Error rate | Reliability |
| Availability | Uptime |
| Throughput | Capacity |
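As a rough sketch of how three of these SLIs might be derived from raw request samples: the `Request` record, the nearest-rank percentile method, and the function names below are all assumptions made for illustration. Availability is left out because it depends on how uptime is measured.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Request:
    latency_ms: float
    ok: bool  # True if the request succeeded


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; pct is given on a 0-100 scale."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Compute latency, error-rate, and throughput SLIs for one time window."""
    latencies = [r.latency_ms for r in requests]
    failures = sum(1 for r in requests if not r.ok)
    return {
        "p95_ms": percentile(latencies, 95),
        "error_rate": failures / len(requests),
        "throughput_rps": len(requests) / window_s,
    }
```

Monitoring systems often use interpolated or histogram-based percentiles instead of nearest-rank, so values may differ slightly from what a given tool reports.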
Practical SLO Examples
| System | SLO |
|---|---|
| REST API (read) | p95 < 200ms, error rate < 0.1% |
| REST API (write) | p95 < 500ms, error rate < 0.5% |
| Background job | p99 < 30s, success rate > 99% |
| WebSocket stream | Connection drop rate < 0.1% |
| Search endpoint | p95 < 300ms, p99 < 1000ms |
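Targets like these can be kept as data so that the same definitions drive dashboards, alerts, and tests. The sketch below encodes the two REST API rows; the `Slo` class and its field names are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    p95_ms: float          # p95 latency must be below this value
    max_error_rate: float  # error rate must be below this fraction

    def violations(self, p95_ms: float, error_rate: float) -> list[str]:
        """Return a message for each target the measurements miss."""
        found = []
        if p95_ms >= self.p95_ms:
            found.append(f"{self.name}: p95 {p95_ms}ms >= {self.p95_ms}ms")
        if error_rate >= self.max_error_rate:
            found.append(
                f"{self.name}: error rate {error_rate:.2%} >= {self.max_error_rate:.2%}"
            )
        return found


# Values copied from the table above.
READ_API = Slo("REST API (read)", p95_ms=200.0, max_error_rate=0.001)
WRITE_API = Slo("REST API (write)", p95_ms=500.0, max_error_rate=0.005)
```

An empty list from `violations` means the measurements meet every target for that system.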
Error Budget
Error budget = the share of requests allowed to miss the SLO (100% minus the SLO target) before action is required.
SLO target: 99.9% of requests < 300ms
Error budget: 0.1% of requests can exceed 300ms
At 1000 RPS: 1 request/second can be slow
When the error budget is consumed, stop feature work and fix reliability.
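The arithmetic above can be written out as a small helper. This is a minimal sketch that assumes a request-based budget, counted over whatever window the team chooses; the function names are illustrative.

```python
def allowed_bad_requests(total_requests: int, slo_target: float) -> float:
    """Number of requests allowed to miss the SLO: total * (1 - target)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return total_requests * (1.0 - slo_target)


def budget_remaining(total_requests: int, bad_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative once it is exhausted."""
    allowed = allowed_bad_requests(total_requests, slo_target)
    return 1.0 - bad_requests / allowed
```

With a 99.9% target, `allowed_bad_requests(1000, 0.999)` comes out at roughly 1, matching the one slow request per second at 1000 RPS. A negative result from `budget_remaining` is the signal to pause feature work.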
Using SLOs in Load Tests
Define SLO pass/fail criteria in Locust:
from locust import events
from locust.env import Environment


@events.quitting.add_listener
def assert_slo(environment: Environment, **kwargs: object) -> None:
    # Aggregated statistics across all requests in the test run.
    stats = environment.runner.stats.total
    p95 = stats.get_response_time_percentile(0.95)
    error_rate = stats.fail_ratio

    # A non-zero exit code makes the CI job fail.
    if p95 > 300:
        environment.process_exit_code = 1
        print(f"SLO BREACH: p95={p95}ms exceeds 300ms threshold")
    if error_rate > 0.01:
        environment.process_exit_code = 1
        print(f"SLO BREACH: error_rate={error_rate:.1%} exceeds 1%")
This makes performance tests act as automated SLO gates in CI/CD.
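In a CI pipeline it can help to read the thresholds from the environment instead of hard-coding them, so different stages can apply different limits. The sketch below is one possible approach; the variable names `SLO_P95_MS` and `SLO_MAX_ERROR_RATE` are assumptions, not Locust settings, and the defaults mirror the 300ms and 1% values used in the listener.

```python
import os
from typing import Mapping, Tuple

# Defaults mirror the thresholds hard-coded in the listener.
DEFAULT_P95_MS = 300.0
DEFAULT_MAX_ERROR_RATE = 0.01


def load_thresholds(env: Mapping[str, str] = os.environ) -> Tuple[float, float]:
    """Read SLO thresholds from (hypothetical) environment variables."""
    p95_ms = float(env.get("SLO_P95_MS", DEFAULT_P95_MS))
    max_error_rate = float(env.get("SLO_MAX_ERROR_RATE", DEFAULT_MAX_ERROR_RATE))
    if p95_ms <= 0:
        raise ValueError("SLO_P95_MS must be positive")
    if not 0.0 <= max_error_rate <= 1.0:
        raise ValueError("SLO_MAX_ERROR_RATE must be between 0 and 1")
    return p95_ms, max_error_rate
```

The listener could call `load_thresholds()` once and compare `p95` and `error_rate` against the returned pair.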