SLIs, SLOs, and SLAs
Reliability needs precise definitions to be useful. Saying a system should be "highly available" means different things to different people. SLIs, SLOs, and SLAs provide a framework for defining exactly what reliability means and how to measure it.
The Three Levels
Think of these as a hierarchy from measurement to commitment:
SLI (Service Level Indicator) is what you measure. It's a specific metric that reflects user experience — like request latency, error rate, or availability percentage.
SLO (Service Level Objective) is your target for that measurement. It defines "good enough" — for example, "99.9% of requests complete successfully."
SLA (Service Level Agreement) is a contract with consequences. It's a promise to customers with penalties if you fail — like refunds or service credits.
Not every service needs an SLA, but every service benefits from SLIs and SLOs.
Choosing Good SLIs
The best SLIs reflect what users actually experience. Four categories cover most needs:
Availability answers "Is the service up?" This might be the percentage of successful health checks or the ratio of successful requests to total requests.
Latency answers "How fast are responses?" Measure percentiles rather than averages: p95 latency is the value that 95% of requests complete at or under, so only the slowest 5% exceed it. It reveals problems that averages hide.
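To make the difference between a percentile and an average concrete, here is a minimal sketch. The sample data is hypothetical, and `percentile` is an illustrative helper that uses the nearest-rank method; monitoring systems often interpolate or estimate from histograms instead.

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical sample: 90 fast requests and 10 slow ones, in milliseconds.
latencies_ms = [50] * 90 + [2000] * 10

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p95={p95_ms} ms")
```

In this sample the mean and the median both look tolerable, while the p95 shows that at least one request in twenty takes two seconds.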
Throughput answers "How many requests can we handle?" This matters for capacity planning and detecting overload conditions.
Error rate answers "What percentage of requests fail?" Distinguish between client errors (user mistakes) and server errors (your problems).
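As a rough illustration of request-based SLIs, the sketch below derives availability and error rates from HTTP status codes. The function name `request_slis` and the sample data are hypothetical. Treating 4xx responses as "good" for availability is an assumption that suits many services but not all.

```python
from collections import Counter

def request_slis(status_codes):
    """Compute request-based SLIs from a list of HTTP status codes."""
    total = len(status_codes)
    if total == 0:
        raise ValueError("no requests in the measurement window")
    # Group codes by class: 2 for 2xx, 4 for 4xx, 5 for 5xx, and so on.
    classes = Counter(code // 100 for code in status_codes)
    client_errors = classes[4]
    server_errors = classes[5]
    return {
        # Assumption: only server errors count against availability.
        "availability": (total - server_errors) / total,
        "client_error_rate": client_errors / total,
        "server_error_rate": server_errors / total,
    }

# Hypothetical window of 1,000 requests.
codes = [200] * 970 + [404] * 20 + [500] * 7 + [503] * 3
slis = request_slis(codes)
print(slis)
```

Reporting the two error rates separately keeps a spike in user mistakes from being read as a server-side outage.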
Setting Appropriate SLOs
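SLOs should be ambitious but achievable. A 99.999% availability target allows only about 26 seconds of downtime in a 30-day month, while 99.9% allows about 43 minutes. Setting the stricter target when you can only realistically achieve 99.9% creates frustration, and a target that is missed every month stops guiding decisions.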
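The arithmetic behind an availability target is worth seeing once. This sketch converts a target percentage into the downtime it permits; the helper name `allowed_downtime_minutes` and the 30-day default window are assumptions made for illustration.

```python
def allowed_downtime_minutes(target_percent, days=30):
    """Minutes of full downtime a target permits over a window of `days` days."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - target_percent) / 100

for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} min ({minutes * 60:.0f} s) per 30 days")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why the gap between 99.9% and 99.999% is so large in practice.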
Consider this example:
SLI: Request latency (p95)
SLO: p95 latency < 200ms for 99.9% of the month
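This means that in 99.9% of the measurement windows in the month (for example, one-minute intervals), 95% of requests should complete in under 200 milliseconds. The remaining 0.1% is your error budget, about 43 minutes in a 30-day month: room for maintenance, deployments, and unexpected issues.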
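One way to evaluate an SLO like this is to split the period into fixed windows, compute p95 for each, and count how many windows meet the threshold. The sketch below assumes that approach; the names `p95` and `slo_report`, the window layout, and the sample data are all hypothetical. It also assumes every window contains at least one request.

```python
import math

def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("window has no requests")
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

def slo_report(windows, threshold_ms=200, target=0.999):
    """Evaluate a windowed latency SLO.

    windows: list of per-window latency samples (milliseconds).
    A window is "good" when its p95 is below threshold_ms.
    """
    if not windows:
        raise ValueError("no windows to evaluate")
    total = len(windows)
    good = sum(1 for w in windows if p95(w) < threshold_ms)
    bad = total - good
    return {
        "compliance": good / total,
        "bad_windows": bad,
        # Error budget expressed as the number of windows allowed to be bad.
        "allowed_bad_windows": round(total * (1 - target), 6),
        "slo_met": good / total >= target,
    }

# Hypothetical data: 1,000 windows of 20 requests each, 2 of them slow.
fast_window = [100] * 20
slow_window = [100] * 18 + [500] * 2
report = slo_report([fast_window] * 998 + [slow_window] * 2)
print(report)
```

With 1,000 windows the budget is a single bad window, so two slow windows exhaust it and the SLO is missed for the period.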
SLAs and Business Commitments
SLAs turn SLOs into contractual obligations. They typically include:
- The specific metrics and thresholds
- How compliance is measured
- Consequences for missing targets (usually service credits)
- Exclusions (scheduled maintenance, customer-caused issues)
An SLA might state: "If monthly availability falls below 99.9%, customers receive a 10% credit on their bill."
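The example clause above can be sketched as a small calculation. Everything here is an illustrative assumption: the function names, the choice to subtract excluded minutes (such as scheduled maintenance) from counted downtime, and the single 10% credit tier. Real contracts define these details explicitly and often use several tiers.

```python
def monthly_availability_percent(total_minutes, downtime_minutes, excluded_minutes=0):
    """Availability for the month, ignoring downtime covered by exclusions."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    counted_downtime = max(downtime_minutes - excluded_minutes, 0)
    return 100 * (1 - counted_downtime / total_minutes)

def sla_credit(availability_percent, monthly_bill, threshold_percent=99.9, credit_rate=0.10):
    """Credit owed when availability falls below the SLA threshold, else 0."""
    if availability_percent < threshold_percent:
        return round(monthly_bill * credit_rate, 2)
    return 0.0

# Hypothetical 30-day month: 90 minutes down, 30 of them scheduled maintenance.
month_minutes = 30 * 24 * 60
availability = monthly_availability_percent(month_minutes, 90, excluded_minutes=30)
credit = sla_credit(availability, monthly_bill=500.0)
print(f"availability={availability:.3f}%, credit={credit}")
```

Here the 60 counted minutes exceed the roughly 43 minutes that 99.9% allows, so the credit applies even after the maintenance window is excluded.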
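Set SLAs looser than your internal SLOs. Your SLO might target 99.95% while your SLA promises 99.9%. In a 30-day month, that is roughly 22 minutes of allowed downtime internally versus 43 minutes contractually, and the gap protects against penalties during difficult months.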