Error Budgets

Error budgets transform reliability from a vague goal into a concrete resource you can spend. Instead of arguing about whether it's "safe" to deploy, teams can look at objective data. This simple concept resolves one of the oldest tensions in software — the conflict between moving fast and staying stable.

The Error Budget Concept

If your SLO is 99.9% availability, you're accepting that 0.1% unavailability is okay. That 0.1% is your error budget — the amount of unreliability you've decided users can tolerate.

Let's calculate what this means in practice:

SLO: 99.9% availability
Error budget: 0.1% downtime

Per month (30 days):
0.1% × 30 days × 24 hours × 60 minutes = 43.2 minutes

You can "spend" about 43 minutes of downtime in a 30-day month.
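The same arithmetic can be sketched in a few lines of Python so it is easy to rerun for other targets. The function name error_budget_minutes and the 30-day default are assumptions made for illustration; they do not come from any particular tool.

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    Hypothetical helper: the name and the 30-day default are illustrative.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    total_minutes = period_days * 24 * 60          # minutes in the period
    budget_fraction = (100 - slo_percent) / 100    # e.g. 0.001 for 99.9%
    return total_minutes * budget_fraction


# Compare a few common availability targets over a 30-day month.
for slo in (99.0, 99.9, 99.99):
    print(f"SLO {slo}% allows {error_budget_minutes(slo):.1f} minutes of downtime")
```

Each additional "nine" cuts the budget by a factor of ten, which is why the choice of SLO matters so much for how freely a team can ship.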

This budget covers everything: planned maintenance, deployments that cause brief outages, unexpected failures, and incidents. Once it's spent, it's gone until the next measurement period.

Spending Your Budget

Error budgets get consumed by any reliability impact. A deployment that causes 5 minutes of errors uses 5 minutes of budget. An incident that takes 20 minutes to resolve uses 20 minutes. Planned maintenance during low-traffic hours still counts.
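One way to picture this accounting is a small ledger that records each event and subtracts it from the monthly budget. This is a minimal sketch, not a real monitoring integration: the BudgetLedger class is hypothetical, and the 10 minutes of planned maintenance is an invented figure used only to show that maintenance counts too.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetLedger:
    """Hypothetical ledger of downtime charged against an error budget."""
    total_minutes: float
    events: list = field(default_factory=list)  # (description, minutes) pairs

    def record(self, description: str, minutes: float) -> None:
        """Charge an event's downtime against the budget."""
        if minutes < 0:
            raise ValueError("minutes must be non-negative")
        self.events.append((description, minutes))

    @property
    def spent_minutes(self) -> float:
        return sum(minutes for _, minutes in self.events)

    @property
    def remaining_minutes(self) -> float:
        return self.total_minutes - self.spent_minutes


ledger = BudgetLedger(total_minutes=43.2)       # 99.9% SLO over 30 days
ledger.record("deployment errors", 5)
ledger.record("incident", 20)
ledger.record("planned maintenance", 10)        # assumed duration, for illustration
print(f"Spent {ledger.spent_minutes} min, {ledger.remaining_minutes:.1f} min remaining")
```

Every entry is treated the same way regardless of cause, which mirrors the point above: the budget does not distinguish planned from unplanned unreliability.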

The key insight is that unused error budget represents opportunity. If you consistently end the month with budget remaining, you could probably be shipping features faster. A service that never spends any of its budget is likely over-investing in stability at the expense of progress.

Budget-Based Decisions

Error budgets create clear decision criteria:

Budget > 50% remaining: Ship freely
Budget 25-50% remaining: Ship with caution
Budget < 25% remaining: Focus on reliability
Budget exhausted: Freeze non-critical changes
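The thresholds above can be encoded as a small lookup so the decision is applied consistently. This is a sketch under stated assumptions: the function name release_policy is hypothetical, and treating exactly 25% and exactly 50% as "ship with caution" is one reading of the ranges, since the list does not say which side the boundaries fall on.

```python
def release_policy(remaining_fraction: float) -> str:
    """Map the fraction of error budget remaining to a shipping policy.

    Hypothetical helper; boundary handling at 25% and 50% is an assumption.
    """
    if remaining_fraction <= 0:
        return "Freeze non-critical changes"
    if remaining_fraction < 0.25:
        return "Focus on reliability"
    if remaining_fraction <= 0.50:
        return "Ship with caution"
    return "Ship freely"


# Example: 30 minutes left out of a 43.2-minute monthly budget.
remaining = 30 / 43.2
print(f"{remaining:.0%} remaining: {release_policy(remaining)}")
```

Keeping the thresholds in one place makes it easy for a team to adjust them later without renegotiating the policy in every deployment discussion.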

When budget runs low, the team shifts priorities. Instead of shipping new features, engineers work on reliability improvements — better monitoring, automated recovery, or fixing fragile components.

This isn't punishment. It's a rational response to data showing the system needs attention before it can safely absorb more change.

Aligning Team Incentives

Error budgets solve the classic conflict between development and operations. Without them, developers want to ship constantly while operations wants to freeze everything. Both positions are reasonable given their different responsibilities.

With error budgets, everyone shares the same goal: spend the budget wisely. Developers care about reliability because poor reliability blocks their features. Operations supports shipping because unused budget means missed opportunities.

The budget creates a shared language. "We have 30 minutes of budget left" is more useful than "I feel nervous about this deployment."
