Developer Reference

SRE Handbook 2026: SLOs, Error Budgets, Runbooks Guide

Dillip Chowdary
Tech Entrepreneur & Innovator · April 09, 2026 · 12 min read

Site reliability engineering gets operationally useful when three things are explicit: the user-facing objective, the allowed failure budget, and the exact runbook the responder follows at 3 a.m. This reference is designed as a broad 2026 desk-side guide for working engineers, not a theory chapter.

Core Takeaway

An SLO is a promise, an error budget is the cost envelope for breaking it, and an incident runbook is the minimum path from detection to stable recovery. If one of the three is vague, the whole reliability program turns noisy and political.

Quick Reference

The entries below summarize the formulas, commands, alert patterns, and runbook elements covered in depth later in this guide.

SLO and SLI

An SLI is the measured signal. An SLO is the target for that signal over a time window.

Error Budget

For a 99.9% availability target over 30 days, the allowed failure is 0.1%, or about 43.2 minutes.

Burn-Rate Alerting

Alert on how fast the budget is being spent, not on every isolated spike. This reduces alert fatigue and aligns pages with risk.

Runbook Minimums

Every incident doc needs triggers, checks, rollback steps, owners, dependencies, and a clear stop condition.

Artifact Hygiene

Mask customer payloads before sharing screenshots, logs, or postmortem snippets. The Data Masking Tool is useful for redacting examples fast.

SLO Basics

Start with user journeys, not infrastructure counters. A queue depth spike matters only if it degrades something a user experiences, such as successful checkout, API latency, or page render completeness.

  • Availability SLI: fraction of valid requests that return an acceptable response.
  • Latency SLI: fraction of requests finishing below a threshold such as 300 ms or 1 s.
  • Quality SLI: fraction of outputs that meet correctness or freshness expectations.
  • Durability SLI: fraction of writes preserved and retrievable within the recovery promise.

Useful 2026 rule of thumb: avoid setting more than one page-worthy SLO per critical journey unless each target drives a distinct operational action. Multiple SLOs on the same path often create duplicate pages with no extra insight.

Formula
error_budget = 1 - slo_target
allowed_bad_events = total_events * error_budget
allowed_downtime = window_duration * error_budget
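As a worked check of the formulas above, here is a minimal sketch in Python; the request count is hypothetical, and the window is the 30-day month used throughout this guide.

```python
# Worked example of the error budget formulas (30-day window assumed).
SLO_TARGET = 0.999             # 99.9% availability
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window, in minutes
TOTAL_EVENTS = 10_000_000      # hypothetical monthly request count

error_budget = 1 - SLO_TARGET                     # budget fraction: 0.001
allowed_bad_events = TOTAL_EVENTS * error_budget  # about 10,000 requests
allowed_downtime = WINDOW_MINUTES * error_budget  # about 43.2 minutes

print(f"budget fraction:    {error_budget:.4f}")
print(f"allowed bad events: {allowed_bad_events:,.0f}")
print(f"allowed downtime:   {allowed_downtime:.1f} min")
```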

Error Budget Math

The budget makes tradeoffs concrete. Teams can spend it on controlled risk like releases, migrations, or experiments, but if the budget is exhausted, feature velocity should slow until reliability recovers.

  • 99.0% monthly availability allows about 7h 12m of downtime.
  • 99.9% monthly availability allows about 43.2m.
  • 99.95% monthly availability allows about 21.6m.
  • 99.99% monthly availability allows about 4.32m.

Burn rate answers a different question: if the current failure rate continues, how quickly will the team consume the entire budget? That is why it is more actionable than a raw error threshold.
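That question can be sketched directly in code. This is a minimal model, assuming a 30-day window; the error rates passed in are hypothetical.

```python
# Sketch: hours until the error budget is gone if the current rate holds.
def hours_to_exhaustion(error_rate: float, slo_target: float,
                        window_hours: float = 30 * 24,
                        budget_spent: float = 0.0) -> float:
    """Return hours until the remaining error budget is fully consumed."""
    budget_fraction = 1 - slo_target
    burn_rate = error_rate / budget_fraction   # 1.0 = spending exactly on budget
    remaining = (1 - budget_spent) * window_hours
    return remaining / burn_rate

# A 1% error rate against a 99.9% SLO is a 10x burn: the entire
# 30-day budget would last roughly three days.
print(hours_to_exhaustion(error_rate=0.01, slo_target=0.999))
```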

PromQL Example
# 5xx ratio over 5 minutes
sum(rate(http_requests_total{job='api',status=~'5..'}[5m]))
/
sum(rate(http_requests_total{job='api'}[5m]))

# Conceptual burn rate
current_error_rate / error_budget_fraction

For paging, a multi-window multi-burn-rate approach remains a practical default: one fast window to catch acute failures and one slower window to catch smoldering regressions. Keep the math transparent enough that the on-call engineer can verify it during an incident.
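A simplified sketch of that two-window check, assuming the 14.4x/6x thresholds from the example config later in this guide; real policies usually also pin each burn-rate threshold to a specific window length, which is omitted here.

```python
# Simplified two-window paging decision (thresholds are example policy).
def should_page(short_error_rate: float, long_error_rate: float,
                slo_target: float = 0.999,
                short_threshold: float = 14.4,
                long_threshold: float = 6.0) -> bool:
    budget = 1 - slo_target
    short_burn = short_error_rate / budget
    long_burn = long_error_rate / budget
    # Page only when BOTH windows burn hot: the short window catches
    # acute failures, the long window filters out brief spikes.
    return short_burn >= short_threshold and long_burn >= long_threshold

print(should_page(short_error_rate=0.02, long_error_rate=0.008))  # True
print(should_page(short_error_rate=0.02, long_error_rate=0.002))  # False
```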

Incident Runbooks

A runbook is not a prose essay. It is an operational decision path. If a responder has to infer the next step from tribal knowledge, the runbook is incomplete.

Minimum Structure

  • Trigger: what page, threshold, or symptom starts this runbook.
  • Scope Check: which services, tenants, or regions are affected.
  • Safety Checks: what must be confirmed before making changes.
  • Mitigation Steps: actions in the correct order, including rollback.
  • Escalation: when to bring in platform, database, security, or vendor owners.
  • Exit Criteria: what evidence proves the system is stable enough to close the incident.

Runbook Template
title: API elevated 5xx
severity: SEV-2
trigger:
  alert: api_high_burn_rate
  threshold: 14.4x budget burn over 5m
scope_check:
  - confirm affected regions
  - confirm affected endpoints
  - compare canary vs stable
mitigation:
  - pause active rollout
  - shift traffic away from failing region
  - rollback last config change
verification:
  - 5xx ratio below SLO threshold for 15m
  - p95 latency back within target
escalation:
  - database on-call if error source is storage
  - security on-call if suspicious traffic pattern appears
exit_criteria:
  - customer impact stopped
  - alert cleared
  - follow-up issue created

Keep evidence capture separate from the main mitigation path. During a live incident, responders should not waste time formatting logs. If you need to share payload samples in chat or a postmortem, redact them first with the Data Masking Tool.
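When no dedicated tool is at hand, even a small scripted redaction pass helps. The patterns below are illustrative assumptions, not the Data Masking Tool's behavior; adapt them to the payload shapes your service actually emits.

```python
import re

# Minimal redaction pass for log lines before they leave the incident
# channel. Patterns are illustrative, not exhaustive.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b\d{13,19}\b"), "<card?>"),  # long digit runs
    (re.compile(r"(?i)(authorization: Bearer )\S+"), r"\1<token>"),
]

def redact(line: str) -> str:
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line

print(redact("user=alice@example.com authorization: Bearer eyJhbGci..."))
```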

Commands by Purpose

Discover

Discovery Commands
kubectl get deploy,po -A
kubectl top pod -A
kubectl get events -A --sort-by=.lastTimestamp
curl -sS https://status.example.internal/health
jq '.status,.version' service-status.json

Diagnose

Diagnostic Commands
kubectl logs deploy/api -n prod --since=15m
kubectl describe pod api-12345 -n prod
promtool query instant http://prometheus:9090 'up{job="api"}'
ss -tulpn
journalctl -u api.service --since '15 minutes ago'

Mitigate

Mitigation Commands
kubectl rollout pause deploy/api -n prod
kubectl rollout undo deploy/api -n prod
kubectl scale deploy/api -n prod --replicas=12
aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch file://failover.json

Recover and Verify

Recovery Commands
kubectl rollout status deploy/api -n prod
watch -n 10 'curl -fsS https://api.example.com/ready || echo not-ready'
promtool query range http://prometheus:9090 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))' --start=START --end=END --step=30s

If you publish shell snippets internally, clean them up for readability before sharing. A formatting pass with the Code Formatter makes operational docs easier to scan under pressure.

Configuration

Configuration should tie together the measurement layer, alerting policy, and ownership model. Keep labels aligned across metrics, dashboards, alerts, and paging routes, or incident response will slow down on basic lookup work.

Example SLO Config
service: checkout-api
journey: create-order
sli:
  type: availability
  good_events: http_requests_total{status!~"5.."}
  total_events: http_requests_total
slo:
  target: 0.999
  window: 30d
alerting:
  page_on:
    - short_window_burn_rate: 14.4
    - long_window_burn_rate: 6
ownership:
  team: commerce-platform
  slack: '#commerce-oncall'
  runbook: https://internal.example/runbooks/checkout-api

  • Use one canonical service name everywhere.
  • Define excluded traffic explicitly, such as synthetic probes or staff-only endpoints.
  • Record the runbook URL inside the alert payload.
  • Route pages by owning team, not by who last changed the code.

Advanced Usage

Good Event Modeling

Counting all non-5xx responses as success is often too blunt. In 2026 production environments, many teams now treat semantic failures such as empty search results from a broken index or stale recommendation payloads as bad events when they materially degrade the journey.
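A sketch of such a classifier follows; the field names (status, result_count, index_healthy, payload_age_s) and the staleness threshold are hypothetical, stand-ins for whatever your service actually logs.

```python
# Sketch: counting semantic failures as bad events, not just 5xx.
MAX_PAYLOAD_AGE_S = 300  # staleness threshold (assumed)

def is_good_event(event: dict) -> bool:
    if event["status"] >= 500:
        return False  # classic availability failure
    if event.get("result_count") == 0 and event.get("index_healthy") is False:
        return False  # empty results caused by a broken index
    if event.get("payload_age_s", 0) > MAX_PAYLOAD_AGE_S:
        return False  # stale recommendation payload
    return True

events = [
    {"status": 200, "result_count": 12},
    {"status": 200, "result_count": 0, "index_healthy": False},
    {"status": 200, "payload_age_s": 900},
    {"status": 503},
]
good = sum(is_good_event(e) for e in events)
print(f"availability SLI: {good}/{len(events)} = {good / len(events):.2f}")
```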

Composite Views

For multi-service flows, keep user-facing SLOs separate from team-local component SLOs. The user only experiences the composed path. Engineering teams, however, still need component-level objectives to isolate which dependency is consuming the budget.
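For a serial call chain, the composed user-facing availability is approximately the product of the component availabilities, assuming independent failures (an optimistic assumption in practice). The component names and numbers below are hypothetical.

```python
# Composed availability of a serial call chain (independence assumed).
components = {
    "edge": 0.9995,
    "checkout-api": 0.999,
    "payments": 0.9995,
    "db": 0.9999,
}

composed = 1.0
for availability in components.values():
    composed *= availability

print(f"composed availability: {composed:.5f}")
```

Note that four components that each look healthy on their own already compose to roughly 99.79%, below a 99.9% user-facing target; this is why the composed path needs its own SLO.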

Release Policy

Connect deployment policy to budget state. Examples include blocking high-risk launches when monthly burn exceeds a threshold or requiring additional approvers when the rolling seven-day error budget is nearly exhausted.
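A sketch of such a gate; the 80% threshold and the decision labels are example policy choices, not a standard.

```python
# Sketch: budget-aware release gate (thresholds are example policy).
def release_decision(budget_spent: float, risk: str) -> str:
    """budget_spent: fraction of the window's error budget already consumed."""
    if budget_spent >= 1.0:
        return "freeze"          # budget exhausted: reliability work only
    if budget_spent >= 0.8 and risk == "high":
        return "block"           # high-risk launches wait for recovery
    if budget_spent >= 0.8:
        return "extra-approval"  # nearly exhausted: require more reviewers
    return "allow"

print(release_decision(0.35, "high"))  # allow
print(release_decision(0.85, "high"))  # block
print(release_decision(1.10, "low"))   # freeze
```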

Runbook Quality Checks

  • Each mitigation step should be reversible.
  • Every branch should identify the next observable check.
  • Escalation criteria should be objective, not social.
  • Exit criteria should name a time window, not just a green dashboard.
