Defining SLOs and Writing Burn Rate Alerts in Prometheus
Turn a reliability goal into an alerting rule. This tutorial shows you how to express an SLO as a Prometheus query, calculate burn rates, and write multiwindow alerts that page you before users notice.
Before you begin
- Prometheus running with application metrics
- Basic PromQL knowledge
- A way to load alerting rules: the PrometheusRule CRD (kube-prometheus-stack) or plain Prometheus rule files
Most teams write threshold alerts: "page me if error rate > 5%." The problem: a 1% error rate sustained for three days consumes your entire monthly error budget, but never fires the alert. Burn rate alerts fix this.
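The arithmetic behind that claim, for a 99.9% SLO:

```
error budget:      1 - 0.999 = 0.1% of requests over 30 days
1% error rate:     10x the 0.1% allowance
budget exhausted:  30 days / 10 = 3 days, with the 5% threshold never crossed
```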
The Concepts
SLO (Service Level Objective): A target for reliability. Example: 99.9% of requests succeed over 30 days.
Error budget: The allowable failure. 99.9% SLO = 0.1% errors = 43.2 minutes of downtime, or 100 bad requests per 100,000, in 30 days.
Burn rate: How fast you're consuming the error budget. Burn rate 1 means you'll exhaust the budget exactly at the end of the window. Burn rate 10 means you'll exhaust it in 1/10th the time (3 days for a 30-day budget).
Why multiwindow: A spike that lasts 2 minutes at burn rate 100 consumes about 0.46% of a 30-day budget — real but not page-worthy. A sustained burn rate of 5 for 6 hours is serious. Comparing a short and a long window separates spikes from sustained degradation.
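The budget consumed by any incident follows directly from its burn rate and duration (a 30-day window is 43,200 minutes):

```
budget consumed = burn_rate x (incident duration / SLO window)

2 min at burn rate 100:   100  x (2 / 43200)   ≈ 0.46%   (noise)
1 h  at burn rate 14.4:   14.4 x (60 / 43200)  = 2%      (page)
6 h  at burn rate 5:      5    x (360 / 43200) ≈ 4.2%    (serious)
```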
Step 1: Define Your SLI (Service Level Indicator)
Start with what you're measuring. For an HTTP service, availability SLI:
```promql
# SLI: ratio of successful requests
sum(rate(http_requests_total{status_code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

For an HTTP service, latency SLI (P99 < 500ms):
```promql
# Ratio of requests under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
```

Pick one to start. Most teams begin with availability.
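One caveat before moving on: histogram bucket boundaries are fixed at instrumentation time, so the `le="0.5"` bucket must already be exported for the latency SLI to work. A quick existence check, assuming the same metric name as above:

```promql
# Non-empty only if a 0.5s bucket exists for this histogram
count(http_request_duration_seconds_bucket{le="0.5"})
```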
Step 2: Express the SLO
For a 99.9% availability SLO over 30 days:
- Error rate threshold: 1 - 0.999 = 0.001 (0.1%)
- Error budget in seconds: 30 * 24 * 3600 * 0.001 = 2592 seconds ≈ 43 minutes
- Error rate at burn rate 1: 0.001, the SLO threshold itself (see the query sketch below)
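To measure burn rate directly, divide the observed error ratio by the budget ratio. A sketch using the raw counters from Step 1 (same label names as above):

```promql
# Instantaneous burn rate over the last hour:
#   1  = consuming budget exactly at the sustainable rate
#   >1 = budget runs out before the 30-day window ends
(
  sum(rate(http_requests_total{status_code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / 0.001
```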
Step 3: Calculate Burn Rates for Alert Windows
The Google SRE Workbook recommends pairing a short and a long window for each alert tier. The two most common pairs:
| Alert tier | Short window | Long window | Burn rate | Budget consumed |
|---|---|---|---|---|
| Critical (page) | 5m | 1h | 14.4x | 2% in 1h |
| Warning (ticket) | 30m | 6h | 6x | 5% in 6h |
At burn rate 14.4 with a 99.9% SLO:
- Error rate = 0.001 × 14.4 = 1.44%
- You'd exhaust the 30-day budget in 30/14.4 ≈ 2 days
The 5m/1h pair fires when you have a current spike (5m) that's sustained (1h), preventing false positives from brief spikes.
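The burn-rate thresholds in the table fall out of the budget-consumption targets. With a 30-day window (720 hours):

```
burn_rate = budget_fraction_consumed x (SLO_window / long_window)

critical: 0.02 x (720 h / 1 h) = 14.4
warning:  0.05 x (720 h / 6 h) = 6
```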
Step 4: Write the Alert Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slo
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo-my-app-availability
      interval: 30s
      rules:

        # --- Recording rules for reuse ---

        # Error ratio over 5 minutes
        - record: job:http_error_ratio:rate5m
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="production"}[5m]))

        # Error ratio over 30 minutes
        - record: job:http_error_ratio:rate30m
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{namespace="production"}[30m]))

        # Error ratio over 1 hour
        - record: job:http_error_ratio:rate1h
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{namespace="production"}[1h]))

        # Error ratio over 6 hours
        - record: job:http_error_ratio:rate6h
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{namespace="production"}[6h]))

        # --- Alert rules ---

        # Critical: fast burn — 2% of monthly budget in 1 hour
        # Fires when 5m burn AND 1h burn are both above threshold
        - alert: AvailabilitySLOBurnRateCritical
          expr: |
            job:http_error_ratio:rate5m > (14.4 * 0.001)
            and
            job:http_error_ratio:rate1h > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            slo: availability
          annotations:
            summary: "SLO burn rate critical — fast error budget exhaustion"
            description: |
              Error rate {{ $value | humanizePercentage }} over the 5m window,
              more than 14.4x the sustainable burn rate. At this pace the
              30-day error budget is exhausted in roughly 2 days.
              Runbook: https://wiki.internal/runbooks/availability-slo

        # Warning: slow burn — 5% of monthly budget in 6 hours
        - alert: AvailabilitySLOBurnRateWarning
          expr: |
            job:http_error_ratio:rate30m > (6 * 0.001)
            and
            job:http_error_ratio:rate6h > (6 * 0.001)
          for: 15m
          labels:
            severity: warning
            slo: availability
          annotations:
            summary: "SLO burn rate elevated — error budget draining"
            description: |
              Error rate {{ $value | humanizePercentage }} over the 30m window,
              more than 6x the sustainable burn rate. At this pace the 30-day
              error budget is exhausted in roughly 5 days.
```

(Note: Prometheus annotation templates don't support arithmetic functions like `div`, so the time-to-exhaustion figures are stated per tier rather than computed from `$value`.)

Apply it:

```bash
kubectl apply -f slo-rules.yaml
```

Step 5: Track Remaining Error Budget
Add a recording rule that computes the remaining budget as a ratio:
```yaml
        # Remaining error budget (ratio) over a 30-day rolling window
        - record: job:slo_error_budget_remaining:ratio
          expr: |
            1 - (
              (
                sum(increase(http_requests_total{namespace="production",status_code=~"5.."}[30d]))
                /
                sum(increase(http_requests_total{namespace="production"}[30d]))
              ) / 0.001
            )
```

A value of 1.0 means full budget. 0 means exhausted. Negative means you've overshot. Note that the 30d range needs at least 30 days of retention; Prometheus defaults to 15d.
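To spot-check the recorded value, query it through the HTTP API (assuming Prometheus is reachable on localhost:9090, as in Step 7):

```bash
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=job:slo_error_budget_remaining:ratio'
```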
Alert when budget is nearly exhausted:
```yaml
        - alert: ErrorBudgetNearlyExhausted
          expr: job:slo_error_budget_remaining:ratio < 0.1
          labels:
            severity: warning
          annotations:
            summary: "Error budget below 10%"
            description: "Only {{ $value | humanizePercentage }} of the 30-day error budget remains"
```

Step 6: Build a Grafana SLO Dashboard
Panel 1: Current error rate vs. SLO threshold
```promql
# Error rate (red line)
job:http_error_ratio:rate5m

# SLO threshold (green dashed line) — add as a constant at 0.001
```

Panel 2: Burn rate over time
```promql
job:http_error_ratio:rate1h / 0.001
```

Add thresholds at 1 (burn rate 1 = sustainable), 6 (warning), and 14.4 (critical).
Panel 3: Remaining error budget
```promql
job:slo_error_budget_remaining:ratio * 100
```

Stat panel with color thresholds: green > 50%, yellow > 10%, red ≤ 10%.
Step 7: Validate the Rules
Trigger a brief error spike to test:
```bash
# If you have a test endpoint that returns 500s
for i in $(seq 1 100); do
  curl -s -o /dev/null http://my-app.production.svc.cluster.local/error
done
```

Watch the burn rate panels in Grafana. The critical alert has a `for: 2m` clause, so it won't fire for a 10-second spike — that's intentional.
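You can also lint the rule syntax offline. promtool check rules expects a plain rule file with groups at the top level, while the CRD nests them under spec, so extract that part first (yq is an assumption here, not part of the stack above):

```bash
# Pull the loaded rules back out of the cluster and lint them
kubectl get prometheusrule my-app-slo -n production -o yaml \
  | yq '.spec' > /tmp/slo-rules-spec.yaml
promtool check rules /tmp/slo-rules-spec.yaml
```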
Check Prometheus alerts at http://localhost:9090 → Alerts. You should see the rules in INACTIVE state normally, and PENDING/FIRING during a real degradation.
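promtool can also unit-test the alert logic against synthetic series, with no real traffic at all. A minimal sketch using the rule file extracted above (the series shapes here are illustrative):

```yaml
# slo-rules-test.yaml (run with: promtool test rules slo-rules-test.yaml)
rule_files:
  - /tmp/slo-rules-spec.yaml
evaluation_interval: 30s
tests:
  - interval: 1m
    input_series:
      # 2 errors + 98 successes per minute = a steady 2% error rate,
      # above the 1.44% critical threshold on both windows
      - series: 'http_requests_total{namespace="production",status_code="500"}'
        values: '0+2x120'
      - series: 'http_requests_total{namespace="production",status_code="200"}'
        values: '0+98x120'
    alert_rule_test:
      - eval_time: 1h5m
        alertname: AvailabilitySLOBurnRateCritical
        exp_alerts:
          - exp_labels:
              severity: critical
              slo: availability
```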
Choosing SLO Targets
Don't start with 99.99%. Start with what you can actually measure historically:
```promql
# What was your actual availability over the last 30 days?
sum(increase(http_requests_total{namespace="production",status_code!~"5.."}[30d]))
/
sum(increase(http_requests_total{namespace="production"}[30d]))
```

If you've been at 99.7%, set the SLO at 99.5% for the first quarter. Tighten it as you improve reliability.
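The same ratio over a shorter window shows whether you're trending toward or away from that target:

```promql
# Availability over the last 7 days, for comparison with the 30-day number
sum(increase(http_requests_total{namespace="production",status_code!~"5.."}[7d]))
/
sum(increase(http_requests_total{namespace="production"}[7d]))
```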
Official References
- Site Reliability Engineering — Service Level Objectives — The original SRE book chapter on SLOs, SLIs, and error budgets (free online)
- The Alerting on SLOs chapter (Workbook) — Google's SRE Workbook on burn rate alerts, the multi-window approach, and alert calibration
- Prometheus Alerting Rules — Official Prometheus docs for writing alert rules, the `for` duration, and labels
- OpenSLO Specification — A vendor-neutral open standard for defining SLOs as code
- Sloth — A tool for generating Prometheus SLO recording rules and alerts from a simple spec