Defining SLOs and Writing Burn Rate Alerts in Prometheus
Turn a reliability goal into an alerting rule. This tutorial shows you how to express an SLO as a Prometheus query, calculate burn rates, and write multiwindow alerts that page you before users notice.
Before you begin
- Prometheus running with application metrics
- Basic PromQL knowledge
- A way to load alerting rules: the PrometheusRule CRD (kube-prometheus-stack) or plain Prometheus rule files
Most teams write threshold alerts: "page me if error rate > 5%." The problem: a 1% error rate sustained for three days consumes your entire monthly error budget, but never fires the alert. Burn rate alerts fix this.
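The arithmetic behind that claim, for a 99.9% SLO:

```
error budget:      1 - 0.999 = 0.1% of requests over 30 days
1% error rate:     10x the 0.1% allowance
budget exhausted:  30 days / 10 = 3 days, with the 5% threshold never crossed
```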
The Concepts
SLO (Service Level Objective): A target for reliability. Example: 99.9% of requests succeed over 30 days.
Error budget: The allowable failure. 99.9% SLO = 0.1% errors = 43.2 minutes of downtime, or 100 bad requests per 100,000, in 30 days.
Burn rate: How fast you're consuming the error budget. Burn rate 1 means you'll exhaust the budget exactly at the end of the window. Burn rate 10 means you'll exhaust it in 1/10th the time (3 days for a 30-day budget).
Why multiwindow: A spike that lasts 2 minutes at burn rate 100 consumes about 0.46% of a 30-day budget — real but not page-worthy. A sustained burn rate of 5 for 6 hours is serious. Comparing a short and a long window separates spikes from sustained degradation.
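The budget consumed by any incident follows directly from its burn rate and duration (a 30-day window is 43,200 minutes):

```
budget consumed = burn_rate x (incident duration / SLO window)

2 min at burn rate 100:   100  x (2 / 43200)   ≈ 0.46%   (noise)
1 h  at burn rate 14.4:   14.4 x (60 / 43200)  = 2%      (page)
6 h  at burn rate 5:      5    x (360 / 43200) ≈ 4.2%    (serious)
```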
Step 1: Define Your SLI (Service Level Indicator)
Start with what you're measuring. For an HTTP service, availability SLI:
```promql
# SLI: ratio of successful requests
sum(rate(http_requests_total{status_code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

For an HTTP service, latency SLI (P99 < 500ms):
```promql
# Ratio of requests under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
```

Pick one to start. Most teams begin with availability.
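One caveat before moving on: histogram bucket boundaries are fixed at instrumentation time, so the `le="0.5"` bucket must already be exported for the latency SLI to work. A quick existence check, assuming the same metric name as above:

```promql
# Non-empty only if a 0.5s bucket exists for this histogram
count(http_request_duration_seconds_bucket{le="0.5"})
```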
Step 2: Express the SLO
For a 99.9% availability SLO over 30 days:
- Error rate threshold: 1 - 0.999 = 0.001 (0.1%)
- Error budget in seconds: 30 * 24 * 3600 * 0.001 = 2592 seconds ≈ 43 minutes
- Error rate at burn rate 1: 0.001, the SLO threshold itself (see the query sketch below)
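To measure burn rate directly, divide the observed error ratio by the budget ratio. A sketch using the raw counters from Step 1 (same label names as above):

```promql
# Instantaneous burn rate over the last hour:
#   1  = consuming budget exactly at the sustainable rate
#   >1 = budget runs out before the 30-day window ends
(
  sum(rate(http_requests_total{status_code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / 0.001
```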
Step 3: Calculate Burn Rates for Alert Windows
The Google SRE Workbook recommends pairing a short and a long window for each alert tier. The two most common pairs:
| Alert tier | Short window | Long window | Burn rate | Budget consumed |
|---|---|---|---|---|
| Critical (page) | 5m | 1h | 14.4x | 2% in 1h |
| Warning (ticket) | 30m | 6h | 6x | 5% in 6h |
At burn rate 14.4 with a 99.9% SLO:
- Error rate = 0.001 × 14.4 = 1.44%
- You'd exhaust the 30-day budget in 30/14.4 ≈ 2 days
The 5m/1h pair fires when you have a current spike (5m) that's sustained (1h), preventing false positives from brief spikes.
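The burn-rate thresholds in the table fall out of the budget-consumption targets. With a 30-day window (720 hours):

```
burn_rate = budget_fraction_consumed x (SLO_window / long_window)

critical: 0.02 x (720 h / 1 h) = 14.4
warning:  0.05 x (720 h / 6 h) = 6
```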
Step 4: Write the Alert Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slo
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo-my-app-availability
      interval: 30s
      rules:

        # --- Recording rules for reuse ---

        # Error ratio over 5 minutes
        - record: job:http_error_ratio:rate5m
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="production"}[5m]))

        # Error ratio over 30 minutes
        - record: job:http_error_ratio:rate30m
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{namespace="production"}[30m]))

        # Error ratio over 1 hour
        - record: job:http_error_ratio:rate1h
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{namespace="production"}[1h]))

        # Error ratio over 6 hours
        - record: job:http_error_ratio:rate6h
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{namespace="production"}[6h]))

        # --- Alert rules ---

        # Critical: fast burn — 2% of monthly budget in 1 hour
        # Fires when 5m burn AND 1h burn are both above threshold
        - alert: AvailabilitySLOBurnRateCritical
          expr: |
            job:http_error_ratio:rate5m > (14.4 * 0.001)
            and
            job:http_error_ratio:rate1h > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            slo: availability
          annotations:
            summary: "SLO burn rate critical — fast error budget exhaustion"
            description: |
              Error rate {{ $value | humanizePercentage }} over the 5m window,
              more than 14.4x the sustainable burn rate. At this pace the
              30-day error budget is exhausted in roughly 2 days.
              Runbook: https://wiki.internal/runbooks/availability-slo

        # Warning: slow burn — 5% of monthly budget in 6 hours
        - alert: AvailabilitySLOBurnRateWarning
          expr: |
            job:http_error_ratio:rate30m > (6 * 0.001)
            and
            job:http_error_ratio:rate6h > (6 * 0.001)
          for: 15m
          labels:
            severity: warning
            slo: availability
          annotations:
            summary: "SLO burn rate elevated — error budget draining"
            description: |
              Error rate {{ $value | humanizePercentage }} over the 30m window,
              more than 6x the sustainable burn rate. At this pace the 30-day
              error budget is exhausted in roughly 5 days.
```

(Note: Prometheus annotation templates don't support arithmetic functions like `div`, so the time-to-exhaustion figures are stated per tier rather than computed from `$value`.)

Apply it:

```bash
kubectl apply -f slo-rules.yaml
```

Step 5: Track Remaining Error Budget
Add a recording rule that computes the remaining budget as a ratio:
```yaml
        # Remaining error budget (ratio) over a 30-day rolling window
        - record: job:slo_error_budget_remaining:ratio
          expr: |
            1 - (
              (
                sum(increase(http_requests_total{namespace="production",status_code=~"5.."}[30d]))
                /
                sum(increase(http_requests_total{namespace="production"}[30d]))
              ) / 0.001
            )
```

A value of 1.0 means full budget. 0 means exhausted. Negative means you've overshot. Note that the 30d range needs at least 30 days of retention; Prometheus defaults to 15d.
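To spot-check the recorded value, query it through the HTTP API (assuming Prometheus is reachable on localhost:9090, as in Step 7):

```bash
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=job:slo_error_budget_remaining:ratio'
```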
Alert when budget is nearly exhausted:
```yaml
        - alert: ErrorBudgetNearlyExhausted
          expr: job:slo_error_budget_remaining:ratio < 0.1
          labels:
            severity: warning
          annotations:
            summary: "Error budget below 10%"
            description: "Only {{ $value | humanizePercentage }} of the 30-day error budget remains"
```

Step 6: Build a Grafana SLO Dashboard
Panel 1: Current error rate vs. SLO threshold
```promql
# Error rate (red line)
job:http_error_ratio:rate5m

# SLO threshold (green dashed line) — add as a constant at 0.001
```

Panel 2: Burn rate over time
```promql
job:http_error_ratio:rate1h / 0.001
```

Add thresholds at 1 (burn rate 1 = sustainable), 6 (warning), and 14.4 (critical).
Panel 3: Remaining error budget
```promql
job:slo_error_budget_remaining:ratio * 100
```

Stat panel with color thresholds: green > 50%, yellow > 10%, red ≤ 10%.
Step 7: Validate the Rules
Trigger a brief error spike to test:
```bash
# If you have a test endpoint that returns 500s
for i in $(seq 1 100); do
  curl -s -o /dev/null http://my-app.production.svc.cluster.local/error
done
```

Watch the burn rate panels in Grafana. The critical alert has a `for: 2m` clause, so it won't fire for a 10-second spike — that's intentional.
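You can also lint the rule syntax offline. promtool check rules expects a plain rule file with groups at the top level, while the CRD nests them under spec, so extract that part first (yq is an assumption here, not part of the stack above):

```bash
# Pull the loaded rules back out of the cluster and lint them
kubectl get prometheusrule my-app-slo -n production -o yaml \
  | yq '.spec' > /tmp/slo-rules-spec.yaml
promtool check rules /tmp/slo-rules-spec.yaml
```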
Check Prometheus alerts at http://localhost:9090 → Alerts. You should see the rules in INACTIVE state normally, and PENDING/FIRING during a real degradation.
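promtool can also unit-test the alert logic against synthetic series, with no real traffic at all. A minimal sketch using the rule file extracted above (the series shapes here are illustrative):

```yaml
# slo-rules-test.yaml (run with: promtool test rules slo-rules-test.yaml)
rule_files:
  - /tmp/slo-rules-spec.yaml
evaluation_interval: 30s
tests:
  - interval: 1m
    input_series:
      # 2 errors + 98 successes per minute = a steady 2% error rate,
      # above the 1.44% critical threshold on both windows
      - series: 'http_requests_total{namespace="production",status_code="500"}'
        values: '0+2x120'
      - series: 'http_requests_total{namespace="production",status_code="200"}'
        values: '0+98x120'
    alert_rule_test:
      - eval_time: 1h5m
        alertname: AvailabilitySLOBurnRateCritical
        exp_alerts:
          - exp_labels:
              severity: critical
              slo: availability
```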
Choosing SLO Targets
Don't start with 99.99%. Start with what you can actually measure historically:
```promql
# What was your actual availability over the last 30 days?
sum(increase(http_requests_total{namespace="production",status_code!~"5.."}[30d]))
/
sum(increase(http_requests_total{namespace="production"}[30d]))
```

If you've been at 99.7%, set the SLO at 99.5% for the first quarter. Tighten it as you improve reliability.
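The same ratio over a shorter window shows whether you're trending toward or away from that target:

```promql
# Availability over the last 7 days, for comparison with the 30-day number
sum(increase(http_requests_total{namespace="production",status_code!~"5.."}[7d]))
/
sum(increase(http_requests_total{namespace="production"}[7d]))
```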
Official References
- Site Reliability Engineering — Service Level Objectives — The original SRE book chapter on SLOs, SLIs, and error budgets (free online)
- The Alerting on SLOs chapter (Workbook) — Google's SRE Workbook on burn rate alerts, the multi-window approach, and alert calibration
- Prometheus Alerting Rules — Official Prometheus docs for writing alert rules, the `for` duration, and labels
- OpenSLO Specification — A vendor-neutral open standard for defining SLOs as code
- Sloth — A tool for generating Prometheus SLO recording rules and alerts from a simple spec