Monitoring Kubernetes with Prometheus and Grafana
Deploy the kube-prometheus-stack with Helm, understand what it collects out of the box, build a dashboard for your application, and set up your first alert rule — all in under an hour.
Before you begin
- A running Kubernetes cluster
- Helm 3 installed
- kubectl configured
- At least 2 CPU and 4Gi memory available in the cluster
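A quick sanity check before installing (these commands only confirm the prerequisites above; adjust nothing if the output looks healthy):

kubectl get nodes          # cluster is reachable and nodes are Ready
helm version --short       # should report v3.x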
You don't need to configure Prometheus from scratch. The kube-prometheus-stack Helm chart deploys Prometheus, Grafana, Alertmanager, and a set of pre-built dashboards and alert rules that cover the entire Kubernetes stack — nodes, pods, deployments, PVCs, and more.
This tutorial gets you from zero to a working monitoring stack, then shows you how to add your own application metrics.
Step 1: Install the kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=changeme \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=20Gi

This deploys:
- Prometheus (metrics collection and storage)
- Grafana (dashboards and visualisation)
- Alertmanager (alert routing and deduplication)
- kube-state-metrics (exposes Kubernetes object state as metrics)
- node-exporter (exposes host-level metrics: CPU, memory, disk, network)
Wait for everything to start:
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=180s
kubectl get pods -n monitoring

Step 2: Access Grafana
Forward Grafana's port locally:
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

Open http://localhost:3000. Log in with admin / changeme.
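If you skipped the adminPassword override (or changed it later), the generated password normally lives in the release's Grafana Secret:

kubectl get secret kube-prometheus-stack-grafana -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d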
Navigate to Dashboards → Browse. You'll find 30+ pre-built dashboards:
- Kubernetes / Cluster — overall cluster health
- Kubernetes / Nodes — per-node CPU, memory, disk
- Kubernetes / Pods — per-pod resource usage
- Kubernetes / Workloads — deployment/daemonset/statefulset status
Step 3: Access Prometheus
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring

Open http://localhost:9090. This is Prometheus's built-in query UI.
Try a few queries:
# CPU usage per pod (5-minute average)
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])

# Memory usage per pod
container_memory_working_set_bytes{namespace="production"}

# Number of ready replicas per deployment
kube_deployment_status_replicas_ready{namespace="production"}

# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])

Step 4: Instrument Your Application
To expose custom metrics from your application, use a Prometheus client library.
Node.js (prom-client):
const express = require('express');
const client = require('prom-client');

const app = express();
const register = new client.Registry();

// Counter: total HTTP requests
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status_code'],
  registers: [register]
});

// Histogram: request duration
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [register]
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Instrument requests (register this before your application routes)
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({
    method: req.method,
    route: req.path
  });
  res.on('finish', () => {
    httpRequestsTotal.inc({ method: req.method, status_code: res.statusCode });
    end({ status_code: res.statusCode });
  });
  next();
});

Go (prometheus/client_golang):
import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests",
    }, []string{"method", "status_code"})

    httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request duration",
        Buckets: []float64{0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5},
    }, []string{"method", "route"})
)

// In your main():
http.Handle("/metrics", promhttp.Handler())
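The Go snippet above only wires up the /metrics endpoint; recording values inside handlers isn't shown in the original. A minimal sketch of one way to do it, assuming the two metrics declared above (the instrument wrapper and statusRecorder type are illustrative names, and this additionally needs "strconv" in the import block):

// instrument wraps a handler so every request updates the counter and histogram.
func instrument(route string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // Time the request against the histogram declared above.
        timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, route))
        defer timer.ObserveDuration()

        // Wrap the ResponseWriter so the status code can be read afterwards.
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next(rec, r)

        httpRequestsTotal.WithLabelValues(r.Method, strconv.Itoa(rec.status)).Inc()
    }
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

Register routes through it, e.g. http.Handle("/orders", instrument("/orders", ordersHandler)), where ordersHandler stands in for whatever handler you already have.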
Step 5: Tell Prometheus to Scrape Your App
Create a ServiceMonitor — a CRD that kube-prometheus-stack uses to configure scraping:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: production
  labels:
    release: kube-prometheus-stack   # Must match the Helm release label
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: http
    path: /metrics
    interval: 15s

kubectl apply -f servicemonitor.yaml

Your application's Service must have a port named http (or whatever you specify in endpoints.port). Verify Prometheus is scraping it at http://localhost:9090 → Status → Targets.
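For reference, a matching Service might look like this (the app name, namespace, and port number are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: production
  labels:
    app: my-app            # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app            # pods backing this Service
  ports:
  - name: http             # the named port referenced by endpoints.port
    port: 8080
    targetPort: 8080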
Step 6: Build a Grafana Dashboard for Your App
In Grafana, click the + icon → Dashboard → Add visualization.
Panel 1: Request rate
sum(rate(http_requests_total{namespace="production"}[2m])) by (status_code)

Set visualization type: Time series. Set legend to {{status_code}}.
Panel 2: P95 latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{namespace="production"}[5m])) by (le, route)
)

Panel 3: Error rate (5xx)
sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[2m]))
/
sum(rate(http_requests_total{namespace="production"}[2m]))

Set threshold to 0.01 (1% error rate = red).
Save the dashboard. Click Share → Export → save the JSON to your repo to version it.
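If you'd rather provision the dashboard declaratively, kube-prometheus-stack enables a Grafana sidecar by default that watches for ConfigMaps labelled grafana_dashboard and loads the JSON they contain; a sketch, assuming the exported file is my-app-dashboard.json:

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"      # label the sidecar watches for
data:
  my-app-dashboard.json: |
    { ...paste the exported dashboard JSON here... }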
Step 7: Create an Alert Rule
Alert when error rate exceeds 1% for 5 minutes:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: my-app
    interval: 30s
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{namespace="production"}[5m]))
        > 0.01
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High 5xx error rate on my-app"
        description: "Error rate is {{ $value | humanizePercentage }} — investigate logs"

    - alert: PodCrashLooping
      expr: |
        increase(kube_pod_container_status_restarts_total{namespace="production"}[1h]) > 3
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"

kubectl apply -f prometheus-rules.yaml

Check the rule at http://localhost:9090 → Alerts. It should appear in the INACTIVE state (not firing yet).
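If the rule never shows up there, confirm the object exists and carries the release label Prometheus selects on:

kubectl get prometheusrules -n production --show-labels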
Step 8: Configure Alertmanager
By default, Alertmanager doesn't route alerts anywhere. Configure Slack notifications:
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager   # alertmanager-<Alertmanager object name>
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

    route:
      group_by: [alertname, namespace]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: slack-alerts
      routes:
      - match:
          severity: critical
        receiver: slack-critical

    receivers:
    - name: slack-alerts
      slack_configs:
      - channel: "#alerts"
        title: "{{ .CommonAnnotations.summary }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

    - name: slack-critical
      slack_configs:
      - channel: "#oncall"
        title: "CRITICAL: {{ .CommonAnnotations.summary }}"

kubectl apply -f alertmanager-config.yaml
# Restart Alertmanager to pick it up
kubectl rollout restart statefulset/alertmanager-kube-prometheus-stack-alertmanager -n monitoring

Persistent Storage in Production
The storageSpec in Step 1 creates a PersistentVolumeClaim for Prometheus. For Grafana, add:
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set grafana.persistence.enabled=true \
--set grafana.persistence.size=5Gi \
  --reuse-values

Without persistent storage, your dashboards and alert history disappear on pod restart.
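Both claims should then show up as Bound:

kubectl get pvc -n monitoring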
Official References
- Prometheus Documentation — Official Prometheus docs: data model, PromQL, scrape configuration, and alerting
- kube-prometheus-stack Helm Chart — The standard Helm chart for deploying Prometheus, Alertmanager, and Grafana together
- Grafana Documentation — Grafana docs: dashboard building, variables, alerts, and data source configuration
- Prometheus Operator — How ServiceMonitor and PodMonitor CRDs work for declarative scrape configuration
- PromQL Basics — Official PromQL reference covering selectors, functions, and aggregation operators
We built Podscape to simplify Kubernetes workflows like this — logs, events, and cluster state in one interface, without switching tools.
Struggling with this in production?
We help teams fix these exact issues. Our engineers have deployed these patterns across production environments at scale.