DevOps & Platform

Monitoring Kubernetes with Prometheus and Grafana

Beginner · 45 min to complete · 11 min read

Deploy the kube-prometheus-stack with Helm, understand what it collects out of the box, build a dashboard for your application, and set up your first alert rule — all in under an hour.

Before you begin

  • A running Kubernetes cluster
  • Helm 3 installed
  • kubectl configured
  • At least 2 CPU and 4Gi memory available in the cluster

Tags: Prometheus, Grafana, Kubernetes, Monitoring, Observability

You don't need to configure Prometheus from scratch. The kube-prometheus-stack Helm chart deploys Prometheus, Grafana, Alertmanager, and a set of pre-built dashboards and alert rules that cover the entire Kubernetes stack — nodes, pods, deployments, PVCs, and more.

This tutorial gets you from zero to a working monitoring stack, then shows you how to add your own application metrics.

Step 1: Install the kube-prometheus-stack

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=changeme \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=20Gi

This deploys:

  • Prometheus (metrics collection and storage)
  • Grafana (dashboards and visualisation)
  • Alertmanager (alert routing and deduplication)
  • kube-state-metrics (exposes Kubernetes object state as metrics)
  • node-exporter (exposes host-level metrics: CPU, memory, disk, network)

Wait for everything to start:

bash
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=180s
kubectl get pods -n monitoring

Step 2: Access Grafana

Forward Grafana's port locally:

bash
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

Open http://localhost:3000. Log in with admin / changeme.

Navigate to Dashboards → Browse. You'll find 30+ pre-built dashboards:

  • Kubernetes / Cluster — overall cluster health
  • Kubernetes / Nodes — per-node CPU, memory, disk
  • Kubernetes / Pods — per-pod resource usage
  • Kubernetes / Workloads — deployment/daemonset/statefulset status

Step 3: Access Prometheus

bash
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring

Open http://localhost:9090. This is Prometheus's built-in query UI.

Try a few queries:

promql
# CPU usage per pod (5-minute average)
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])

# Memory usage per pod
container_memory_working_set_bytes{namespace="production"}

# Number of ready replicas per deployment
kube_deployment_status_replicas_ready{namespace="production"}

# Pod restart count over the last hour
increase(kube_pod_container_status_restarts_total[1h])
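The function to understand here is rate(): it converts a monotonically increasing counter into a per-second average over the window, and it handles counter resets (a pod restart drops the counter back to zero). A minimal JavaScript sketch of the idea, simplified in that real Prometheus also extrapolates the increase to the exact window boundaries:

```javascript
// Sketch of PromQL rate() semantics: per-second average increase of a
// counter over a window, tolerating resets. Simplified: real Prometheus
// also extrapolates the increase to the exact window boundaries.
function simpleRate(samples) {
  // samples: [{ t: seconds, v: counterValue }, ...] inside the window
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter reset (e.g. pod restart); count the new value
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return increase / seconds;
}

// Four scrapes over 45s; the counter resets at t=30 (pod restarted)
const samples = [
  { t: 0, v: 100 },
  { t: 15, v: 130 },
  { t: 30, v: 10 },
  { t: 45, v: 40 },
];
console.log(simpleRate(samples)); // ≈ 1.56 (70 total increase / 45s)
```

This is why you graph rate(http_requests_total[5m]) rather than the raw counter: the counter itself only ever goes up.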

Step 4: Instrument Your Application

To expose custom metrics from your application, use a Prometheus client library.

Node.js (prom-client):

javascript
const express = require('express');
const client = require('prom-client');

const app = express();
const register = new client.Registry();

// Counter: total HTTP requests
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status_code'],
  registers: [register]
});

// Histogram: request duration
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [register]
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Instrument every request with middleware
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({
    method: req.method,
    route: req.path
  });
  res.on('finish', () => {
    httpRequestsTotal.inc({ method: req.method, status_code: res.statusCode });
    end({ status_code: res.statusCode });
  });
  next();
});
Go (prometheus/client_golang):

go
import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests",
    }, []string{"method", "status_code"})

    httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request duration in seconds",
        Buckets: []float64{0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5},
    }, []string{"method", "route"})
)

// In your main():
http.Handle("/metrics", promhttp.Handler())

Step 5: Tell Prometheus to Scrape Your App

Create a ServiceMonitor — a CRD that kube-prometheus-stack uses to configure scraping:

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: production
  labels:
    release: kube-prometheus-stack   # Must match the Helm release label
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
bash
kubectl apply -f servicemonitor.yaml

Your application's Service must have a port named http (or whatever you specify in endpoints.port). Verify Prometheus is scraping it at http://localhost:9090 → Status → Targets.
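For the ServiceMonitor to find anything, the Service it selects needs both the app: my-app label and a named port. A sketch of a matching Service, assuming a container listening on port 8080 (adjust the name, namespace, and ports to your app):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: production
  labels:
    app: my-app          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app          # pods backing this Service
  ports:
    - name: http         # must match endpoints.port in the ServiceMonitor
      port: 80
      targetPort: 8080   # assumption: your container's listen port
```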

Step 6: Build a Grafana Dashboard for Your App

In Grafana, click the + icon → Dashboard → Add visualization.

Panel 1: Request rate

promql
sum(rate(http_requests_total{namespace="production"}[2m])) by (status_code)

Set visualization type: Time series. Set legend to {{status_code}}.

Panel 2: P95 latency

promql
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{namespace="production"}[5m])) by (le, route)
)
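This query works because Prometheus histograms export cumulative counters per le ("less than or equal") bucket, and histogram_quantile finds the bucket containing the target rank, then interpolates linearly inside it. A simplified JavaScript sketch of that calculation (real PromQL applies it to bucket rates, not raw counts):

```javascript
// Sketch of how histogram_quantile estimates a quantile from cumulative
// "le" buckets: locate the bucket containing the target rank, then
// interpolate linearly between that bucket's lower and upper bounds.
function histogramQuantile(q, buckets) {
  // buckets: [{ le, count }] cumulative counts sorted by le ascending,
  // ending with { le: Infinity, count: total }
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0, prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      if (b.le === Infinity) return prevLe; // can't interpolate into +Inf
      const fraction = (rank - prevCount) / (b.count - prevCount);
      return prevLe + (b.le - prevLe) * fraction;
    }
    prevLe = b.le;
    prevCount = b.count;
  }
}

// 100 requests: 60 under 0.1s, 90 under 0.3s, all under 0.5s
const latencyBuckets = [
  { le: 0.1, count: 60 },
  { le: 0.3, count: 90 },
  { le: 0.5, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, latencyBuckets)); // ≈ 0.4s
```

The estimate is only as accurate as your bucket boundaries, which is why the buckets chosen in Step 4 cluster around the latencies you actually care about.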

Panel 3: Error rate (5xx)

promql
sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[2m]))
/
sum(rate(http_requests_total{namespace="production"}[2m]))

Set threshold to 0.01 (1% error rate = red).

Save the dashboard. Click Share → Export → save the JSON to your repo to version it.

Step 7: Create an Alert Rule

Alert when error rate exceeds 1% for 5 minutes:

yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: my-app
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="production"}[5m]))
            > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High 5xx error rate on my-app"
            description: "Error rate is {{ $value | humanizePercentage }} — investigate logs"

        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace="production"}[1h]) > 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
bash
kubectl apply -f prometheus-rules.yaml

Check the rule at http://localhost:9090 → Alerts. It should appear in the inactive state (not firing yet).
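The `for: 5m` field is what creates the intermediate pending state: the expression must evaluate true continuously for the full duration before the alert fires, and a single false evaluation resets the clock. A simplified sketch of that state machine (Prometheus re-evaluates the group's rules every `interval`, 30s here):

```javascript
// Sketch of Prometheus alert states under a `for:` duration: "inactive"
// while the expression is false, "pending" while true but not yet for the
// full duration, "firing" once continuously true long enough. Simplified.
function alertState(history, forSeconds, stepSeconds) {
  // history: one boolean per rule evaluation, stepSeconds apart
  let trueFor = 0;
  let state = 'inactive';
  for (const exprTrue of history) {
    if (!exprTrue) {
      trueFor = 0;            // any false evaluation resets the clock
      state = 'inactive';
    } else {
      trueFor += stepSeconds;
      state = trueFor >= forSeconds ? 'firing' : 'pending';
    }
  }
  return state;
}

// Evaluated every 30s with for: 5m (300s):
console.log(alertState([true, true, true], 300, 30));              // "pending"
console.log(alertState(Array(10).fill(true), 300, 30));            // "firing"
console.log(alertState([...Array(9).fill(true), false], 300, 30)); // "inactive"
```

This is why a brief error spike shorter than five minutes never pages anyone, which is usually what you want.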

Step 8: Configure Alertmanager

By default, Alertmanager doesn't route alerts anywhere. Configure Slack notifications:

yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

    route:
      group_by: [alertname, namespace]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: slack-alerts
      routes:
        - match:
            severity: critical
          receiver: slack-critical

    receivers:
      - name: slack-alerts
        slack_configs:
          - channel: "#alerts"
            title: "{{ .CommonAnnotations.summary }}"
            text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

      - name: slack-critical
        slack_configs:
          - channel: "#oncall"
            title: "CRITICAL: {{ .CommonAnnotations.summary }}"
bash
kubectl apply -f alertmanager-config.yaml
# Restart Alertmanager to pick it up
kubectl rollout restart statefulset/alertmanager-kube-prometheus-stack-alertmanager -n monitoring

Persistent Storage in Production

The storageSpec in Step 1 creates a PersistentVolumeClaim for Prometheus. For Grafana, add:

bash
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=5Gi \
  --reuse-values

Without persistent storage, your dashboards and alert history disappear on pod restart.

Official References

  • Prometheus Documentation — Official Prometheus docs: data model, PromQL, scrape configuration, and alerting
  • kube-prometheus-stack Helm Chart — The standard Helm chart for deploying Prometheus, Alertmanager, and Grafana together
  • Grafana Documentation — Grafana docs: dashboard building, variables, alerts, and data source configuration
  • Prometheus Operator — How ServiceMonitor and PodMonitor CRDs work for declarative scrape configuration
  • PromQL Basics — Official PromQL reference covering selectors, functions, and aggregation operators

We built Podscape to simplify Kubernetes workflows like this — logs, events, and cluster state in one interface, without switching tools.

Struggling with this in production?

We help teams fix these exact issues. Our engineers have deployed these patterns across production environments at scale.