Kubernetes Cluster Upgrades Without Downtime: The Strategy That Actually Works
Upgrading Kubernetes clusters is one of those tasks that looks simple in the docs and bites you in production — here's the checklist and strategy I use to avoid surprises.

I've watched enough Kubernetes upgrade war stories to know the pattern: someone skips a minor version because they're two versions behind, or upgrades the control plane without checking for deprecated API usage, or discovers mid-upgrade that a critical add-on doesn't support the new version. The cluster limps along for days while the team figures out what broke.
Kubernetes upgrades don't have to be stressful. What makes them stressful is skipping the preparation phase. This post is that preparation phase, written as a concrete checklist and strategy rather than a conceptual overview.
Version Skew Policy: Why You Can't Skip Versions
Kubernetes has a strict version skew policy. Control plane components (kube-apiserver, kube-controller-manager, kube-scheduler) must stay within one minor version of each other, and the API server itself only supports upgrading one minor version at a time. The kubelet must not be newer than the API server and may lag it by up to two minor versions (three as of Kubernetes 1.28).
The consequence: if you're on 1.27 and want 1.30, you must upgrade through 1.28, 1.29, 1.30 — three separate upgrades. There is no shortcut.
This is not arbitrary. Each minor version can remove API versions that were deprecated in earlier releases, and the one-version-at-a-time path exists to give you time to migrate workloads off deprecated APIs before the release that removes them. Trying to skip versions means you'll encounter breaking changes you haven't prepared for.
The practical implication for a team that's two versions behind on a quarterly upgrade cycle: you're not doing one upgrade, you're doing three. Plan accordingly.
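If you are several versions behind, that means running the whole process below more than once. Here is a minimal sketch of the control plane hops on EKS, assuming a cluster named my-cluster; in practice you re-run the pre-upgrade checks and bring node groups along between each hop:
# Sequential control plane hops from 1.27 to 1.30; never skip a minor version
for version in 1.28 1.29 1.30; do
  aws eks update-cluster-version --name my-cluster --kubernetes-version "$version" --region us-east-1
  aws eks wait cluster-active --name my-cluster --region us-east-1
  # Re-run the deprecated-API scan and upgrade node groups here before the next hop
done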
Checking Your Current Skew
# Check control plane version
kubectl version

# Check node kubelet versions
kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'

# Check component versions
kubectl -n kube-system get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
If your nodes are already near the edge of the supported kubelet skew, upgrading the control plane will push them outside it, and out-of-skew kubelets misbehave in ways that are hard to debug. Fix node versions first.
Pre-Upgrade Checklist
This is the checklist I run through before every upgrade. Skip any item at your own risk.
1. Detect Deprecated API Usage
When Kubernetes removes an API version (e.g., extensions/v1beta1 Ingress removed in 1.22, batch/v1beta1 CronJob removed in 1.25), manifests using those APIs will fail to apply. Existing objects survive in etcd and are served through the replacement API version, but any client, chart, or controller still talking to the removed API path gets a 404 from the new API server.
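You can see this from the client side by requesting a resource at a fully qualified group/version. On a cluster that has already dropped batch/v1beta1, the first command below fails while the second still returns the same objects (this is just an illustration; the resource names are examples):
kubectl get cronjobs.v1beta1.batch -A    # fails once the API version is removed
kubectl get cronjobs.v1.batch -A         # the same objects, served via batch/v1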
Use pluto (simpler) or kubent (more comprehensive) to scan for deprecated API usage:
# Install pluto
brew install FairwindsOps/tap/pluto

# Scan your Helm releases
pluto detect-helm -o wide

# Scan a directory of manifests
pluto detect-files -d ./manifests

# Check what's removed in the target version
pluto list-versions --target-versions k8s=v1.30.0

# Install kubent
sh -c "$(curl -sSL https://git.io/install-kubent)"

# Run it — scans live cluster + Helm releases
kubent
Typical output from kubent:
>>> Deprecated APIs removed in 1.25 <<<
-------------------------------------------------------------------------------------------------
KIND                 NAMESPACE    NAME          API_VERSION       REPLACE_WITH                (SINCE)
CronJob              production   cleanup-job   batch/v1beta1     batch/v1                    (1.21.0)
PodSecurityPolicy    -            restricted    policy/v1beta1    Use PodSecurity admission   (1.21.0)
Fix every entry before proceeding. There are no exceptions to this — a single workload using a removed API will fail silently in ways that are hard to diagnose post-upgrade.
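For most entries the fix is mechanical: move the manifest to the replacement API version and re-apply. A sketch for the cleanup-job CronJob flagged above, where everything under spec is a placeholder and only the apiVersion actually needs to change:
apiVersion: batch/v1          # was batch/v1beta1
kind: CronJob
metadata:
  name: cleanup-job
  namespace: production
spec:
  schedule: "0 3 * * *"       # unchanged; only apiVersion moves
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: example.com/cleanup:latest   # placeholder image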
2. Check Add-On Compatibility
Every Kubernetes add-on has a compatibility matrix. Before upgrading:
# Check installed Helm charts and their versions
helm list -A

# For each add-on, check Artifact Hub or the GitHub release notes
# Key add-ons to check:
# - cert-manager: https://cert-manager.io/docs/installation/supported-releases/
# - external-dns: check for Kubernetes version support
# - cluster-autoscaler: MUST match Kubernetes minor version exactly
# - metrics-server: check compatibility matrix
# - aws-load-balancer-controller: check EKS compatibility
The cluster autoscaler is the most dangerous one. Its documentation explicitly states that the autoscaler minor version must match the Kubernetes minor version. Running autoscaler 1.27.x against Kubernetes 1.28 can result in silent failures or incorrect scaling behavior.
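A quick way to catch the mismatch is to compare the autoscaler's image tag against your node versions. This assumes the autoscaler runs in kube-system with the usual deployment name and app=cluster-autoscaler label:
# Deployed autoscaler image (the tag encodes its Kubernetes minor version)
kubectl -n kube-system get deployment cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# Compare against the kubelet versions on your nodes
kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'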
3. Verify PodDisruptionBudgets
Node draining during upgrade will evict pods. Without PDBs, you might drain a node and take down all replicas of a critical service simultaneously.
# List multi-replica deployments (these are the ones that should have PDBs)
kubectl get deployments -A -o json | jq -r '
  .items[] |
  select(.spec.replicas > 1) |
  [.metadata.namespace, .metadata.name, .spec.replicas] |
  @tsv
'

# Check existing PDBs
kubectl get pdb -A

# Check for PDBs that would block drain (zero disruptions currently allowed)
kubectl get pdb -A -o json | jq -r '
  .items[] |
  select(.status.disruptionsAllowed == 0) |
  [.metadata.namespace, .metadata.name, .status.currentHealthy, .status.desiredHealthy] |
  @tsv
'
A PDB with minAvailable: 100% or maxUnavailable: 0 will block node drain indefinitely. You need kubectl drain to eventually complete, so either fix overly strict PDBs before upgrading or plan to temporarily relax them.
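For workloads that showed up in the first list without a PDB, a minimal PDB looks like the sketch below; the name, namespace, and label selector are placeholders. The patch line shows one way to temporarily relax a PDB that is blocking drains:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb          # placeholder name
  namespace: production
spec:
  maxUnavailable: 1             # allow one pod at a time to be evicted
  selector:
    matchLabels:
      app: api-server           # placeholder label

# Temporarily lower an overly strict minAvailable; revert after the upgrade
kubectl -n production patch pdb api-server-pdb \
  --type merge -p '{"spec":{"minAvailable":1}}'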
4. Set minReadySeconds
minReadySeconds is a Deployment/DaemonSet field that controls how long a new pod must be ready before it counts as available. Without it, Kubernetes considers a pod available as soon as its readiness probe passes once — even if it immediately crashes.
apiVersion: apps/v1
kind: Deployment
spec:
  minReadySeconds: 30   # pod must be healthy for 30s before counted available
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
This is especially important during node drains. Without minReadySeconds, a new pod can be counted as available and an old one evicted before the new pod is actually stable.
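To find deployments that don't set it, a jq filter along the same lines as the PDB check works; this only reports, it changes nothing:
# Deployments with no minReadySeconds (the field defaults to 0)
kubectl get deployments -A -o json | jq -r '
  .items[] |
  select(.spec.minReadySeconds == null or .spec.minReadySeconds == 0) |
  [.metadata.namespace, .metadata.name] |
  @tsv
'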
5. Test Add-On Upgrades in Staging
If you're also upgrading add-ons (cert-manager, ingress-nginx, etc.) as part of this process, test them in staging first. Add-on upgrades are separate from Kubernetes version upgrades, but they often land in the same maintenance window, and when both change at once it's much harder to tell which one broke things.
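One low-effort way to preview what a Helm-managed add-on upgrade will change is the helm-diff plugin; this sketch assumes the plugin is installed and the jetstack repo is already configured:
# Install the plugin once
helm plugin install https://github.com/databus23/helm-diff

# Show what the upgrade would change without applying it
helm diff upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --reuse-values \
  --version v1.14.0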
EKS Upgrade Strategy: Blue-Green Node Groups
For EKS, my preferred upgrade strategy is blue-green node groups rather than in-place node upgrades. Here's why: managed node group in-place upgrades use a rolling replace that you don't fully control. Blue-green gives you the ability to validate a new node group before migrating any workload.
Step 1: Upgrade the Control Plane
You trigger the control plane upgrade through the console or the CLI. This is non-disruptive — EKS runs control plane components on managed infrastructure with multiple replicas:
aws eks update-cluster-version \
  --name my-cluster \
  --kubernetes-version 1.30 \
  --region us-east-1

# Wait for upgrade to complete
aws eks wait cluster-active --name my-cluster --region us-east-1

# Verify
aws eks describe-cluster --name my-cluster \
  --query 'cluster.version' --output text
Step 2: Update Your kubeconfig
After the control plane upgrade, refresh your local kubeconfig and confirm kubectl can reach the new API server:
aws eks update-kubeconfig --name my-cluster --region us-east-1
kubectl version
Step 3: Create New Node Group
Create a new node group with the new Kubernetes version:
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name workers-v130 \
  --node-role arn:aws:iam::123456789:role/eks-node-role \
  --subnets subnet-abc123 subnet-def456 \
  --scaling-config minSize=3,maxSize=10,desiredSize=3 \
  --ami-type AL2_x86_64 \
  --instance-types m6i.large \
  --kubernetes-version 1.30 \
  --region us-east-1

# Wait for node group to be active
aws eks wait nodegroup-active \
  --cluster-name my-cluster \
  --nodegroup-name workers-v130
Step 4: Cordon Old Nodes and Migrate Workloads
# Get old node group nodes
OLD_NODES=$(kubectl get nodes -l eks.amazonaws.com/nodegroup=workers-v129 \
  -o jsonpath='{.items[*].metadata.name}')

# Cordon all old nodes (no new pods will schedule here)
for node in $OLD_NODES; do
  kubectl cordon $node
done

# Drain one node at a time
for node in $OLD_NODES; do
  echo "Draining $node..."
  kubectl drain $node \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --grace-period=60 \
    --timeout=300s

  # Wait a moment for pods to stabilize
  sleep 30

  # Verify cluster is healthy before continuing
  kubectl get nodes
  kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded | grep -v Completed
done
The --timeout=300s on drain is important. If a drain doesn't complete within 5 minutes, something is wrong — either a PDB is blocking eviction or a pod isn't terminating gracefully. Don't just increase the timeout; investigate.
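When a drain does time out, the two usual suspects are a PDB with no allowed disruptions and a pod that ignores SIGTERM. A few starting points, assuming $node is the node that failed to drain:
# What is still running on the node?
kubectl get pods -A --field-selector spec.nodeName=$node

# Which PDBs currently allow zero disruptions?
kubectl get pdb -A -o json | jq -r '
  .items[] | select(.status.disruptionsAllowed == 0) |
  [.metadata.namespace, .metadata.name] | @tsv
'

# For a pod stuck in Terminating, check its grace period and events
kubectl -n <namespace> describe pod <pod-name>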
Step 5: Delete Old Node Group
Once all nodes are drained and workloads are stable on the new node group:
aws eks delete-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name workers-v129 \
  --region us-east-1
Upgrading Add-Ons
After the node group migration, upgrade EKS-managed add-ons:
# List the latest available version for each add-on against your target K8s version
aws eks describe-addon-versions --kubernetes-version 1.30 \
  --query 'addons[].{Name:addonName,Latest:addonVersions[0].addonVersion}' \
  --output table

# Resolve the latest version at runtime and upgrade — avoids hardcoding versions
# that go stale between K8s releases

VPC_CNI_VERSION=$(aws eks describe-addon-versions \
  --kubernetes-version 1.30 \
  --addon-name vpc-cni \
  --query 'addons[0].addonVersions[0].addonVersion' \
  --output text)

aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version "$VPC_CNI_VERSION" \
  --resolve-conflicts OVERWRITE

COREDNS_VERSION=$(aws eks describe-addon-versions \
  --kubernetes-version 1.30 \
  --addon-name coredns \
  --query 'addons[0].addonVersions[0].addonVersion' \
  --output text)

aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name coredns \
  --addon-version "$COREDNS_VERSION" \
  --resolve-conflicts OVERWRITE
For Helm-managed add-ons like cert-manager or ingress-nginx, upgrade after node group migration is complete:
helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --reuse-values \
  --version v1.14.0
Post-Upgrade Verification
Don't call the upgrade done until you've verified these:
# All nodes on new version
kubectl get nodes -o wide

# All system pods running
kubectl get pods -n kube-system

# Check for any pods in error state across all namespaces
kubectl get pods -A | grep -v -E "Running|Completed|Succeeded"

# Verify API server is responding correctly
kubectl api-versions | grep -E "apps|batch|networking"

# Run a quick smoke test — create and delete a test pod
kubectl run test-pod --image=nginx:alpine --restart=Never
kubectl wait --for=condition=Ready pod/test-pod --timeout=60s
kubectl delete pod test-pod

# Check cluster-autoscaler logs for errors
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=50
Also run pluto again after the upgrade to confirm no deprecated API usage snuck through:
pluto detect-helm -o wide
pluto detect-files -d ./manifests
Common Upgrade Failures
PDB blocks drain indefinitely: A PDB with minAvailable equal to the current replica count prevents all evictions. Either scale up the deployment before draining or temporarily relax the PDB. Don't delete the PDB — you'll forget to recreate it.
Admission webhook timeout: Mutating or validating webhooks that are down or slow will block all API operations during upgrade. Check webhook configurations:
kubectl get mutatingwebhookconfigurations -o json | jq -r '.items[].metadata.name'
kubectl get validatingwebhookconfigurations -o json | jq -r '.items[].metadata.name'
Cluster autoscaler version mismatch: The autoscaler starts making bad decisions silently. Always update it immediately after the control plane upgrade.
Stuck terminating namespace: Old CRDs from removed controllers can leave namespaces stuck. Check with kubectl get ns and look for Terminating namespaces, then inspect their finalizers.
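A quick way to see what is holding a namespace open, where stuck-ns is a placeholder for the namespace in question:
# Namespaces stuck in Terminating
kubectl get ns | grep Terminating

# Inspect finalizers and the conditions that explain what is left behind
kubectl get namespace stuck-ns -o jsonpath='{.spec.finalizers}{"\n"}'
kubectl get namespace stuck-ns -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'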
The upgrade takes 45-90 minutes for a typical production cluster. Most of that is node drain time. If you're doing it right — checking each of these steps, draining carefully, validating between steps — you should have zero user-visible downtime.
About to run a Kubernetes upgrade and want a second opinion on your checklist? Talk to us at Coding Protocols. We review upgrade plans and have run this process on production EKS clusters ranging from 10 to 500 nodes.


