Kubernetes Cluster Upgrades Without Downtime: The Strategy That Actually Works
Upgrading Kubernetes clusters is one of those tasks that looks simple in the docs and bites you in production — here's the checklist and strategy I use to avoid surprises.

I've watched enough Kubernetes upgrade war stories to know the pattern: someone skips a minor version because they're two versions behind, or upgrades the control plane without checking for deprecated API usage, or discovers mid-upgrade that a critical add-on doesn't support the new version. The cluster limps along for days while the team figures out what broke.
Kubernetes upgrades don't have to be stressful. What makes them stressful is skipping the preparation phase. This post is that preparation phase, written as a concrete checklist and strategy rather than a conceptual overview.
Version Skew Policy: Why You Can't Skip Versions
Kubernetes has a strict version skew policy. Control plane components (kube-apiserver, kube-controller-manager, kube-scheduler) must stay within one minor version of each other, and the API server itself only supports upgrading one minor version at a time. The kubelet must not be newer than the API server and may lag it by up to two minor versions (three as of Kubernetes 1.28).
The consequence: if you're on 1.27 and want 1.30, you must upgrade through 1.28, 1.29, 1.30 — three separate upgrades. There is no shortcut.
This is not arbitrary. Each minor version can remove API versions that were deprecated in earlier releases, and the one-version-at-a-time path exists to give you time to migrate workloads off deprecated APIs before the release that removes them. Trying to skip versions means you'll encounter breaking changes you haven't prepared for.
The practical implication for a team that's two versions behind on a quarterly upgrade cycle: you're not doing one upgrade, you're doing three. Plan accordingly.
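If you are several versions behind, that means running the whole process below more than once. Here is a minimal sketch of the control plane hops on EKS, assuming a cluster named my-cluster; in practice you re-run the pre-upgrade checks and bring node groups along between each hop:
# Sequential control plane hops from 1.27 to 1.30; never skip a minor version
for version in 1.28 1.29 1.30; do
  aws eks update-cluster-version --name my-cluster --kubernetes-version "$version" --region us-east-1
  aws eks wait cluster-active --name my-cluster --region us-east-1
  # Re-run the deprecated-API scan and upgrade node groups here before the next hop
done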
Checking Your Current Skew
# Check control plane version
kubectl version

# Check node kubelet versions
kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'

# Check component versions
kubectl -n kube-system get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
If your nodes are already near the edge of the supported kubelet skew, upgrading the control plane will push them outside it, and out-of-skew kubelets misbehave in ways that are hard to debug. Fix node versions first.
Pre-Upgrade Checklist
This is the checklist I run through before every upgrade. Skip any item at your own risk.
1. Detect Deprecated API Usage
When Kubernetes removes an API version (e.g., extensions/v1beta1 Ingress removed in 1.22, batch/v1beta1 CronJob removed in 1.25), manifests using those APIs will fail to apply. Existing objects survive in etcd and are served through the replacement API version, but any client, chart, or controller still talking to the removed API path gets a 404 from the new API server.
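You can see this from the client side by requesting a resource at a fully qualified group/version. On a cluster that has already dropped batch/v1beta1, the first command below fails while the second still returns the same objects (this is just an illustration; the resource names are examples):
kubectl get cronjobs.v1beta1.batch -A    # fails once the API version is removed
kubectl get cronjobs.v1.batch -A         # the same objects, served via batch/v1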
Use pluto (simpler) or kubent (more comprehensive) to scan for deprecated API usage:
# Install pluto
brew install FairwindsOps/tap/pluto

# Scan your Helm releases
pluto detect-helm -o wide

# Scan a directory of manifests
pluto detect-files -d ./manifests

# Check what's removed in the target version
pluto list-versions --target-versions k8s=v1.30.0

# Install kubent
sh -c "$(curl -sSL https://git.io/install-kubent)"

# Run it — scans live cluster + Helm releases
kubent
Typical output from kubent:
>>> Deprecated APIs removed in 1.25 <<<
-------------------------------------------------------------------------------------------------
KIND                 NAMESPACE    NAME          API_VERSION       REPLACE_WITH                (SINCE)
CronJob              production   cleanup-job   batch/v1beta1     batch/v1                    (1.21.0)
PodSecurityPolicy    -            restricted    policy/v1beta1    Use PodSecurity admission   (1.21.0)
Fix every entry before proceeding. There are no exceptions to this — a single workload using a removed API will fail silently in ways that are hard to diagnose post-upgrade.
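For most entries the fix is mechanical: move the manifest to the replacement API version and re-apply. A sketch for the cleanup-job CronJob flagged above, where everything under spec is a placeholder and only the apiVersion actually needs to change:
apiVersion: batch/v1          # was batch/v1beta1
kind: CronJob
metadata:
  name: cleanup-job
  namespace: production
spec:
  schedule: "0 3 * * *"       # unchanged; only apiVersion moves
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: example.com/cleanup:latest   # placeholder image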
2. Check Add-On Compatibility
Every Kubernetes add-on has a compatibility matrix. Before upgrading:
# Check installed Helm charts and their versions
helm list -A

# For each add-on, check Artifact Hub or the GitHub release notes
# Key add-ons to check:
# - cert-manager: https://cert-manager.io/docs/installation/supported-releases/
# - external-dns: check for Kubernetes version support
# - cluster-autoscaler: MUST match Kubernetes minor version exactly
# - metrics-server: check compatibility matrix
# - aws-load-balancer-controller: check EKS compatibility
The cluster autoscaler is the most dangerous one. Its documentation explicitly states that the autoscaler minor version must match the Kubernetes minor version. Running autoscaler 1.27.x against Kubernetes 1.28 can result in silent failures or incorrect scaling behavior.
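A quick way to catch the mismatch is to compare the autoscaler's image tag against your node versions. This assumes the autoscaler runs in kube-system with the usual deployment name and app=cluster-autoscaler label:
# Deployed autoscaler image (the tag encodes its Kubernetes minor version)
kubectl -n kube-system get deployment cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# Compare against the kubelet versions on your nodes
kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'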
3. Verify PodDisruptionBudgets
Node draining during upgrade will evict pods. Without PDBs, you might drain a node and take down all replicas of a critical service simultaneously.
# List multi-replica deployments (these are the ones that should have PDBs)
kubectl get deployments -A -o json | jq -r '
  .items[] |
  select(.spec.replicas > 1) |
  [.metadata.namespace, .metadata.name, .spec.replicas] |
  @tsv
'

# Check existing PDBs
kubectl get pdb -A

# Check for PDBs that would block drain (zero disruptions currently allowed)
kubectl get pdb -A -o json | jq -r '
  .items[] |
  select(.status.disruptionsAllowed == 0) |
  [.metadata.namespace, .metadata.name, .status.currentHealthy, .status.desiredHealthy] |
  @tsv
'
A PDB with minAvailable: 100% or maxUnavailable: 0 will block node drain indefinitely. You need kubectl drain to eventually complete, so either fix overly strict PDBs before upgrading or plan to temporarily relax them.
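For workloads that showed up in the first list without a PDB, a minimal PDB looks like the sketch below; the name, namespace, and label selector are placeholders. The patch line shows one way to temporarily relax a PDB that is blocking drains:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb          # placeholder name
  namespace: production
spec:
  maxUnavailable: 1             # allow one pod at a time to be evicted
  selector:
    matchLabels:
      app: api-server           # placeholder label

# Temporarily lower an overly strict minAvailable; revert after the upgrade
kubectl -n production patch pdb api-server-pdb \
  --type merge -p '{"spec":{"minAvailable":1}}'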
4. Set minReadySeconds
minReadySeconds is a Deployment/DaemonSet field that controls how long a new pod must be ready before it counts as available. Without it, Kubernetes considers a pod available as soon as its readiness probe passes once — even if it immediately crashes.
apiVersion: apps/v1
kind: Deployment
spec:
  minReadySeconds: 30   # pod must be healthy for 30s before counted available
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
This is especially important during node drains. Without minReadySeconds, a new pod can be counted as available and an old one evicted before the new pod is actually stable.
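To find deployments that don't set it, a jq filter along the same lines as the PDB check works; this only reports, it changes nothing:
# Deployments with no minReadySeconds (the field defaults to 0)
kubectl get deployments -A -o json | jq -r '
  .items[] |
  select(.spec.minReadySeconds == null or .spec.minReadySeconds == 0) |
  [.metadata.namespace, .metadata.name] |
  @tsv
'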
5. Test Add-On Upgrades in Staging
If you're also upgrading add-ons (cert-manager, ingress-nginx, etc.) as part of this process, test them in staging first. Add-on upgrades are separate from Kubernetes version upgrades, but they often land in the same maintenance window, and when both change at once it's much harder to tell which one broke things.
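One low-effort way to preview what a Helm-managed add-on upgrade will change is the helm-diff plugin; this sketch assumes the plugin is installed and the jetstack repo is already configured:
# Install the plugin once
helm plugin install https://github.com/databus23/helm-diff

# Show what the upgrade would change without applying it
helm diff upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --reuse-values \
  --version v1.14.0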
EKS Upgrade Strategy: Blue-Green Node Groups
For EKS, my preferred upgrade strategy is blue-green node groups rather than in-place node upgrades. Here's why: managed node group in-place upgrades use a rolling replace that you don't fully control. Blue-green gives you the ability to validate a new node group before migrating any workload.
Step 1: Upgrade the Control Plane
You trigger the control plane upgrade through the console or the CLI. This is non-disruptive — EKS runs control plane components on managed infrastructure with multiple replicas:
aws eks update-cluster-version \
  --name my-cluster \
  --kubernetes-version 1.30 \
  --region us-east-1

# Wait for upgrade to complete
aws eks wait cluster-active --name my-cluster --region us-east-1

# Verify
aws eks describe-cluster --name my-cluster \
  --query 'cluster.version' --output text
Step 2: Update Your kubeconfig
After the control plane upgrade, refresh your local kubeconfig and confirm kubectl can reach the new API server:
aws eks update-kubeconfig --name my-cluster --region us-east-1
kubectl version
Step 3: Create New Node Group
Create a new node group with the new Kubernetes version:
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name workers-v130 \
  --node-role arn:aws:iam::123456789:role/eks-node-role \
  --subnets subnet-abc123 subnet-def456 \
  --scaling-config minSize=3,maxSize=10,desiredSize=3 \
  --ami-type AL2_x86_64 \
  --instance-types m6i.large \
  --kubernetes-version 1.30 \
  --region us-east-1

# Wait for node group to be active
aws eks wait nodegroup-active \
  --cluster-name my-cluster \
  --nodegroup-name workers-v130
Step 4: Cordon Old Nodes and Migrate Workloads
# Get old node group nodes
OLD_NODES=$(kubectl get nodes -l eks.amazonaws.com/nodegroup=workers-v129 \
  -o jsonpath='{.items[*].metadata.name}')

# Cordon all old nodes (no new pods will schedule here)
for node in $OLD_NODES; do
  kubectl cordon $node
done

# Drain one node at a time
for node in $OLD_NODES; do
  echo "Draining $node..."
  kubectl drain $node \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --grace-period=60 \
    --timeout=300s

  # Wait a moment for pods to stabilize
  sleep 30

  # Verify cluster is healthy before continuing
  kubectl get nodes
  kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded | grep -v Completed
done
The --timeout=300s on drain is important. If a drain doesn't complete within 5 minutes, something is wrong — either a PDB is blocking eviction or a pod isn't terminating gracefully. Don't just increase the timeout; investigate.
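When a drain does time out, the two usual suspects are a PDB with no allowed disruptions and a pod that ignores SIGTERM. A few starting points, assuming $node is the node that failed to drain:
# What is still running on the node?
kubectl get pods -A --field-selector spec.nodeName=$node

# Which PDBs currently allow zero disruptions?
kubectl get pdb -A -o json | jq -r '
  .items[] | select(.status.disruptionsAllowed == 0) |
  [.metadata.namespace, .metadata.name] | @tsv
'

# For a pod stuck in Terminating, check its grace period and events
kubectl -n <namespace> describe pod <pod-name>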
Step 5: Delete Old Node Group
Once all nodes are drained and workloads are stable on the new node group:
aws eks delete-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name workers-v129 \
  --region us-east-1
Upgrading Add-Ons
After the node group migration, upgrade EKS-managed add-ons:
# List the latest available version for each add-on against your target K8s version
aws eks describe-addon-versions --kubernetes-version 1.30 \
  --query 'addons[].{Name:addonName,Latest:addonVersions[0].addonVersion}' \
  --output table

# Resolve the latest version at runtime and upgrade — avoids hardcoding versions
# that go stale between K8s releases

VPC_CNI_VERSION=$(aws eks describe-addon-versions \
  --kubernetes-version 1.30 \
  --addon-name vpc-cni \
  --query 'addons[0].addonVersions[0].addonVersion' \
  --output text)

aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version "$VPC_CNI_VERSION" \
  --resolve-conflicts OVERWRITE

COREDNS_VERSION=$(aws eks describe-addon-versions \
  --kubernetes-version 1.30 \
  --addon-name coredns \
  --query 'addons[0].addonVersions[0].addonVersion' \
  --output text)

aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name coredns \
  --addon-version "$COREDNS_VERSION" \
  --resolve-conflicts OVERWRITE
For Helm-managed add-ons like cert-manager or ingress-nginx, upgrade after node group migration is complete:
helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --reuse-values \
  --version v1.14.0
Post-Upgrade Verification
Don't call the upgrade done until you've verified these:
# All nodes on new version
kubectl get nodes -o wide

# All system pods running
kubectl get pods -n kube-system

# Check for any pods in error state across all namespaces
kubectl get pods -A | grep -v -E "Running|Completed|Succeeded"

# Verify API server is responding correctly
kubectl api-versions | grep -E "apps|batch|networking"

# Run a quick smoke test — create and delete a test pod
kubectl run test-pod --image=nginx:alpine --restart=Never
kubectl wait --for=condition=Ready pod/test-pod --timeout=60s
kubectl delete pod test-pod

# Check cluster-autoscaler logs for errors
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=50
Also run pluto again after the upgrade to confirm no deprecated API usage snuck through:
pluto detect-helm -o wide
pluto detect-files -d ./manifests
Common Upgrade Failures
PDB blocks drain indefinitely: A PDB with minAvailable equal to the current replica count prevents all evictions. Either scale up the deployment before draining or temporarily relax the PDB. Don't delete the PDB — you'll forget to recreate it.
Admission webhook timeout: Mutating or validating webhooks that are down or slow will block all API operations during upgrade. Check webhook configurations:
kubectl get mutatingwebhookconfigurations -o json | jq -r '.items[].metadata.name'
kubectl get validatingwebhookconfigurations -o json | jq -r '.items[].metadata.name'
Cluster autoscaler version mismatch: The autoscaler starts making bad decisions silently. Always update it immediately after the control plane upgrade.
Stuck terminating namespace: Old CRDs from removed controllers can leave namespaces stuck. Check with kubectl get ns and look for Terminating namespaces, then inspect their finalizers.
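A quick way to see what is holding a namespace open, where stuck-ns is a placeholder for the namespace in question:
# Namespaces stuck in Terminating
kubectl get ns | grep Terminating

# Inspect finalizers and the conditions that explain what is left behind
kubectl get namespace stuck-ns -o jsonpath='{.spec.finalizers}{"\n"}'
kubectl get namespace stuck-ns -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'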
The upgrade takes 45-90 minutes for a typical production cluster. Most of that is node drain time. If you're doing it right — checking each of these steps, draining carefully, validating between steps — you should have zero user-visible downtime.
About to run a Kubernetes upgrade and want a second opinion on your checklist? Talk to us at Coding Protocols. We review upgrade plans and have run this process on production EKS clusters ranging from 10 to 500 nodes.


