Alright. “Next time” is now.
Cluster node upgrades. Charts are done, deployments verified - now the nodes themselves.
Managed clusters - EKS, GKE, AKS - there’s actually a button. With caveats: watch addon versions, node group compatibility, don’t skip minor versions. But manageable.
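The "don't skip minor versions" caveat is mechanical enough to check before pressing the button. A hypothetical helper - the version strings are examples, not from any real cluster:

```shell
# Hypothetical guard for the "don't skip minor versions" rule:
# kubeadm-style upgrades only support moving up one minor at a time.
minor_of() { echo "$1" | cut -d. -f2; }

can_upgrade() {
  local from="$1" to="$2"
  local gap=$(( $(minor_of "$to") - $(minor_of "$from") ))
  [ "$gap" -ge 0 ] && [ "$gap" -le 1 ]
}

can_upgrade 1.31.4 1.32.0 && echo "ok: 1.31 -> 1.32"
can_upgrade 1.30.2 1.32.0 || echo "blocked: 1.30 -> 1.32 skips 1.31"
```

Two minors behind means two full upgrade rounds, not one bigger jump.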
Self-hosted - we do it ourselves. There’s no “upgrade cluster” button in kubeadm. There’s a sequence you can’t get wrong:
- Backup (should be regular, but especially - before upgrading)
- `kubectl drain <node>` - evict pods from the node
- `apt-mark unhold kubelet kubeadm kubectl`
- `apt install kubelet=1.32.0-1.1 kubeadm=1.32.0-1.1 kubectl=1.32.0-1.1` - version must be explicit
- `kubeadm upgrade apply` - on the control plane, then `kubeadm upgrade node` on workers
- Fix anything that’s off
- `kubectl uncordon <node>`, quick check
- Repeat for each next node
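The per-node sequence above can be sketched as a loop. Node names and the version pin are placeholders for your environment, and `DRY_RUN=1` (the default here) only prints the commands instead of executing them - this is a sketch, not a drop-in script:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the per-node loop; node names and the version
# pin are placeholders. DRY_RUN=1 prints commands instead of running them.

VERSION="1.32.0-1.1"        # explicit version - never "latest"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

upgrade_node() {
  local node="$1"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # the apt steps run on the node itself (over ssh in a real setup):
  run apt-mark unhold kubelet kubeadm kubectl
  run apt install -y kubelet="$VERSION" kubeadm="$VERSION" kubectl="$VERSION"
  run kubeadm upgrade node
  run apt-mark hold kubelet kubeadm kubectl
  run kubectl uncordon "$node"
  run kubectl get nodes "$node"   # quick check before the next node
}

# one node at a time, not all at once
for n in worker-1 worker-2; do
  upgrade_node "$n"
done
```

Re-pinning the packages with `apt-mark hold` after the install keeps an unattended `apt upgrade` from pulling the next kubelet behind your back.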
One node at a time. Not all at once.
kubespray, talos, and similar - upgrade through the same tooling. But check dependency versions: kubespray doesn’t always keep up with upstream k8s.
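For illustration, the tool-driven path is roughly one invocation - inventory path and versions here are placeholders, and the commands are printed rather than executed:

```shell
# Printed, not executed - paths and versions are placeholders.
show() { echo "+ $*"; }

# kubespray: the upgrade playbook, pinned to an explicit version
show ansible-playbook -i inventory/mycluster/hosts.yaml upgrade-cluster.yml \
     -e kube_version=v1.32.0

# talos: upgrade Kubernetes through talosctl
show talosctl upgrade-k8s --to 1.32.0
```

The tool handles the drain/upgrade/uncordon choreography, but the version-skew rules from above still apply - which is exactly why a kubespray release lagging upstream matters.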
After upgrading: check pods, metrics, alerts.
Then repeat the whole thing on prod - accounting for everything that broke in dev.
Two weeks later. Another standup.
“Upgrading nodes, about two thirds done.”
This isn’t procrastination. It actually takes this long.
While the upgrade is running - one person is holding in their head simultaneously: version compatibility, which node is already done, what broke in dev and needs to be handled differently in prod, don’t forget to check alerts after each node. One person. Full attention.
They have a pile of notes and checklists so nothing gets missed. Those are personal notes, not documentation. Documentation comes later - if there’s time between the upgrade finishing and the next “urgent” task. At least the troubleshooting list should be readable.
If you don’t write the docs right after - a month later you can only make sense of those notes by reproducing the whole process again. Next upgrade: they’ll fix problems faster. But they won’t avoid them.
For the manager this means: new feature - it waits. Urgent fix - depends how bad. Onboarding a new dev - bad timing. A cluster upgrade doesn’t block one task. It blocks the whole team’s rhythm for weeks.
Bus factor isn’t abstract here. It’s specific: when that person leaves, they take with them the notes, the list of gotchas from the last upgrade, the understanding of “why we do it this way.” The next person starts from scratch. Steps on the same rakes.
Do you have a runbook for cluster upgrades? Or is it “ask Dave”?
Write one. While Dave is still here.
P.S. This was the cluster itself. But there’s one more layer - the OS underneath. That’s next time.
