“System update available” - familiar popup?
Click. Wait. Done.

IT infrastructure doesn’t work that way.

Sure, DevOps engineers hate updates. But you have to update - vulnerabilities, new features, end of support. Fine, we’re lazy, we automate it.

If we’re shipping our own app, we build the pipeline - that’s part of the job. Once it’s running, it’s the developers’ problem. They break something, it’s their pain.

But what about updating our own infrastructure tools? The k8s version, Prometheus, Grafana, the logging stack, and so on. That’s harder. What’s a dev environment for developers is still prod for us. Someone’s using it. Not as critical as real prod, but still.

Let’s take k8s as the example - everything else follows the same pattern.

The Prometheus operator is trickier: three components (Prometheus, Alertmanager, Grafana), a Helm chart, and CRDs.

In early versions of kube-prometheus-stack (the Helm v2 days), CRDs had to be installed manually, before the Helm chart. And upgraded before it, too. Not after, not together - before.

Get the order wrong and the old operator doesn’t understand the new CRDs. Or the new chart can’t install because the schema doesn’t match. Result: monitoring is down. Alerts are silent. Because alerting is down too.
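The "CRDs first, chart second" rule is easier to keep when it is written down as code instead of remembered. A minimal sketch, not our actual tooling: the `upgrade_plan` helper, the release name, the chart reference, and the `crds/` directory are all assumptions for illustration. It only builds the two commands in the right order, so the sequence can be reviewed before anything touches the cluster.

```python
from typing import List


def upgrade_plan(release: str, chart: str, version: str, namespace: str,
                 crd_dir: str, values_file: str) -> List[List[str]]:
    """Return the commands for a chart upgrade, with CRDs strictly first.

    Hypothetical helper: it assumes the new CRD manifests were already
    unpacked into crd_dir.
    """
    # Step 1: apply the new CRDs so the API server knows the new schema.
    apply_crds = [
        "kubectl", "apply", "--server-side", "--force-conflicts",
        "-f", crd_dir,
    ]
    # Step 2: only then upgrade the chart; --atomic rolls back on failure.
    upgrade_chart = [
        "helm", "upgrade", release, chart,
        "--namespace", namespace,
        "--version", version,
        "--values", values_file,
        "--atomic",
    ]
    return [apply_crds, upgrade_chart]


if __name__ == "__main__":
    # Placeholder arguments; substitute your own release, chart and version.
    plan = upgrade_plan("monitoring", "prometheus-community/kube-prometheus-stack",
                        "<new-version>", "monitoring", "crds/", "values.yaml")
    for command in plan:
        print(" ".join(command))
```

To actually run the plan, pass each command to `subprocess.run(command, check=True)` and stop at the first failure, so the chart upgrade never starts if the CRD step did not succeed.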

How we update a Kubernetes cluster:

1. Read the changelog
2. Check the list of deprecated and removed features
3. Check the cluster: which of those are we actually using
4. Find old charts and manifests that depend on them - update those first
5. Find the new chart version, compare: parameters, templates, values, defaults, renames
6. Update values.yaml
7. helm upgrade --atomic (with --reuse-values if needed)
8. Verify: everything installed, running, nothing crashed
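The "which of those are we actually using" check is the easiest one to script. A rough sketch, under assumptions: manifests arrive as already-parsed dicts (for example from rendered chart output fed through a YAML parser), and the `REMOVED_IN` table is a small hand-picked subset, not a complete list. The official deprecation guide stays the authority. The function names are made up for this example.

```python
from typing import Dict, Iterable, List, Tuple

# (apiVersion, kind) -> first Kubernetes version that no longer serves it.
# Illustrative subset only; check the official deprecation guide.
REMOVED_IN: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("apiextensions.k8s.io/v1beta1", "CustomResourceDefinition"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def parse_minor(version: str) -> Tuple[int, int]:
    """Turn 'v1.25.3' or '1.25' into (1, 25)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def blocked_by_upgrade(manifests: Iterable[dict], target: str) -> List[str]:
    """List manifests whose API is no longer served in the target version."""
    target_version = parse_minor(target)
    findings = []
    for manifest in manifests:
        api_version = manifest.get("apiVersion", "")
        kind = manifest.get("kind", "")
        removed = REMOVED_IN.get((api_version, kind))
        # Flag it if the removal version is at or below the upgrade target.
        if removed is not None and removed <= target_version:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(
                f"{kind}/{name}: {api_version} removed in {removed[0]}.{removed[1]}"
            )
    return findings


if __name__ == "__main__":
    sample = [
        {"apiVersion": "batch/v1beta1", "kind": "CronJob",
         "metadata": {"name": "backup"}},
        {"apiVersion": "apps/v1", "kind": "Deployment",
         "metadata": {"name": "web"}},
    ]
    for finding in blocked_by_upgrade(sample, "1.25"):
        print(finding)
```

Anything this reports belongs to step 4: fix those charts and manifests before the upgrade, not after.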

If there’s no new version, or it’s a custom chart - you update it yourself. Fork it, fix the template, or use Kustomize.

Upgrading the cluster nodes? That’s for next time. Just like in real life.

Bottom line: we haven’t even touched the cluster itself yet. Just the charts. And it’s already dozens of steps.

And all of it lives in one person’s head. The most you’ll hear on a call: “updating charts, haven’t gotten to the cluster yet, about halfway done.”

Until they go on vacation. Or quit.

Bus factor. We know.

How many helm charts do you have in your cluster? When did you last update all of them?