How to break your HA monitoring with a single dashboard

Everyone knows what HA is for - spread the load across multiple instances, and if one goes down, the others keep running.

That’s correct, but there’s a catch: HA doesn’t help when the application is killed by doing exactly what it’s supposed to do. That’s the idea behind many DoS attacks: make the application do heavy work in response to a cheap request.


The application itself was working fine. We all know botnets scan everything looking for vulnerabilities - PHP files that might exist, known paths, all that. Not a targeted attack, just fishing.

But as a result of those requests we were getting a fast-growing list of unique time series - the request metrics carried an endpoint label, and the botnet was cycling through unique paths by the hundreds. Not a few series with many samples each, but thousands of unique series with a few samples each.

For Prometheus this is expensive - every unique combination of metric name and label values is its own time series. One long series stores data compactly. Thousands of short ones take up a lot more space, because each one brings its own label set and index entries.
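
The effect can be sketched in a few lines of Python. This is a toy model, not our actual instrumentation: the metric name http_requests_total, the route table and the scanner paths are all made up for illustration. It counts how many distinct series appear when the raw request path goes into the endpoint label, compared with mapping unknown paths to a single "other" value.

```python
from collections import Counter

# Hypothetical route table: the paths the application actually serves.
KNOWN_ROUTES = {"/", "/login", "/api/orders"}


def series_key(metric: str, labels: dict) -> tuple:
    """A time series is identified by the metric name plus its full label set."""
    return (metric, tuple(sorted(labels.items())))


def count_series(paths, normalize):
    """Count requests per distinct series for a stream of request paths."""
    requests_per_series = Counter()
    for path in paths:
        labels = {"endpoint": normalize(path)}
        requests_per_series[series_key("http_requests_total", labels)] += 1
    return requests_per_series


def raw_path(path: str) -> str:
    """Use the request path as-is (unbounded label values)."""
    return path


def known_route_or_other(path: str) -> str:
    """Collapse anything outside the route table into one label value."""
    return path if path in KNOWN_ROUTES else "other"


# Simulated traffic: real requests plus a scanner probing unique paths.
real_traffic = ["/", "/login", "/api/orders"] * 100
scanner_traffic = [f"/wp-content/plugin{i}.php" for i in range(500)]
traffic = real_traffic + scanner_traffic

raw = count_series(traffic, raw_path)
bounded = count_series(traffic, known_route_or_other)

print(len(raw))      # 503 series: 3 real + 500 one-off scanner paths
print(len(bounded))  # 4 series: 3 real + "other"
```

With the raw path as the label, the number of series is controlled by whoever is sending requests. With a bounded set of label values, it is controlled by the application.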

Thanos storage was growing. Quietly.


So where does the dashboard come in?

Yes, Prometheus was storing unnecessary data - annoying, but not critical. What actually took down the monitoring was a single dashboard showing application health: status, response time, error rate, that kind of thing.

That dashboard was querying metrics. All application metrics. All time series. Each one short, but there were a lot of them.

And no amount of HA helped here - when the PromQL query ran, thanos-query would eat memory like crazy and get killed when it hit its resource limit. The same query then landed on the next replica, with the same result. Each replica, one by one.

(Before we set resource limits, it was killing the nodes themselves - the ones dedicated to monitoring. Some HA.)
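
Why replicas don't help here can be shown with a toy model. Everything in it is an assumption made for illustration: the per-series memory cost, the 2 GiB limit, the replica names and the retry behaviour are invented, not measured from our setup or taken from Thanos itself. The point is only that a query which is too big for one replica is too big for all of them.

```python
from dataclasses import dataclass

# All numbers are invented for illustration, not measurements.
BYTES_PER_SERIES = 64 * 1024   # assumed memory cost of loading one series
MEMORY_LIMIT = 2 * 1024**3     # assumed 2 GiB memory limit per replica


@dataclass
class Replica:
    name: str
    alive: bool = True

    def run_query(self, matching_series: int) -> bool:
        """Return True if the query fits in memory, else mark the replica as killed."""
        needed = matching_series * BYTES_PER_SERIES
        if needed > MEMORY_LIMIT:
            self.alive = False  # stands in for an OOM kill
            return False
        return True


def open_dashboard(replicas, matching_series):
    """Try each live replica in turn, the way a retry or load balancer would."""
    for replica in replicas:
        if replica.alive and replica.run_query(matching_series):
            return replica.name
    return None


# Hypothetical replica names.
replicas = [Replica(f"thanos-query-{i}") for i in range(3)]
served_by = open_dashboard(replicas, matching_series=50_000)

print(served_by)                    # None
print([r.alive for r in replicas])  # [False, False, False]
```

Adding a fourth or fifth replica to this model changes nothing: the failure depends on the query, not on how many instances are available to run it.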


The attempt to build high availability into monitoring was successfully undermined by bad application metrics.

The dashboard that was supposed to show that everything was fine crashed the monitoring when you opened it.