I like metrics. But I don’t like alerts. You know, all that stuff - prometheus (or victoriametrics) + grafana + alertmanager + loki, and if you care about capacity - thanos (storage, compactor, query…). You can easily deploy it and dive into maintenance hell. Create your own dashboards to show your own metrics, your own filters for logs. It so happens that I was obliged to implement such things, even dashboards for an app whose dev hadn’t implemented any metrics/traces in it.
I created my own status page based on regular metrics, like pod healthchecks (and compute rules), and I’m really proud of it. Of course it is not a competitor to betterstack uptime monitor (which is free with a custom domain, custom branding, and several checks). But it was my thing, I built it, it ran in prod and it was reliable.
Yeah, I like metrics and hate getting alerts at 3am (hi, standard AI talking point). So, automation is the key. It is the whole point of inventing computing machines. Why set an alert and not an action based on it? Threshold, rules to check, action to run. Autoscaling, restarting, restoring, banhammering, deciding…
And yes, all that can be done without AI. If you can describe the rules and reasons verbally - you can automate it with a simple script running in kilobytes of memory.
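To make "threshold, rules to check, action to run" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the metric names (cpu_percent, failed_healthchecks, top_ip_rps), the thresholds, and the action names are made up, and the actions only return a label instead of calling a real orchestrator or firewall.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Rule:
    name: str
    condition: Callable[[Metrics], bool]  # the threshold / rule to check
    action: Callable[[Metrics], str]      # the action to run when it fires


def evaluate(rules: List[Rule], metrics: Metrics) -> List[str]:
    """Run the action of every rule whose condition holds; return the results."""
    results = []
    for rule in rules:
        if rule.condition(metrics):
            results.append(rule.action(metrics))
    return results


# Hypothetical rules. In a real setup each action would call your
# orchestrator, firewall, or backup tooling instead of returning a label.
RULES = [
    Rule("autoscale",
         lambda m: m.get("cpu_percent", 0) > 80,
         lambda m: "scale_up"),
    Rule("restart",
         lambda m: m.get("failed_healthchecks", 0) >= 3,
         lambda m: "restart_pod"),
    Rule("banhammer",
         lambda m: m.get("top_ip_rps", 0) > 1000,
         lambda m: "ban_ip"),
]

if __name__ == "__main__":
    sample = {"cpu_percent": 95, "failed_healthchecks": 3, "top_ip_rps": 20}
    print(evaluate(RULES, sample))
```

In practice you would feed evaluate() with values pulled from your metrics backend on a timer, and keep the alert only as a fallback for when an action fails.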
PS: sometimes it is cheaper to shut down the service than to scale it to the sky)