Alert Freshness Check
Are alerts firing on stale data? Check.
What stale-data alerts look like
An alert evaluates against a metric that hasn't updated in an hour. Depending on the backend, the query either keeps returning the last good value or returns nothing at all; either way the threshold is never crossed and no page fires.
Result: silent loss of detection. The system looks healthy because the metric pipeline is broken, not because the system is fine.
Common during exporter restarts, agent upgrades, and network partitions between scrapers and the metrics backend.
How to detect staleness
Pair every paging alert with an absent() or zero-rate check. If absent(my_metric{}) holds for 5 minutes, fire StaleData.
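A minimal sketch of that pairing as a Prometheus rule file. The metric and alert names (payment_requests_total, PaymentErrorRateHigh) are placeholders; the shape is what matters: one paging rule, one StaleData rule watching the same metric with absent().

    groups:
    - name: payment-freshness
      rules:
      # Hypothetical paging alert on a hypothetical metric.
      - alert: PaymentErrorRateHigh
        expr: rate(payment_errors_total[5m]) / rate(payment_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: page
      # Paired freshness check: fires when the metric the page depends on
      # has produced no series at all for 5 minutes.
      - alert: StaleData
        expr: absent(payment_requests_total)
        for: 5m
        labels:
          severity: ticket

absent() only fires when every series of the metric is gone; if the page keys on a specific label set, pass the same selector to absent().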
Know how Prometheus handles staleness: a series gets an explicit staleness marker when it disappears from a scrape, and otherwise drops out of query results 5 minutes after its last sample (the lookback window). Datadog has explicit no-data alerts; turn them on.
Track scrape_duration_seconds and up{} per target. A target where up == 0 for 10 minutes is a freshness incident.
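A sketch of the per-target check, again as a Prometheus rule; the 10-minute window matches the threshold above, and the annotation names which target went dark.

    groups:
    - name: scrape-freshness
      rules:
      - alert: ScrapeTargetDown
        expr: up == 0
        for: 10m
        labels:
          severity: ticket
        annotations:
          summary: "Scrapes of {{ $labels.job }}/{{ $labels.instance }} have been failing for 10 minutes"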
Routing freshness alerts
Send freshness alerts to the team that owns the exporter, not the team that owns the service. Often the same team but not always.
Severity is sev2, not sev1. The detection is broken but the system may still be fine. Ticket, not page.
Resolve freshness alerts cleanly. When the metric returns, auto-resolve immediately. Do not require a manual click.
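A sketch of the routing half as an Alertmanager config fragment. The receiver names, the team label, and the webhook URL are assumptions; the two points it illustrates are a dedicated route for freshness alerts and send_resolved, so the notification clears itself the moment the metric returns.

    route:
      receiver: service-oncall              # default: paging alerts go to the service owners
      routes:
      - matchers:
        - severity="ticket"
        - team="metrics-platform"           # label attached by the freshness rules
        receiver: metrics-platform-tickets
    receivers:
    - name: service-oncall                  # paging integration omitted in this sketch
    - name: metrics-platform-tickets
      webhook_configs:
      - url: https://ticketing.example.internal/hooks/alertmanager
        send_resolved: true                 # resolve automatically, no manual click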
Specific examples
kube-state-metrics pod restarted: 30s gap. Don't alert on that gap.
node_exporter network down: indefinite gap. Alert at 10 minutes.
Prometheus federation broken: per-cluster gap. Alert at 5 minutes per cluster.
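The same three cases as rules, with the thresholds above baked into for:. The job names come from a typical kube-prometheus setup and are assumptions; adjust to the local scrape config.

    groups:
    - name: exporter-freshness
      rules:
      # kube-state-metrics restarts cause ~30s gaps; nothing here fires on those,
      # because both rules below wait far longer than a single missed scrape.
      - alert: NodeExporterStale
        expr: up{job="node-exporter"} == 0
        for: 10m        # indefinite gap: ticket after 10 minutes
        labels:
          severity: ticket
      - alert: FederationStale
        expr: up{job="federate"} == 0
        for: 5m         # each cluster is its own federation target, so this fires per cluster
        labels:
          severity: ticket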
Apply to all paging alerts
Every sev1 alert needs a paired freshness check. No exceptions.
Skip for ticket-tier alerts. The cost-benefit isn't there.
Schedule a quarterly audit: walk the rule list and confirm every paging alert has a paired absent()/up freshness check.