Alert Freshness Check
Are your alerts evaluating stale data? Check.
What stale-data alerts look like
A stale-data alert evaluates against a metric that has not updated in, say, an hour. The query returns the last good value (or, once Prometheus marks the series stale, no value at all), the alert condition is never met, and no page fires. The result is silently broken detection: the system looks healthy because the metric pipeline is broken, not because the system is fine. This is common during exporter restarts, agent upgrades, and network partitions.
- Last-good-value evaluation. The query returns the cached last value, or nothing once the series goes stale; the alert condition is never met; no page fires.
- Silent broken detection. System looks healthy because the metric pipeline is broken, not because the system is fine.
- Common triggers. Exporter restarts, agent upgrades, network partitions to the metrics backend.
- Per-pipeline risk. Every paging alert that depends on a metric carries this risk; the freshness check covers it.
How to detect staleness
Detection means wiring a freshness check next to every paging alert. An absent(my_metric{}) condition held for 5 minutes fires StaleData. By default Prometheus treats a series as stale 5 minutes after its last sample (sooner when a scrape fails and a staleness marker is written); Datadog has explicit no-data alerts. Track scrape_duration_seconds and up{} per target so that up == 0 for 10 minutes triggers a freshness incident.
- absent() check. Layered next to every paging alert; absent(my_metric{}) for 5m fires StaleData.
- Native staleness markers. Prometheus considers metrics stale 5 minutes after last sample; Datadog has explicit no-data alerts.
- Per-target up{} tracking. up == 0 for 10 minutes is a freshness incident; the target itself is not reporting.
- Per-alert paired check. Every paging alert has a paired freshness check; the discipline is uniform.
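The paired checks above might be expressed as a Prometheus rule group along these lines. This is a minimal sketch, not a reference configuration: the `job="my-service"` selector, the `MyMetricHigh` alert and its threshold of 100, the `TargetDown` name, and the `check: freshness` label are all assumptions made for illustration. Only `my_metric`, `StaleData`, and the 5 and 10 minute windows come from the text.

```yaml
groups:
  - name: my-service-paging
    rules:
      # Hypothetical paging alert; the threshold of 100 is illustrative only.
      - alert: MyMetricHigh
        expr: my_metric{job="my-service"} > 100
        for: 5m
        labels:
          severity: sev1

      # Paired freshness check: fires when no my_metric series has been
      # visible for 5 minutes, the case in which MyMetricHigh cannot fire.
      - alert: StaleData
        expr: absent(my_metric{job="my-service"})
        for: 5m
        labels:
          severity: sev2
          check: freshness   # hypothetical label, useful for routing
        annotations:
          summary: "my_metric has returned no data for 5 minutes"

      # Per-target check: the scrape itself has been failing for 10 minutes.
      - alert: TargetDown
        expr: up{job="my-service"} == 0
        for: 10m
        labels:
          severity: sev2
          check: freshness   # hypothetical label, useful for routing
        annotations:
          summary: "Scrape of {{ $labels.instance }} has failed for 10 minutes"
```

Note that `absent()` only returns a result when no series at all match the selector. If one of several instances goes quiet while the others keep reporting, the per-target `up == 0` rule is the one that catches it.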
Routing freshness alerts
Freshness alerts route differently from service alerts. Send them to the team that owns the exporter, not the team that owns the service (often the same team, but not always). Use severity sev2: detection is broken but the system may still be fine, so open a ticket rather than page. Auto-resolve as soon as the metric returns rather than requiring a manual click.
- Route to exporter owner. Often the same team as the service owner; not always.
- Sev2 severity. Detection is broken but system may still be fine; ticket, not page.
- Auto-resolve on return. When the metric returns, auto-resolve immediately; no manual click.
- Per-route documented owner. Each freshness alert has an owner team; supports the routing discipline.
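One way to express this routing is an Alertmanager configuration like the sketch below. It assumes that every freshness rule carries a hypothetical `check: freshness` label, and the receiver names and webhook URLs are placeholders rather than real endpoints.

```yaml
route:
  # Default: service alerts go to the service team's on-call.
  receiver: service-oncall
  routes:
    # Freshness checks are matched by a label set on the rule
    # (check: freshness is a hypothetical convention).
    - matchers:
        - 'check = "freshness"'
      receiver: exporter-owner-tickets

receivers:
  - name: service-oncall
    webhook_configs:
      - url: "https://pager.example.internal/hook"
  - name: exporter-owner-tickets
    webhook_configs:
      - url: "https://tickets.example.internal/hook"
        # Send a resolved notification when the alert clears, so the
        # ticket system can close the ticket without a manual click.
        send_resolved: true
```

Prometheus stops sending an alert once its expression no longer returns a result, so the freshness alert clears by itself when the metric comes back. `send_resolved` only controls whether the receiver is told about it.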
Specific examples
Three concrete examples calibrate the freshness windows. A kube-state-metrics pod restart leaves a 30-second gap; don’t alert on it. node_exporter losing its network leaves an indefinite gap; alert at 10 minutes. Broken Prometheus federation leaves a per-cluster gap; alert at 5 minutes per cluster.
- kube-state-metrics restart. 30-second gap; don’t alert on the gap; the noise isn’t worth it.
- node_exporter network down. Indefinite gap; alert at 10 minutes; the threshold catches sustained loss.
- Prometheus federation broken. Per-cluster gap; alert at 5 minutes per cluster; cluster-specific routing.
- Per-source threshold tuning. Each source has its own normal restart window; the freshness threshold matches.
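The three windows above could be encoded as per-source `for:` durations, roughly as follows. The job names (`node-exporter`, `federate`), the `cluster` target label, the alert names, and the 5 minute hold for kube-state-metrics are assumptions. The text only says that the 30-second restart gap should not alert.

```yaml
groups:
  - name: freshness-per-source
    rules:
      # kube-state-metrics: a pod restart leaves a gap of about 30 seconds.
      # The 5m hold is an assumed value, chosen only to sit well above it.
      - alert: KubeStateMetricsStale
        expr: absent(kube_pod_info)
        for: 5m
        labels:
          severity: sev2

      # node_exporter: the gap is indefinite while the network is down,
      # so alert once the target has been unreachable for 10 minutes.
      - alert: NodeExporterUnreachable
        expr: up{job="node-exporter"} == 0
        for: 10m
        labels:
          severity: sev2

      # Federation: one scrape target per cluster, so each cluster gets its
      # own alert after 5 minutes and can be routed by its cluster label.
      - alert: FederationStale
        expr: up{job="federate"} == 0
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "Federation scrape for cluster {{ $labels.cluster }} has failed for 5 minutes"
```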
Apply to all paging alerts
Freshness checks target the paging tier. Every sev1 alert needs a paired freshness check, no exceptions. Skip ticket-tier alerts, where the cost-benefit isn’t there. Schedule a quarterly audit that searches the rule list for absent() checks to confirm freshness coverage.
- Every sev1 paired. No exceptions; the freshness check is mandatory at the highest tier.
- Skip ticket tier. Cost-benefit isn’t there; the discipline is targeted.
- Quarterly coverage audit. Query absent() across the rule list; confirm freshness coverage.
- Per-quarter coverage delta. Documented coverage trend; supports continued investment in the discipline.