Alert Freshness Check

Are paging alerts being evaluated against stale data? Check.

What stale-data alerts look like

Stale-data alerts evaluate against a metric that has stopped updating, for example one whose last sample is an hour old. The PromQL returns the last good value (or, once the series goes stale, nothing at all), so the alert condition is never met and no page fires. The result is silently broken detection: the system looks healthy because the metric pipeline is broken, not because the system is fine. This is common during exporter restarts, agent upgrades, and network partitions.

How to detect staleness

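Detection wires a freshness check next to every paging alert. In Prometheus, absent(my_metric) held for 5 minutes fires StaleData. By default Prometheus stops returning a series 5 minutes after its last sample (the query lookback window), or sooner if a staleness marker is written because a scrape failed or the series disappeared from the target. Datadog has explicit no-data alerts. Track scrape_duration_seconds and up per target so that up == 0 for 10 minutes triggers a freshness incident.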
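
The two checks above could be written as a Prometheus rule file along these lines. This is a minimal sketch: the group name and the TargetDown alert name are made up for illustration, and my_metric stands in for whatever metric the paired paging alert reads.

```yaml
groups:
  - name: freshness            # hypothetical group name
    rules:
      # Fires when the query has returned no my_metric series at all
      # for 5 consecutive minutes.
      - alert: StaleData
        expr: absent(my_metric)
        for: 5m
        annotations:
          summary: "my_metric has returned no samples for 5 minutes"

      # Fires per target when its scrape has been failing for 10 minutes.
      # TargetDown is an illustrative name, not one from this document.
      - alert: TargetDown
        expr: up == 0
        for: 10m
        annotations:
          summary: "Scrape target {{ $labels.instance }} (job {{ $labels.job }}) has been down for 10 minutes"
```

Note that absent() returns a result only when no matching series exists at all. If one instance out of several stops reporting, absent(my_metric) stays silent, which is why the per-target up == 0 rule sits alongside it.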

Routing freshness alerts

Freshness alerts route differently from service alerts. Send them to the team that owns the exporter, which is often but not always the team that owns the service. Set severity to sev2: detection is broken but the system may still be fine, so open a ticket rather than page. Auto-resolve as soon as the metric returns rather than requiring a manual click.
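
One way to express this routing in Alertmanager is sketched below. It assumes the freshness rules carry a label such as alert_class: freshness; that label, both receiver names, and the webhook URL are placeholders, not part of any existing setup.

```yaml
route:
  receiver: service-oncall               # hypothetical default receiver
  routes:
    # Freshness alerts go to the exporter owners as tickets, not pages.
    - matchers:
        - 'alert_class = "freshness"'    # hypothetical label set on freshness rules
      receiver: exporter-owners-tickets

receivers:
  - name: service-oncall
  - name: exporter-owners-tickets
    webhook_configs:
      - url: "https://ticketing.example.internal/hook"   # placeholder URL
        send_resolved: true              # forward the resolved notification
```

Prometheus stops firing the alert on its own once the expression no longer returns a result, so with send_resolved enabled the ticketing side can close the item without a manual click, provided the receiving system acts on resolved notifications.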

Specific examples

Three concrete examples calibrate the freshness windows. kube-state-metrics pod restart: the gap is about 30 seconds, so don’t alert on it. node_exporter network down: the gap is indefinite, so alert at 10 minutes. Prometheus federation broken: the gap is per cluster, so alert at 5 minutes for each cluster.
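
As a rough sketch, these windows map onto the for: duration of each rule. The job label values and alert names below are assumptions, and the federation rule assumes each cluster's Prometheus is scraped as its own target.

```yaml
groups:
  - name: freshness-windows      # hypothetical group name
    rules:
      # kube-state-metrics: a pod restart leaves a gap of about 30 seconds.
      # No rule targets that gap; any "for:" well above 30s absorbs it.

      # node_exporter unreachable: the gap is indefinite, alert at 10 minutes.
      - alert: NodeExporterStale
        expr: up{job="node-exporter"} == 0   # job label value is an assumption
        for: 10m

      # Federation broken: with one scrape target per cluster, each cluster's
      # series fires independently after 5 minutes.
      - alert: FederationStale
        expr: up{job="federate"} == 0        # job label value is an assumption
        for: 5m
```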

Apply to all paging alerts

Freshness checks are targeted at the paging tier. Every sev1 alert needs a paired freshness check, no exceptions. Skip ticket-tier alerts, where the cost-benefit isn’t there. Schedule a quarterly audit that searches the rule list for absent() expressions to confirm that every paging alert still has freshness coverage.