The SLI Data Quality Checks
SLIs are only as good as the data behind them. The checks below catch SLI metric corruption before bad SLIs drive bad decisions.
Missing data
An SLI based on metrics that stop arriving is silently broken. Detection: alert on stale metric data per SLI.
This is the most common cause of a falsely good SLI: 'we are at 99.99% because half the data is missing.'
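A minimal sketch of missing-data detection, assuming each SLI records the timestamp of its newest sample and an expected sample count per window. The names find_stale_slis and coverage, and the 5-minute threshold in the usage note, are hypothetical and not taken from any particular monitoring tool.

```python
from datetime import datetime, timedelta, timezone


def find_stale_slis(last_seen, max_age, now=None):
    """Return the names of SLIs whose newest sample is missing or older than max_age.

    last_seen maps an SLI name to the timestamp of its newest sample,
    or to None if no sample has ever arrived.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, ts in last_seen.items()
        if ts is None or now - ts > max_age
    )


def coverage(received, expected):
    """Fraction of expected samples that actually arrived in the window."""
    if expected == 0:
        return 1.0  # nothing was expected, so nothing is missing
    return received / expected
```

Calling find_stale_slis(last_seen, timedelta(minutes=5)) on each evaluation cycle gives the list of SLIs to alert on. A coverage value well below 1.0 suggests the reported SLI is computed from partial data and should not be trusted as-is.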
Out-of-range
Metrics sometimes start producing impossible values (a latency of 0, a success rate above 100%). Catch these with bounds checks.
Such values indicate instrumentation bugs that contaminate the SLI.
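A bounds check can be sketched as follows. The metric names latency_ms and success_rate are assumptions for illustration, and the success rate is assumed to be stored as a fraction between 0 and 1 rather than as a percentage.

```python
import math

# Hypothetical per-metric validity rules; adjust to the units actually emitted.
BOUNDS = {
    "latency_ms": lambda v: v > 0,               # a latency of exactly 0 is not plausible
    "success_rate": lambda v: 0.0 <= v <= 1.0,   # fraction, so above 1.0 means above 100%
}


def find_violations(metric, values):
    """Return (index, value) pairs for samples that are non-finite or out of bounds."""
    check = BOUNDS[metric]
    return [
        (i, v)
        for i, v in enumerate(values)
        if not (math.isfinite(v) and check(v))
    ]
```

Any non-empty result is a reason to exclude the affected samples from the SLI calculation and to inspect the instrumentation that produced them.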
Staleness on SLIs
An SLI based on yesterday's data is dangerous if used for today's decisions.
Display data age on every SLI dashboard so engineers see that data is stale before making decisions on it.
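One way to surface data age is to render a short label next to each SLI. This is only a sketch: the function name data_age_label and the default 5-minute warning threshold are hypothetical, and a real dashboard would use whatever templating it already provides.

```python
from datetime import datetime, timedelta, timezone


def data_age_label(newest_sample, now, warn_after=timedelta(minutes=5)):
    """Build a human-readable age label for the newest sample behind an SLI."""
    age = now - newest_sample
    minutes = int(age.total_seconds() // 60)
    status = "STALE" if age > warn_after else "fresh"
    return f"data age: {minutes} min ({status})"
```

For example, data_age_label(newest_sample, datetime.now(timezone.utc)) can be placed in the panel title so the age is visible before anyone reads the SLI value itself.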