The SLI Data Quality Checks
SLIs are only as good as the data behind them. These checks catch SLI metric corruption before bad SLIs drive bad decisions.
Missing data
SLI data quality checks are the discipline that prevents SLIs from producing false-good signals. SLIs are only as reliable as the data they consume: bad data produces bad SLIs, and the team's reliability commitment becomes meaningless. The discipline catches the data quality issues that mask real reliability problems.
What missing data looks like:
- An SLI based on metrics that stop arriving is silently broken.: When the metric source stops emitting, the SLI calculation continues but with missing data. The result is misleading; without detection, the team operates on a broken SLI.
- Detection: alert on stale metric data per SLI.: Each SLI's source metric is monitored for staleness. If the metric stops arriving, the SLI's stale-data alert fires; the team is notified that the SLI cannot be trusted.
- Most common cause of false-good SLIs.: The pattern is recurring. The team's dashboard says they are at 99.99%, but half the data is missing, so the actual reliability is unknown.
- "We are at 99.99% because half the data is missing.": The phrase captures the failure mode. The percentage looks good; the underlying data is broken; the team's confidence is unwarranted.
- Per-source health checks.: Each metric source has its own health check. The aggregate SLI's health depends on its sources; if any source is broken, the SLI is suspect.
Missing data is the most common failure mode. Detection is the discipline.
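One way to make missing data visible is to report a coverage figure next to the SLI itself. The sketch below is illustrative only: the `SourceWindow` shape, the `sli_with_coverage` function, the source names, and the 95% coverage threshold are all assumptions, not part of any particular SLO platform.

```python
from dataclasses import dataclass


@dataclass
class SourceWindow:
    """Samples seen from one metric source during an SLI window (hypothetical shape)."""
    name: str
    expected_samples: int  # how many samples the scrape interval implies
    received_samples: int  # how many samples actually arrived
    good_events: int
    total_events: int


def source_coverage(source: SourceWindow) -> float:
    """Fraction of expected samples that arrived for one source."""
    if source.expected_samples == 0:
        return 0.0
    return source.received_samples / source.expected_samples


def sli_with_coverage(sources: list[SourceWindow], min_coverage: float = 0.95) -> dict:
    """Return the SLI together with overall coverage and a trust flag."""
    good = sum(s.good_events for s in sources)
    total = sum(s.total_events for s in sources)
    expected = sum(s.expected_samples for s in sources)
    received = sum(s.received_samples for s in sources)

    # Per-source health check: any source below the threshold makes the SLI suspect.
    unhealthy = [s.name for s in sources if source_coverage(s) < min_coverage]
    sli = good / total if total else None
    return {
        "sli": sli,
        "coverage": received / expected if expected else 0.0,
        "unhealthy_sources": unhealthy,
        "trusted": sli is not None and not unhealthy,
    }


# One healthy source and one that stopped emitting entirely.
result = sli_with_coverage([
    SourceWindow("api-a", expected_samples=60, received_samples=60,
                 good_events=9999, total_events=10000),
    SourceWindow("api-b", expected_samples=60, received_samples=0,
                 good_events=0, total_events=0),
])
print(result)
```

In this example the SLI still computes to 99.99%, but coverage is 50% and `trusted` is false, which is the "99.99% because half the data is missing" case made explicit.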
Out-of-range
Some metrics produce impossible values when broken: a latency of 0, a success rate over 100%, or a negative request count. Bounds checks catch these values so that the SLI does not consume contaminated data.
- Metrics that suddenly produce impossible values.: A latency value of 0 milliseconds is impossible (network plus processing always takes some time). A success rate over 100% is impossible. The values indicate broken instrumentation.
- Latency of 0, success rate over 100%.: Specific patterns are well-known. Bounds checks for each metric type catch the impossible values.
- Catch with bounds checks.: The data quality check verifies values are within reasonable bounds. Out-of-bounds values are flagged; the underlying instrumentation is investigated.
- Indicates instrumentation bugs.: Out-of-range values are typically instrumentation bugs. Division by zero in the calculation, missing data interpreted as 0, and sign errors all produce out-of-range values.
- Contaminates the SLI.: An SLI that consumes out-of-range values inherits their error. The bounds check rejects those values before they reach the SLI calculation.
Out-of-range checks catch a specific class of instrumentation bug. Each is small but the cumulative effect is significant.
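A bounds check of this kind can be sketched as a small table of per-metric rules applied before values reach the SLI calculation. The metric names, the `RULES` table, the `split_valid` function, and the specific limits (such as the 60-second latency ceiling) are hypothetical and would need tuning per service.

```python
import math
from typing import Callable

# Hypothetical per-metric validity rules; the limits are illustrative, not recommendations.
RULES: dict[str, Callable[[float], bool]] = {
    "latency_ms": lambda v: 0 < v <= 60_000,  # 0 ms is treated as broken instrumentation
    "success_rate": lambda v: 0 <= v <= 1,    # stored as a ratio, so 1.0 means 100%
    "request_count": lambda v: v >= 0,        # counts cannot be negative
}


def split_valid(metric: str, values: list[float]) -> tuple[list[float], list[float]]:
    """Separate in-bounds values from out-of-bounds ones for one metric."""
    rule = RULES[metric]
    valid: list[float] = []
    rejected: list[float] = []
    for v in values:
        # NaN and infinity often come from division by zero or missing data.
        if math.isfinite(v) and rule(v):
            valid.append(v)
        else:
            rejected.append(v)
    return valid, rejected


valid, rejected = split_valid("latency_ms", [12.5, 0.0, 48.0, float("nan")])
print("use for SLI:", valid)
print("flag for investigation:", rejected)
```

Keeping the rejected values, rather than silently dropping them, gives the team something concrete to trace back to the instrumentation.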
Staleness on SLIs
Even when data is arriving and within range, it can be stale. SLIs should display their data age; engineers should see staleness before making decisions on the SLI.
- SLI based on yesterday's data is dangerous if used for today's decisions.: Today's SLI calculated from yesterday's data does not reflect today's reality. The team makes decisions on stale information; the decisions can be wrong.
- Display data age on every SLI dashboard.: Each SLI dashboard shows the age of the underlying data. Engineers see "data current as of 14:32" or similar; staleness is visible.
- Engineers see stale data before making decisions on it.: The visible data age prevents stale-data decisions. Engineers know to trust current data and to investigate stale data; the discipline is supported by the visualization.
- Alert on excessive staleness.: Beyond a threshold, staleness triggers an alert. SLI data more than 5 minutes old (or whatever the SLI's threshold is) produces an alert, and the team responds.
- Different SLIs have different staleness tolerances.: Real-time SLIs need fresh data; batch SLIs tolerate longer windows. The threshold matches the SLI's nature; the alerting is calibrated.
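The data-age label and the per-SLI staleness threshold can be derived from the timestamp of the newest sample. The following is a minimal sketch: the SLI names, the `MAX_AGE` table, the `data_age_status` function, and the 26-hour batch tolerance are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical staleness tolerances; real-time SLIs get tight limits, batch SLIs loose ones.
MAX_AGE = {
    "checkout_availability": timedelta(minutes=5),
    "nightly_export_success": timedelta(hours=26),
}


def data_age_status(sli_name: str, last_sample_at: datetime, now: datetime) -> dict:
    """Build the dashboard label and the stale flag for one SLI."""
    age = now - last_sample_at
    return {
        "label": f"data current as of {last_sample_at:%H:%M}",
        "age_seconds": age.total_seconds(),
        "stale": age > MAX_AGE[sli_name],  # True means the staleness alert should fire
    }


now = datetime.now(timezone.utc)
status = data_age_status("checkout_availability", now - timedelta(minutes=8), now)
print(status["label"], "| stale:", status["stale"])
```

The same eight-minute-old sample that trips the real-time threshold would be well within tolerance for the batch SLI, which is why the limit is looked up per SLI rather than set globally.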
SLI data quality checks are one of those reliability disciplines that ensure the reliability metrics themselves are reliable. Nova AI Ops integrates with SLO platforms, performs data quality checks across SLI sources, and produces the data-trustworthy view that the team uses for actual reliability decisions.