SLO Testing in Pre-Prod
Test the SLO machinery before you rely on it.
Alerts
The SLO machinery is itself software, and software that has not been tested does not work. Most teams treat their SLO alerts and dashboards as set-and-forget configuration. Then a real incident hits, the alert does not fire, and the team finds out their reliability monitoring had been silently broken for months. The fix is to test the SLO machinery the same way you test code: in a pre-production environment, with deliberately injected failures.
What testing the alerting layer requires:
- Inject failures deliberately: In the test environment, force the conditions that should trigger an SLO alert. Drive the error rate above the threshold for the configured window. Drive latency past the p99 target. Block the dependency that gates a critical path. Each scenario tests a different alert path (a scripted sketch follows this list).
- Verify the alert actually fires: Watch the alert manager. Confirm the alert appears with the right severity, the right routing, and the right payload. An alert configured but not firing is the same as no alert at all; verification is the only proof that the configuration works.
- Verify the alert routes correctly: The page reaches the right on-call, in the right channel, with the right payload. The Slack message has the playbook link; the PagerDuty incident has the right severity. The alert reaches the human who is supposed to act on it within the expected delay (the second sketch after this list shows one way to check the paging side).
- Verify the alert clears: After the injected failure recovers, the alert resolves cleanly. Stale alerts that linger after the underlying issue is fixed become noise that desensitizes the team. The resolve path is as important as the firing path.
- Build confidence before the real incident: The first time the alert fires for real should not be the first time you have seen it work. Building familiarity in test means the on-call already knows what the alert looks like, where the runbook is, and how to acknowledge it.
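The injection and fire/clear checks are scriptable. Below is a minimal sketch, assuming a Prometheus-style Alertmanager (which exposes active alerts at `/api/v2/alerts`) and a pre-prod service that supports fault injection via a request header. The service URL, the `X-Fault-Inject` header, and the alert name are hypothetical stand-ins for whatever your stack provides.

```python
"""Sketch: drive error rate past the SLO threshold in pre-prod, then
confirm the alert fires in Alertmanager and clears after recovery."""
import time

import requests

SERVICE_URL = "https://checkout.preprod.example.com/api/orders"  # hypothetical
ALERTMANAGER = "https://alertmanager.preprod.example.com"        # hypothetical
ALERT_NAME = "CheckoutAvailabilitySLOBurn"                       # hypothetical


def inject_errors(duration_s: int, rps: int = 20) -> None:
    """Send requests the service is configured to fail, pushing the
    error rate past the alert threshold. Run longer than the alert's
    `for:` duration so the alert has time to fire."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        try:
            requests.get(SERVICE_URL,
                         headers={"X-Fault-Inject": "http-500"},
                         timeout=5)
        except requests.RequestException:
            pass  # client-side errors are fine; the service-side 500s are the point
        time.sleep(1 / rps)


def alert_is_active(name: str) -> bool:
    """Check Alertmanager's v2 API for an active alert with this name."""
    resp = requests.get(f"{ALERTMANAGER}/api/v2/alerts", timeout=10)
    resp.raise_for_status()
    return any(a["labels"].get("alertname") == name
               and a["status"]["state"] == "active"
               for a in resp.json())


def wait_for(predicate, timeout_s: int, interval_s: int = 15) -> bool:
    """Poll until the predicate holds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    # Fire-up path: inject long enough to cover the alert's window.
    inject_errors(duration_s=10 * 60)
    assert wait_for(lambda: alert_is_active(ALERT_NAME), timeout_s=5 * 60), \
        "alert never fired: configured but broken"
    # Clear-up path: injection has stopped; confirm the alert resolves.
    assert wait_for(lambda: not alert_is_active(ALERT_NAME), timeout_s=15 * 60), \
        "alert did not clear after recovery"
    print("alert fired and cleared as expected")
```

Run as a scheduled pre-prod job, this turns the fire-and-clear check from a one-time exercise into a regression test for the alerting configuration.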
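The routing check can be scripted against the pager's API as well. Here is a sketch assuming PagerDuty as the paging provider; the API token, service ID, and expected urgency are placeholders, and the same pattern applies to any provider with a queryable incidents endpoint.

```python
"""Sketch: after the injected failure, confirm the page actually
reached the paging provider with the expected severity."""
import os

import requests

PD_TOKEN = os.environ["PAGERDUTY_API_TOKEN"]
PD_SERVICE_ID = "PABC123"  # hypothetical pre-prod service ID


def triggered_incidents() -> list:
    """List currently triggered incidents for the pre-prod service."""
    resp = requests.get(
        "https://api.pagerduty.com/incidents",
        headers={
            "Authorization": f"Token token={PD_TOKEN}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        params={"service_ids[]": PD_SERVICE_ID, "statuses[]": "triggered"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["incidents"]


incidents = triggered_incidents()
assert incidents, "page never reached the paging provider"
assert incidents[0]["urgency"] == "high", "wrong severity on the paging side"
```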
Testing alerts is the cheapest insurance against the most embarrassing class of incident: the one that lasted hours longer than necessary because the alert that was supposed to catch it never fired.
Dashboards
Dashboards have their own failure mode: they show numbers that are wrong but look right. A dashboard that displays 99.99% availability when the service is actually 99.5% is worse than no dashboard, because it actively misleads the team. Verifying dashboard accuracy in pre-prod catches this class of issue before it matters.
- Verify accuracy against a known scenario: In test, inject a known number of failures and verify the dashboard reflects the expected SLO impact. If you inject 10 failures out of 1,000 requests, the dashboard should show 99.0% availability for that window; if it shows 99.9% or 95%, the metric pipeline is wrong (the first sketch after this list automates this check).
- Verify there are no silent gaps: Some metric pipelines drop data on certain failure modes (a metric collector that crashes loses the data it had buffered; an aggregation that times out shows lower-than-actual error rates). Test for these by injecting failures during simulated metric-pipeline outages and confirming the dashboard still reports correctly.
- Verify the per-dimension breakdown: The composite SLO might look right while a per-dimension number is wrong. Test each dimension separately: latency-only failures should show on the latency tile, availability-only failures on the availability tile. Mixing or hiding dimensions is a common bug.
- Verify the time-window math: Rolling 28-day windows are tricky. The dashboard should compute correctly when traffic is uneven across the window, when the window edge crosses a high-traffic period, and when there are gaps in the data. Test these explicitly (the second sketch after this list works through the uneven-traffic case).
- Verify drilldown: Clicking from the SLO tile to the underlying data should resolve to the right query, the right time range, and the right service. Broken drilldowns make the dashboard a dead end during incidents.
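The known-scenario check is scriptable too. A minimal sketch, assuming the dashboard is fed by Prometheus; the metric name, labels, and tolerance are hypothetical. The key idea is that the test knows the number the tile should show (990 good out of 1,000 is 99.0%) and asserts the pipeline produces it.

```python
"""Sketch: after injecting exactly 10 failures out of 1,000 requests
in pre-prod, verify the metric pipeline behind the dashboard tile
reports the availability that scenario implies."""
import requests

PROM = "https://prometheus.preprod.example.com"  # hypothetical


def prom_scalar(query: str) -> float:
    """Run an instant query and return the first result as a float."""
    resp = requests.get(f"{PROM}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


# Same queries the dashboard tile uses (hypothetical metric and labels).
total = prom_scalar('sum(increase(http_requests_total{job="checkout"}[15m]))')
errors = prom_scalar('sum(increase(http_requests_total{job="checkout",code=~"5.."}[15m]))')
availability = 100 * (1 - errors / total)

# 990 / 1,000 successes -> 99.0%. Allow slack for scrape-interval jitter.
assert abs(availability - 99.0) < 0.2, \
    f"expected ~99.0%, pipeline shows {availability:.2f}%"
```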
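And here is the uneven-traffic trap in the window math, worked as a self-contained example with made-up day-level counts. A dashboard that averages per-day availability across the window overstates reliability whenever a high-traffic day has the incident.

```python
"""Worked example: why rolling-window SLO math must weight by request
volume. The counts are invented; the failure mode is real."""

# (total_requests, failed_requests) per day across a 28-day window:
# 27 quiet days, then one high-traffic day with an incident.
days = [(10_000, 10)] * 27 + [(1_000_000, 20_000)]

# Wrong: unweighted mean of daily availability. Quiet days dominate.
naive = sum(1 - f / t for t, f in days) / len(days) * 100

# Right: aggregate good events over aggregate total events.
total = sum(t for t, _ in days)
failed = sum(f for _, f in days)
weighted = (1 - failed / total) * 100

print(f"naive daily average: {naive:.3f}%")    # ~99.832%
print(f"event-weighted SLO:  {weighted:.3f}%") # ~98.404%
```

The two numbers differ by well over a percentage point on the same data, which is exactly the class of wrong-but-plausible figure the known-scenario test exists to catch.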
Dashboard testing is harder than alert testing because dashboards have many more code paths. Investing in it pays back the first time you would have made a wrong decision based on a wrong number.
Playbooks
The third layer to test is the human procedure. When an SLO alert fires, the on-call follows a playbook. The playbook may be subtly wrong, may reference tools that have changed, may have outdated escalation paths. Test playbooks like you test alerts: walk through them, in a non-emergency, against a simulated incident.
- Walk the runbook end-to-end: The alert payload links to the runbook, and the runbook walks through the standard triage steps. Verify each step is reachable, current, and produces useful information by running through it end-to-end during the test.
- Game day exercises: Schedule a half-day per quarter where the on-call walks through three or four incident scenarios using only the playbook. Anything the playbook misses or gets wrong gets fixed before the next quarter. The exercise is the test; updating the playbook is the result.
- Test under realistic constraints: The engineer running the exercise should not be the senior engineer who wrote the playbook but a rotation engineer who has not seen the scenario before. Their experience is the closest approximation to the actual on-call experience during a real incident.
- Track playbook freshness: Every playbook has a "last verified" date, and anything more than six months old is suspect. The verification timestamp lives on the playbook itself and is automatically refreshed when a test exercise runs successfully against it (a sketch of an automated freshness check follows this list).
- Cross-team playbook tests: Some playbooks involve calling another team's on-call. Test the cross-team coordination, not just the single-team workflow: the page reaches the right team, they engage in the expected timeframe, and the handoff goes cleanly.
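The freshness check is easy to automate. A minimal sketch, assuming playbooks are markdown files carrying a `last_verified: YYYY-MM-DD` line in their front matter; the directory layout and field name are hypothetical conventions, not a standard.

```python
"""Sketch: flag playbooks whose 'last_verified' stamp is missing or
older than six months."""
import re
from datetime import date, timedelta
from pathlib import Path

PLAYBOOK_DIR = Path("runbooks/")   # hypothetical location
MAX_AGE = timedelta(days=182)      # ~6 months before a playbook is suspect
STAMP = re.compile(r"^last_verified:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE)

stale = []
for playbook in PLAYBOOK_DIR.glob("**/*.md"):
    match = STAMP.search(playbook.read_text())
    if not match:
        stale.append((playbook, "never verified"))
        continue
    verified = date.fromisoformat(match.group(1))
    if date.today() - verified > MAX_AGE:
        stale.append((playbook, f"last verified {verified}"))

for playbook, reason in stale:
    print(f"SUSPECT: {playbook} ({reason})")
```

Wiring this into CI, and having the game-day harness rewrite the stamp on a successful walkthrough, keeps the "last verified" date honest without anyone maintaining it by hand.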
Testing the SLO machinery in pre-production is the discipline that turns reliability monitoring from configuration into a practiced operational system. Nova AI Ops includes injection tooling for SLO test scenarios, validates dashboard accuracy against expected outcomes, and tracks playbook freshness so the reliability practice itself is reliable.