The Alerts-as-Data Pattern
Treat alert events as a data stream and they become queryable history: powerful raw material for analysis.
Alerts as a data stream
Every alert event (fired, acked, resolved, snoozed) is a row in a table. Pipe Alertmanager, PagerDuty, or Opsgenie webhooks into BigQuery, Snowflake, or ClickHouse.
Schema: alert_id, fired_at, ack_at, resolved_at, severity, owner_team, runbook_url, related_service. Add labels as JSON for flexible querying.
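A minimal DDL sketch of that table, in roughly BigQuery-flavored SQL. The types and the one-row-per-alert-occurrence layout are assumptions; adjust for your warehouse.

```sql
-- One row per alert occurrence; ack_at and resolved_at are filled in
-- as lifecycle events arrive from the webhook.
CREATE TABLE alert_events (
  alert_id        STRING NOT NULL,  -- stable ID of the alert definition
  fired_at        TIMESTAMP NOT NULL,
  ack_at          TIMESTAMP,        -- NULL until acknowledged
  resolved_at     TIMESTAMP,        -- NULL until resolved
  severity        STRING NOT NULL,  -- e.g. critical | warning | info
  owner_team      STRING,
  runbook_url     STRING,
  related_service STRING,
  labels          JSON              -- flexible labels, schema-checked at ingest
);
```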
Retain 18 months. Long-window analysis (year-over-year noise, seasonality) needs at least a full year of data, and the extra six months leaves headroom for comparisons.
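Enforcing the window can be one scheduled statement; a sketch, with the cutoff matching the policy above:

```sql
-- Scheduled retention job: drop rows older than 18 months.
DELETE FROM alert_events
WHERE DATE(fired_at) < DATE_SUB(CURRENT_DATE(), INTERVAL 18 MONTH);
```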
Queries that pay for themselves
Top 10 noisiest alerts last quarter. Use it to drive cleanup.
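A sketch of the noisiest-alerts query; the 90-day window stands in for "last quarter":

```sql
SELECT alert_id, related_service, COUNT(*) AS fires
FROM alert_events
WHERE fired_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY alert_id, related_service
ORDER BY fires DESC
LIMIT 10;
```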
Mean time to ack and to resolve, broken down by team, severity, and time of day. Spot rotations that are quietly burning out.
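One way to cut it, using the schema above. The hour-of-day bucket is the crude but useful version; AVG skips rows that never resolved.

```sql
SELECT
  owner_team,
  severity,
  EXTRACT(HOUR FROM fired_at) AS hour_of_day,
  AVG(TIMESTAMP_DIFF(ack_at, fired_at, SECOND)) / 60      AS mtta_minutes,
  AVG(TIMESTAMP_DIFF(resolved_at, fired_at, SECOND)) / 60 AS mttr_minutes
FROM alert_events
WHERE ack_at IS NOT NULL
GROUP BY owner_team, severity, hour_of_day
ORDER BY mtta_minutes DESC;
```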
Correlation: which alerts fire together. The pairs reveal hidden dependencies and let you collapse 5 alerts into 1.
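A self-join sketch; the 5-minute window and the minimum pair count are arbitrary thresholds to tune:

```sql
-- Pairs of distinct alerts that fired within 5 minutes of each other.
SELECT
  a.alert_id AS alert_a,
  b.alert_id AS alert_b,
  COUNT(*)   AS co_fires
FROM alert_events a
JOIN alert_events b
  ON a.alert_id < b.alert_id  -- count each pair once, never an alert with itself
 AND ABS(TIMESTAMP_DIFF(a.fired_at, b.fired_at, SECOND)) <= 300
GROUP BY alert_a, alert_b
HAVING COUNT(*) >= 10         -- ignore coincidences
ORDER BY co_fires DESC;
```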
What to put on the dashboard
Alert volume per week with a 13-week rolling average. Spikes warrant a postmortem.
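The weekly series with its rolling average is one window function; a sketch:

```sql
WITH weekly AS (
  SELECT DATE_TRUNC(DATE(fired_at), WEEK) AS week, COUNT(*) AS fires
  FROM alert_events
  GROUP BY week
)
SELECT
  week,
  fires,
  AVG(fires) OVER (
    ORDER BY week ROWS BETWEEN 12 PRECEDING AND CURRENT ROW
  ) AS rolling_13wk_avg
FROM weekly
ORDER BY week;
```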
Per-team page count per on-call shift. The team carrying 30 pages a shift will quit.
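Pages per shift needs the schedule joined in. Here oncall_shifts (team, shift_start, shift_end) is a hypothetical table standing in for your scheduler's export:

```sql
SELECT s.team, s.shift_start, COUNT(a.alert_id) AS pages
FROM oncall_shifts s
LEFT JOIN alert_events a
  ON a.owner_team = s.team
 AND a.fired_at >= s.shift_start
 AND a.fired_at <  s.shift_end
GROUP BY s.team, s.shift_start
ORDER BY pages DESC;
```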
Alerts fired with no action taken, weighted by severity. This is your noise budget.
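A sketch of the noise-budget query. "No action" is read here as resolved without a human ack, and the severity weights are illustrative:

```sql
SELECT
  alert_id,
  COUNT(*) AS unactioned_fires,
  SUM(CASE severity
        WHEN 'critical' THEN 5
        WHEN 'warning'  THEN 2
        ELSE 1
      END) AS noise_score
FROM alert_events
WHERE ack_at IS NULL
  AND resolved_at IS NOT NULL  -- it cleared on its own
GROUP BY alert_id
ORDER BY noise_score DESC;
```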
Retention and PII
Alert payloads can carry user IDs, IPs, and error messages with email addresses embedded. Strip PII at ingest, not at query time.
Use a deny-list on labels and a strict schema on the JSON column. Reject alerts that drop unstructured data into the stream.
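A sketch of the ingest-layer transform, assuming raw webhooks land in a staging table alert_events_raw and only scrubbed rows reach alert_events. The denied label paths are examples, and JSON_REMOVE is the BigQuery spelling; other warehouses have equivalents.

```sql
-- Only scrubbed rows reach the queryable table; raw payloads stay in
-- staging (or are dropped). The deny-list below is illustrative, not exhaustive.
INSERT INTO alert_events
SELECT
  alert_id, fired_at, ack_at, resolved_at, severity,
  owner_team, runbook_url, related_service,
  JSON_REMOVE(labels, '$.user_id', '$.client_ip', '$.email') AS labels
FROM alert_events_raw;
```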
Encrypt at rest, restrict access to the on-call analytics group. Audit access quarterly.
When to invest in this
If your alert volume is above 50 a week or your team has more than 3 rotations, build the pipeline. The ROI is the alerts you retire.
Smaller teams can use PagerDuty Insights or the equivalent until volume justifies a custom warehouse.
Don't analyse alerts in spreadsheets. The work to keep a sheet current is more than the work to wire up a webhook.