Alerts as Data Pattern

Treat alert events as a stream of data. Once they are queryable, they become a powerful source for analysis.

Alerts as a data stream

Treat every alert event as a row in a table. Pipe Alertmanager, PagerDuty, or Opsgenie webhooks into BigQuery, Snowflake, or ClickHouse. Give the table a schema with alert_id, fired_at, ack_at, resolved_at, severity, owner_team, runbook_url, and related_service, plus the labels as a JSON column for flexible querying. Retain 18 months, because long-window analysis (year-over-year noise, seasonality) needs at least a year of data.
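As a minimal sketch of the ingest step, the function below flattens an Alertmanager-style webhook payload into rows matching the schema above. The payload fields (alerts, status, startsAt, endsAt, fingerprint, labels, annotations) follow the Alertmanager webhook format. The label names severity, team, and service, the runbook_url annotation, and the composite alert_id are assumptions for illustration, so adjust them to your own labelling conventions. Alertmanager has no acknowledgement concept, so ack_at stays empty here and would be filled from the paging tool.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRow:
    alert_id: str
    fired_at: str
    ack_at: Optional[str]
    resolved_at: Optional[str]
    severity: Optional[str]
    owner_team: Optional[str]
    runbook_url: Optional[str]
    related_service: Optional[str]
    labels: str  # JSON-encoded label map for flexible querying


def rows_from_alertmanager(payload: dict) -> list:
    """Turn one webhook payload into a list of AlertRow objects."""
    rows = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        resolved = alert.get("status") == "resolved"
        rows.append(AlertRow(
            # Assumption: fingerprint plus start time identifies one firing,
            # so the later "resolved" webhook can upsert the same row.
            alert_id=f"{alert.get('fingerprint', '')}:{alert['startsAt']}",
            fired_at=alert["startsAt"],
            ack_at=None,  # not present in Alertmanager payloads
            # endsAt is only meaningful once the alert has resolved.
            resolved_at=alert.get("endsAt") if resolved else None,
            severity=labels.get("severity"),        # assumed label name
            owner_team=labels.get("team"),          # assumed label name
            runbook_url=annotations.get("runbook_url"),  # assumed annotation
            related_service=labels.get("service"),  # assumed label name
            labels=json.dumps(labels, sort_keys=True),
        ))
    return rows
```

Keeping the well-known fields as typed columns and the rest as JSON means the common queries stay cheap while unusual labels remain available.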

Queries that pay for themselves

Three queries pay back the export: the top 10 noisiest alerts last quarter (drives cleanup); mean time to ack and resolve, broken down by team, severity, and time of day (spots rotations that are quietly burning out); and correlation of which alerts fire together (pairs reveal hidden dependencies and let you collapse 5 alerts into 1).
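The three queries can be sketched against an in-memory SQLite table standing in for the warehouse. Everything here is illustrative: the alert_name column, the use of epoch seconds for timestamps, and the 5-minute co-firing window are assumptions, and the SQL would need adapting to the BigQuery, Snowflake, or ClickHouse dialect. The mean-time-to-ack query is grouped by team only; severity and time of day would be further GROUP BY columns.

```python
import sqlite3

NOISIEST_SQL = """
SELECT alert_name, COUNT(*) AS fires
FROM alerts
WHERE fired_at >= ? AND fired_at < ?
GROUP BY alert_name
ORDER BY fires DESC, alert_name
LIMIT ?
"""

MTTA_BY_TEAM_SQL = """
SELECT owner_team, AVG(ack_at - fired_at) AS mean_seconds_to_ack
FROM alerts
WHERE ack_at IS NOT NULL
GROUP BY owner_team
ORDER BY owner_team
"""

# Self-join: count pairs of different alerts that fired within the window.
# The a.alert_name < b.alert_name condition counts each pair once.
COFIRING_SQL = """
SELECT a.alert_name, b.alert_name, COUNT(*) AS together
FROM alerts AS a
JOIN alerts AS b
  ON a.alert_name < b.alert_name
 AND ABS(a.fired_at - b.fired_at) <= ?
GROUP BY a.alert_name, b.alert_name
ORDER BY together DESC, a.alert_name, b.alert_name
"""


def load_alerts(rows):
    """Create the stand-in table; timestamps are epoch seconds."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE alerts ("
        "alert_name TEXT, owner_team TEXT, fired_at INTEGER, ack_at INTEGER)"
    )
    con.executemany("INSERT INTO alerts VALUES (?, ?, ?, ?)", rows)
    return con


def noisiest(con, since, until, limit=10):
    return con.execute(NOISIEST_SQL, (since, until, limit)).fetchall()


def mean_time_to_ack(con):
    return con.execute(MTTA_BY_TEAM_SQL).fetchall()


def cofiring_pairs(con, window_seconds=300):
    return con.execute(COFIRING_SQL, (window_seconds,)).fetchall()
```

The self-join is quadratic in the number of events, so on a real warehouse it is worth restricting it to a bounded date range first.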

What to put on the dashboard

Three panels make alert health visible: alert volume per week with a 13-week rolling average (spikes warrant a postmortem); page count per team per on-call shift (a team carrying 30 pages a shift will quit); and alerts that fired with no action taken, weighted by severity (this is your noise budget).
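The first panel's rolling average is simple enough to sketch directly. The functions below compute a trailing average over weekly counts and flag spike weeks. The spike rule (a week exceeding twice the average of the preceding weeks in the window) is an assumption chosen for illustration; the text does not define what counts as a spike.

```python
def rolling_average(weekly_counts, window=13):
    """Trailing mean over up to `window` weeks, including the current week."""
    averages = []
    running = 0
    for i, count in enumerate(weekly_counts):
        running += count
        if i >= window:
            # Drop the week that just fell out of the window.
            running -= weekly_counts[i - window]
        averages.append(running / min(i + 1, window))
    return averages


def spike_weeks(weekly_counts, window=13, factor=2.0):
    """Indices of weeks whose count exceeds `factor` times the mean of the
    preceding weeks in the window (hypothetical threshold)."""
    spikes = []
    for i, count in enumerate(weekly_counts):
        prior = weekly_counts[max(0, i - window):i]
        if prior and count > factor * (sum(prior) / len(prior)):
            spikes.append(i)
    return spikes
```

Comparing each week against the preceding weeks only, rather than a window that includes it, keeps a large spike from raising its own baseline.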

Retention and PII

Retaining alert data carries security obligations. Alert payloads can carry user IDs, IP addresses, and error messages containing email addresses. Strip PII at ingest, not at query time, because once it is in the warehouse you cannot un-leak it. Use a deny-list on labels and a strict schema on the JSON column. Encrypt at rest, restrict access to the on-call analytics group, and audit access quarterly.
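A minimal sketch of scrubbing at ingest might look like the following. The deny-listed label names are hypothetical, the "strict schema" is reduced here to "string keys and string values only", and the two regular expressions are deliberately simple: they catch common email and IPv4 shapes, not every form of PII, so treat this as a starting point rather than a complete filter.

```python
import re

# Hypothetical deny-list; replace with the label names your alerts use.
DENIED_LABELS = {"user_id", "email", "client_ip"}

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text):
    """Replace email addresses and IPv4 addresses with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return IPV4_RE.sub("[ip]", text)


def scrub_labels(labels):
    """Return a copy of `labels` that is safe to store in the JSON column."""
    clean = {}
    for key, value in labels.items():
        if key in DENIED_LABELS:
            continue  # deny-list: drop the label entirely
        if not isinstance(key, str) or not isinstance(value, str):
            continue  # strict schema: flat string-to-string map only
        clean[key] = redact(value)
    return clean
```

Running this before the row is written means the raw values never reach the warehouse, which is the point of scrubbing at ingest rather than at query time.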

When to invest in this

The investment threshold is alert volume and team count. If alert volume is above 50 a week or there are more than 3 on-call rotations, build the pipeline (the ROI is the alerts you retire). Smaller teams can use PagerDuty Insights or an equivalent until volume justifies a custom warehouse. Don't analyse alerts in spreadsheets, because keeping a sheet current takes more work than wiring up a webhook.