Alerts as Data Pattern
Export alert events as a stream of rows so they can be analysed like any other dataset.
Alerts as a data stream
Treat every alert event as a row in a table. Pipe Alertmanager, PagerDuty, or Opsgenie webhooks into BigQuery, Snowflake, or ClickHouse. Use a schema with alert_id, fired_at, ack_at, resolved_at, severity, owner_team, runbook_url, and related_service, plus labels as a JSON column for flexible querying. Retain 18 months of data: long-window analysis (year-over-year noise, seasonality) needs at least a year, and the extra six months gives a buffer.
- Per-event row. Each fired, acked, resolved, or snoozed event is its own row in the table.
- Webhook into warehouse. Alertmanager, PagerDuty, Opsgenie webhooks; BigQuery, Snowflake, ClickHouse.
- Stable schema plus JSON labels. Structured fields plus flexible JSON for label expansion.
- 18-month retention. Long-window analysis needs a year minimum; 18 months gives the buffer.
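A minimal sketch of the per-event row, assuming a simplified webhook payload. The payload keys and the event_type column are assumptions for illustration; real Alertmanager, PagerDuty, and Opsgenie payloads each have their own shape and need a small adapter in front of this.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AlertEventRow:
    alert_id: str
    event_type: str  # assumed column: "fired", "acked", "resolved" or "snoozed"
    fired_at: Optional[str]
    ack_at: Optional[str]
    resolved_at: Optional[str]
    severity: str
    owner_team: str
    runbook_url: str
    related_service: str
    labels: str  # JSON text, destined for the flexible JSON column


def to_row(payload: dict) -> AlertEventRow:
    """Map one (hypothetical) webhook payload to one warehouse row."""
    return AlertEventRow(
        alert_id=payload["alert_id"],
        event_type=payload["event_type"],
        fired_at=payload.get("fired_at"),
        ack_at=payload.get("ack_at"),
        resolved_at=payload.get("resolved_at"),
        severity=payload.get("severity", "unknown"),
        owner_team=payload.get("owner_team", "unknown"),
        runbook_url=payload.get("runbook_url", ""),
        related_service=payload.get("related_service", "unknown"),
        # Everything that is not a structured field stays queryable as JSON.
        labels=json.dumps(payload.get("labels", {}), sort_keys=True),
    )


example = {
    "alert_id": "a-1",
    "event_type": "fired",
    "fired_at": "2024-01-08T10:00:00Z",
    "severity": "critical",
    "owner_team": "payments",
    "runbook_url": "https://runbooks.example.com/high-latency",
    "related_service": "checkout",
    "labels": {"region": "eu-west-1"},
}
row = to_row(example)
print(asdict(row))
```

Keeping the structured columns fixed and pushing everything else into labels means new label keys never force a schema migration.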
Queries that pay for themselves
Three queries pay back the export: the top 10 noisiest alerts over the last quarter (drives cleanup); mean time to ack and to resolve, broken down by team, severity, and time of day (spots rotations quietly burning out); and co-firing correlation, that is, which alerts fire together (pairs reveal hidden dependencies and let you collapse 5 alerts into 1).
- Top 10 noisy alerts. Last quarter; drives the cleanup ritual.
- MTTA and MTTR by team, severity, time-of-day. Spots rotations quietly burning out.
- Co-firing correlation. Which alerts fire together; pairs reveal hidden dependencies.
- Each query stored as a view. Commit the queries to the warehouse so they stay reusable.
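The three queries can be sketched in plain Python over in-memory rows; in the warehouse they would be SQL views. This is an illustration under assumptions: the alert_name key (for example, taken from the labels) and the 5-minute co-firing window are not part of the schema above. Mean time to resolve works the same way as mean time to ack, with resolved_at in place of ack_at.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def top_noisy(rows, n=10):
    """Alert names ranked by how often they fired."""
    return Counter(r["alert_name"] for r in rows).most_common(n)


def mean_time_to_ack(rows):
    """Mean seconds from fire to ack, per (team, severity, hour of day)."""
    buckets = defaultdict(list)
    for r in rows:
        if r["ack_at"] is None:
            continue  # never acked: excluded from the mean
        key = (r["owner_team"], r["severity"], r["fired_at"].hour)
        buckets[key].append((r["ack_at"] - r["fired_at"]).total_seconds())
    return {k: sum(v) / len(v) for k, v in buckets.items()}


def co_firing(rows, window=timedelta(minutes=5)):
    """Count pairs of different alerts that fired within `window` of each other."""
    ordered = sorted(rows, key=lambda r: r["fired_at"])
    pairs = Counter()
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["fired_at"] - first["fired_at"] > window:
                break  # rows are sorted, so later ones are even further away
            if first["alert_name"] != second["alert_name"]:
                pair = tuple(sorted((first["alert_name"], second["alert_name"])))
                pairs[pair] += 1
    return pairs


def at(hour, minute):
    return datetime(2024, 1, 8, hour, minute)


rows = [
    {"alert_name": "A", "owner_team": "pay", "severity": "critical",
     "fired_at": at(10, 0), "ack_at": at(10, 4)},
    {"alert_name": "B", "owner_team": "pay", "severity": "critical",
     "fired_at": at(10, 2), "ack_at": at(10, 12)},
    {"alert_name": "A", "owner_team": "pay", "severity": "critical",
     "fired_at": at(12, 0), "ack_at": at(12, 6)},
    {"alert_name": "B", "owner_team": "pay", "severity": "critical",
     "fired_at": at(12, 1), "ack_at": None},
    {"alert_name": "C", "owner_team": "web", "severity": "warning",
     "fired_at": at(15, 0), "ack_at": at(15, 1)},
]

print(top_noisy(rows, 2))
print(mean_time_to_ack(rows))
print(co_firing(rows))
```

In this sample, A and B fire within the window twice, which is the kind of pair worth checking for a shared root cause.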
What to put on the dashboard
Three panels make alert health visible: alert volume per week with a 13-week rolling average (spikes warrant a postmortem); page count per team per on-call shift (a team carrying 30 pages a shift will quit); and alerts that fired with no action taken, weighted by severity (this is your noise budget).
- Volume with rolling average. Per week with 13-week rolling; spikes warrant postmortem.
- Per-team pages per shift. A team at 30 pages a shift will quit; this is the burnout proxy.
- No-action weighted alerts. Severity-weighted; this is the noise budget.
- Per-team dashboard view. Each team sees its own slice, which supports targeted action.
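The volume panel reduces to a trailing average over weekly counts. The sketch below assumes the counts are already aggregated per week, oldest first. The spike rule (a week exceeding 1.5 times the previous rolling average) is a hypothetical threshold for illustration; the text above does not define what counts as a spike.

```python
def rolling_average(weekly_counts, window=13):
    """Trailing mean over `window` weeks; None until a full window exists."""
    averages = []
    for i in range(len(weekly_counts)):
        if i + 1 < window:
            averages.append(None)
        else:
            chunk = weekly_counts[i + 1 - window : i + 1]
            averages.append(sum(chunk) / window)
    return averages


def spike_weeks(weekly_counts, window=13, factor=1.5):
    """Indices of weeks whose volume exceeds `factor` times the rolling
    average as of the week before (factor is an assumed threshold)."""
    averages = rolling_average(weekly_counts, window)
    return [
        i
        for i in range(1, len(weekly_counts))
        if averages[i - 1] is not None
        and weekly_counts[i] > factor * averages[i - 1]
    ]


counts = [10] * 13 + [30]  # thirteen quiet weeks, then a jump
averages = rolling_average(counts)
print(averages[12], spike_weeks(counts))
```

Comparing each week against the average as of the previous week keeps a spike from inflating its own baseline.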
Retention and PII
Retention carries security obligations. Alert payloads can carry user IDs, IPs, and error messages containing email addresses. Strip PII at ingest, not at query time, because once it’s in the warehouse you cannot un-leak it. Use a deny-list on labels and a strict schema on the JSON column. Encrypt at rest, restrict access to the on-call analytics group, and audit access quarterly.
- Strip PII at ingest. Once in warehouse, cannot un-leak; the discipline is upstream.
- Deny-list plus strict schema. Reject alerts that drop unstructured data into the stream.
- Encrypt at rest, restrict access. On-call analytics group; audit access quarterly.
- Quarterly access audit. Track who reads the data; this supports compliance.
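A minimal sketch of ingest-time scrubbing, assuming labels arrive as a flat dict. The deny-list entries are hypothetical, and the email and IPv4 patterns are deliberately simple; they illustrate the idea and are not an exhaustive PII detector.

```python
import re

# Hypothetical deny-list: label keys dropped outright at ingest.
DENIED_LABELS = {"user_id", "email", "client_ip"}

# Simplistic patterns, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub_labels(labels):
    """Drop denied keys, enforce string-only values, redact PII in text."""
    clean = {}
    for key, value in labels.items():
        if key in DENIED_LABELS:
            continue
        if not isinstance(value, str):
            # Strict schema: nested or non-string values are rejected.
            raise ValueError(
                f"label {key!r} must be a string, got {type(value).__name__}"
            )
        value = EMAIL.sub("[email]", value)
        value = IPV4.sub("[ip]", value)
        clean[key] = value
    return clean


raw = {
    "service": "checkout",
    "user_id": "u-123",
    "message": "login failed for bob@example.com from 10.0.0.7",
}
scrubbed = scrub_labels(raw)
print(scrubbed)
```

Running this before the warehouse write means the raw values never land in storage, which is the point of scrubbing upstream.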
When to invest in this
The investment threshold is volume and team count. If alert volume is above 50 a week or you run more than 3 on-call rotations, build the pipeline (the ROI is the alerts you retire). Smaller teams can use PagerDuty Insights or an equivalent until volume justifies a custom warehouse. Don’t analyse alerts in spreadsheets: keeping a sheet current takes more work than wiring up a webhook.
- More than 50 alerts/week or more than 3 rotations. Build the pipeline; the ROI is the retired alerts.
- Below: vendor analytics. PagerDuty Insights or equivalent; the lower-cost path.
- No spreadsheets. The work to keep a sheet current exceeds the work to wire a webhook.
- Annual build-vs-buy review. Revisit the threshold each year so the choice keeps fitting the organisation.