Pre-Paging Context Loading

Context loaded before the on-call sees the page.

The idea

By the time a human is paged, the system already has 90 seconds of context: what changed recently, which other alerts fired, and which dashboards are relevant. Pre-paging context loading attaches that data to the alert payload before the page goes out. That saves 2-5 minutes of triage per incident, and compounded over a year, that’s days of on-call time recovered.

90 seconds of latent context. Recent changes, related alerts, dashboards; available before the page.
Attach before page-out. The alert payload carries the context the human needs.
2-5 minutes saved per incident. Triage starts with data, not investigation.
Days per year recovered. Compounded; the cost of building pays back fast.
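
The "days per year" figure depends on page volume, which the numbers above do not state. As a rough sketch, assuming a hypothetical rotation that handles 500 pages a year, the arithmetic looks like this:

```python
def hours_recovered(pages_per_year, minutes_saved_per_page):
    """Total on-call hours saved per year for a given per-page saving."""
    return pages_per_year * minutes_saved_per_page / 60


# Assumption for illustration only: 500 pages per year is not from the text.
PAGES_PER_YEAR = 500

low = hours_recovered(PAGES_PER_YEAR, 2)   # lower bound: 2 minutes saved
high = hours_recovered(PAGES_PER_YEAR, 5)  # upper bound: 5 minutes saved

print(f"{low:.1f} to {high:.1f} hours per year")
```

At that assumed volume the saving is roughly 17 to 42 hours, or about two to five 8-hour working days. Substitute your own page count to see where your rotation lands.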

What to pre-load

Three categories of context cover most incidents. First, recent deploys for the affected service, taken from Argo CD events and GitHub Actions runs, because 80% of incidents follow a deploy. Second, related alerts within the last 15 minutes, with the firing signals clustered so the on-call sees the full picture. Third, impact data: top affected endpoints, top affected customers, and current load, all derivable from APM data.

Recent deploys. Argo CD events, GitHub Actions runs; 80% of incidents follow a deploy.
Related alerts (15 min). Cluster firing signals; the on-call sees the full picture.
Top affected endpoints, customers, load. All derivable from APM data; pre-render into the payload.
Per-service context plan. Each service has its own context list, so enrichment stays targeted.
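
One way to represent a per-service context plan is a small lookup table. This is a minimal sketch, not a prescribed schema: the `ContextPlan` class, the source names, and the example services are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical source names for the three categories described above.
DEFAULT_SOURCES = ("recent_deploys", "related_alerts", "apm_summary")


@dataclass(frozen=True)
class ContextPlan:
    """Which context sources to query for one service (illustrative schema)."""
    service: str
    sources: Tuple[str, ...] = DEFAULT_SOURCES
    related_alert_window_min: int = 15  # look-back window for related alerts


# Hypothetical plans; a service without an entry gets the default plan.
PLANS = {
    "checkout": ContextPlan("checkout"),
    "batch-reports": ContextPlan("batch-reports", sources=("recent_deploys",)),
}


def plan_for(service: str) -> ContextPlan:
    """Return the plan for a service, falling back to the default sources."""
    return PLANS.get(service, ContextPlan(service))
```

Keeping the plan as data rather than code makes it cheap to add or remove a source for a single service.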

How to load

The mechanics are simple. A PagerDuty webhook triggers a Lambda or Cloud Run job. The job queries Datadog, Argo CD, and the service catalog, then posts the results back to the alert payload. The latency target is 30 seconds, because anything slower means the human reaches the alert before the context arrives. Cache aggressively, because most incidents share context within a 5-minute window.

PagerDuty webhook into Lambda. Or Cloud Run; the enrichment job runs out-of-band.
Multi-source query. Datadog, Argo CD, service catalog; the job assembles the context.
30-second latency target. Slower means the human reaches the alert before the context arrives.
Aggressive caching. 5-minute window shares context; one query per service per minute is enough.
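
The core of the enrichment job could be sketched as below, using only the Python standard library. The `fetchers` argument is a hypothetical mapping from source name to a callable that takes a service name; in practice each callable would wrap a Datadog, Argo CD, or service catalog client, none of which is shown here. The 60-second cache lifetime is one reading of "one query per service per minute".

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

CACHE_TTL_S = 60   # at most one query per service and source per minute
DEADLINE_S = 30    # latency target from the text

_cache = {}  # (service, source) -> (fetched_at, value)


def cached(service, source, fetch, now=time.monotonic):
    """Return a cached value if it is younger than CACHE_TTL_S, else fetch it."""
    key = (service, source)
    hit = _cache.get(key)
    if hit is not None and now() - hit[0] < CACHE_TTL_S:
        return hit[1]
    value = fetch(service)
    _cache[key] = (now(), value)
    return value


def enrich(service, fetchers, deadline_s=DEADLINE_S):
    """Query all sources in parallel; keep whatever succeeds before the deadline."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(fetchers)))
    futures = {
        pool.submit(cached, service, name, fetch): name
        for name, fetch in fetchers.items()
    }
    done, _late = wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False)  # do not block the page on slow sources
    context = {}
    for future in done:
        if future.exception() is None:  # a failed source is simply omitted
            context[futures[future]] = future.result()
    return context
```

A source that is slow or raises is left out of the result rather than delaying the page, so the caller always gets a partial context within the deadline.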

When it fails

Three failure modes deserve mitigation. Stale data: a deploy lookup that is 30 minutes behind is worse than no data, because the on-call trusts wrong information. Too much data: 40 lines of context is unreadable on a phone, so cap the payload at 5 facts. Vendor outages: if Datadog is down, pre-loading fails, so fall back to a basic page rather than blocking.

Stale data worse than none. 30-minute-behind deploy lookup; on-call trusts wrong info.
Cap at 5 facts. 40 lines is unreadable on a phone; the cap protects the page.
Vendor outage fallback. If Datadog is down, fall back to basic page; don’t block on context.
Per-failure mitigation policy. Each failure mode has a documented response, so paging keeps working when enrichment does not.
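
The three mitigations can be combined in one small guard in front of the page. This is a sketch under assumptions: the `Fact` type and its `priority` field are hypothetical, and the 5-minute freshness limit is an assumed threshold, since the text only says that 30 minutes is too stale.

```python
from dataclasses import dataclass

MAX_FACTS = 5        # cap from the text
MAX_AGE_S = 5 * 60   # assumed freshness limit, not stated in the text


@dataclass
class Fact:
    text: str
    observed_at: float  # Unix timestamp at which the source data was current
    priority: int = 0   # lower values are shown first (hypothetical ranking)


def build_page(summary, facts, now):
    """Drop stale facts, keep the top MAX_FACTS, and render one fact per line."""
    fresh = [f for f in facts if now - f.observed_at <= MAX_AGE_S]
    fresh.sort(key=lambda f: f.priority)
    return "\n".join([summary] + [f.text for f in fresh[:MAX_FACTS]])


def page_with_fallback(summary, load_facts, now):
    """Send the basic page if the context lookup fails for any reason."""
    try:
        facts = load_facts()
    except Exception:
        facts = []  # vendor outage: page without context rather than block
    return build_page(summary, facts, now)
```

Dropping a stale fact entirely, rather than labeling it as old, follows the point above that wrong information is worse than none.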

Get started

The starter ramp is concrete. Pick your top 3 services and build a simple webhook that adds “recent deploys” to the alert payload. Measure MTTA and MTTR before and after, with a target of a 30-second drop in median triage time. Then iterate per service, because adding context to all services at once is over-investment.

Top 3 services first. Highest leverage; the best starting point.
Simple webhook. “Recent deploys” in the payload; the smallest useful enrichment.
30-second median triage drop. The measurable target; before-and-after MTTA and MTTR show the effect.
Per-service iteration. Pick by page volume; add context where it pays back.
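
The before-and-after comparison can be as simple as the sketch below. It assumes you can export per-incident triage durations in seconds from your incident tooling; the sample numbers are made up for illustration and are not measurements.

```python
from statistics import median

TARGET_DROP_S = 30  # target from the text


def median_drop(before_s, after_s):
    """Drop in median triage time, in seconds (positive means improvement)."""
    return median(before_s) - median(after_s)


def meets_target(before_s, after_s, target_s=TARGET_DROP_S):
    """True if the median triage time fell by at least the target."""
    return median_drop(before_s, after_s) >= target_s


# Illustrative triage durations in seconds; not real data.
before = [240, 300, 420, 180, 360]
after = [200, 260, 380, 150, 330]

print(median_drop(before, after), meets_target(before, after))
```

Using the median rather than the mean keeps one unusually long incident from hiding or exaggerating the effect.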