Alert Template Discipline
Alert messages should be informative. A shared template makes that the default rather than a matter of individual effort.
Why templates
Hand-rolled alerts vary in quality. Templates encode the team standard for required labels, runbook link, dashboard link, and severity rules; the discipline reduces review burden because the reviewer checks that the template was used, not whether every field is correct.
- Encode team standard. Required labels, runbook link, dashboard link, severity rules; the convention lives in code, not in heads.
- Reduce review burden. Reviewer checks the template was used, not whether each field is correct; faster review cycle.
- Cheap audits. Linter on the template fields is a 1-line check; the audit cost is fixed regardless of alert count.
- Per-team consistency. Templates enforce one alert shape across the team, so responders from other teams see a familiar shape during cross-team incidents.
What a template includes
A useful template is opinionated about what every alert must carry. It requires labels for ownership and routing, annotations for human-readable context, and default routing rules so the alert reaches the right place without extra configuration.
- Required labels. owner_team, severity, runbook_url, dashboard_url, service, environment.
- Required annotations. summary (under 80 chars), description (under 500 chars), impact (one sentence).
- Default routing rules. Sev1 to PagerDuty, sev2 to PagerDuty during business hours, sev3 to Slack only.
- Per-template version pin. Templates are versioned so changes can be rolled out gradually; this supports safe template evolution.
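The contract above can be sketched as a small validator. This is a minimal illustration, not a definitive implementation: the alert is assumed to be a plain dict with "labels" and "annotations" keys, and the names validate_alert, default_route, and the route identifiers are hypothetical. The "one sentence" rule for impact is not machine-checked here; only its presence is.

```python
# Labels and annotation limits taken from the list above.
REQUIRED_LABELS = (
    "owner_team", "severity", "runbook_url",
    "dashboard_url", "service", "environment",
)
# "Under N chars" is read as a strict upper bound.
ANNOTATION_LIMITS = {"summary": 80, "description": 500}
# Hypothetical route names standing in for real receiver config.
DEFAULT_ROUTES = {
    "sev1": "pagerduty",
    "sev2": "pagerduty-business-hours",
    "sev3": "slack",
}


def validate_alert(alert: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the alert passes."""
    errors = []
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})

    for name in REQUIRED_LABELS:
        if not labels.get(name):
            errors.append(f"missing label: {name}")

    for name, limit in ANNOTATION_LIMITS.items():
        value = annotations.get(name, "")
        if not value:
            errors.append(f"missing annotation: {name}")
        elif len(value) >= limit:
            errors.append(f"annotation {name} must be under {limit} chars")

    # Only presence is checked for impact; sentence count is left to review.
    if not annotations.get("impact"):
        errors.append("missing annotation: impact")

    severity = labels.get("severity")
    if severity and severity not in DEFAULT_ROUTES:
        errors.append(f"unknown severity: {severity}")
    return errors


def default_route(alert: dict) -> str:
    """Pick the default destination from the severity label."""
    return DEFAULT_ROUTES[alert["labels"]["severity"]]
```

Returning a list of violations rather than raising on the first one lets a reviewer or CI job see every missing field in a single pass.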
Per-category variants
One template cannot cover every alert shape. Per-category variants for latency, error-rate, and saturation each take the inputs they need and emit the right query; the variants share the common label and annotation contract while specialising the query body.
- Latency alert template. Takes service name, threshold, time window; generates the SLO burn-rate query.
- Error-rate template. Takes service name, threshold, comparison window; generates the percentage-over-baseline rule.
- Saturation template. Takes resource, threshold, sustained-duration; generates the queue-depth or wait-time alert.
- Per-variant common contract. All variants emit the same labels and annotations; the routing layer treats them uniformly.
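One way to picture the variants is as three functions that share a single rule builder. The sketch below is illustrative only: the metric names (http_request_duration_seconds_bucket, http_requests_errors_total, queue_depth), the fixed 5m rate window, and the function names are assumptions, and the latency expression is a simplified p99 threshold standing in for a real SLO burn-rate query.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CommonFields:
    """Labels that every variant emits (the shared contract)."""
    service: str
    owner_team: str
    environment: str
    severity: str
    runbook_url: str
    dashboard_url: str


def _rule(name, expr, hold_for, common, summary, impact):
    """Assemble a rule dict; only name, query, and wording differ per variant."""
    return {
        "alert": name,
        "expr": expr,
        "for": hold_for,
        "labels": asdict(common),
        "annotations": {
            "summary": summary,
            "description": f"{name} fired for {common.service} in {common.environment}.",
            "impact": impact,
        },
    }


def latency_alert(common, threshold_seconds, window):
    # Simplified stand-in for a burn-rate query (hypothetical metric name).
    selector = f'{{service="{common.service}"}}'
    expr = (
        "histogram_quantile(0.99, sum by (le) (rate("
        f"http_request_duration_seconds_bucket{selector}[{window}]))) > {threshold_seconds}"
    )
    return _rule("LatencyHigh", expr, window, common,
                 f"p99 latency for {common.service} above {threshold_seconds}s",
                 "Requests are slower than the agreed target.")


def error_rate_alert(common, threshold_pct, comparison_window):
    # Current error rate compared against the same rate one comparison window ago.
    selector = f'{{service="{common.service}"}}'
    current = f"sum(rate(http_requests_errors_total{selector}[5m]))"
    baseline = f"sum(rate(http_requests_errors_total{selector}[5m] offset {comparison_window}))"
    factor = 1 + threshold_pct / 100
    return _rule("ErrorRateOverBaseline", f"{current} > {factor} * {baseline}", "5m", common,
                 f"Error rate for {common.service} is {threshold_pct}% over baseline",
                 "A larger share of requests is failing than usual.")


def saturation_alert(common, resource, threshold, sustained):
    # Queue-depth form; a wait-time variant would swap the metric.
    expr = f'queue_depth{{service="{common.service}",resource="{resource}"}} > {threshold}'
    return _rule("SaturationHigh", expr, sustained, common,
                 f"{resource} on {common.service} above {threshold} for {sustained}",
                 "Work is queuing faster than it is processed.")
```

Because every variant goes through _rule, adding a new required label means changing CommonFields once, and the routing layer never needs to know which variant produced a rule.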
How to enforce the template
Templates only matter if they are enforced. A Helm chart, Terraform module, or Jsonnet library wraps the template; authors call the function rather than write raw alert config. The linter rejects raw config that bypasses the wrapper, and the PR template asks which template was used and why.
- Wrapper library. Helm chart, Terraform module, or Jsonnet library wraps the template; authors call the function instead of writing raw alert config.
- Linter rejection. Raw alert config that doesn’t go through the wrapper is rejected at PR time.
- PR template question. "Which alert template did you use, and why?" The discipline lives in the review checklist.
- Per-team approval for custom. Custom alerts allowed but need explicit approval; the friction protects the standard.
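A PR-time linter for this could look roughly like the following. It is a sketch under stated assumptions: the wrapper is assumed to stamp each generated rule with template and template_version labels, and approved custom alerts are assumed to carry template set to "custom" plus an approved_by label. None of these label names come from a real tool; they are hypothetical conventions chosen for illustration.

```python
# Template names the wrapper library is assumed to know about.
KNOWN_TEMPLATES = {"latency", "error_rate", "saturation"}


def lint_rules(rules: list[dict]) -> list[str]:
    """Return one violation message per rule that bypasses the wrapper."""
    violations = []
    for rule in rules:
        name = rule.get("alert", "<unnamed>")
        labels = rule.get("labels", {})
        template = labels.get("template")

        if template in KNOWN_TEMPLATES:
            # Generated rules must also pin the template version.
            if not labels.get("template_version"):
                violations.append(f"{name}: template version pin missing")
            continue

        # Custom alerts pass only with explicit approval recorded.
        if template == "custom" and labels.get("approved_by"):
            continue

        violations.append(
            f"{name}: raw alert config; use a template or get custom approval"
        )
    return violations
```

In CI, a non-empty result would fail the check and the messages would be posted on the PR, which keeps the approval path for custom alerts visible rather than silent.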
Start small
Template adoption is a long game. Pick the 3 most common alert shapes and build templates for those first; don’t try to template every edge case on day one. Adoption percentage is the leading indicator: under 80% after 6 months means the templates are too rigid.
- Top 3 alert shapes first. Cover the most common cases; the long tail can wait.
- Custom still allowed. Edge cases get custom alerts with explicit approval; templates do not block legitimate exceptions.
- Adoption as leading indicator. Under 80% template usage after 6 months means the templates are too rigid; refine them.
- Per-quarter template review. Templates evolve as the alerting practice evolves; supports continuous improvement.
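The adoption indicator is simple arithmetic and can be computed from the rule set itself. The sketch below assumes each templated rule carries a template label naming a known template (a hypothetical convention); the 80% and 6-month figures are the ones stated above.

```python
# Hypothetical set of template names used to recognise templated rules.
KNOWN_TEMPLATES = {"latency", "error_rate", "saturation"}
ADOPTION_TARGET = 0.80   # from the guideline above
REVIEW_AFTER_MONTHS = 6  # from the guideline above


def adoption_rate(rules: list[dict]) -> float:
    """Fraction of rules generated from a known template (0.0 for an empty set)."""
    if not rules:
        return 0.0
    templated = sum(
        1 for rule in rules
        if rule.get("labels", {}).get("template") in KNOWN_TEMPLATES
    )
    return templated / len(rules)


def templates_look_too_rigid(rate: float, months_since_rollout: int) -> bool:
    """True when adoption is still under target once the review period has passed."""
    return months_since_rollout >= REVIEW_AFTER_MONTHS and rate < ADOPTION_TARGET
```

For example, 7 templated rules out of 10 gives a rate of 0.7, which would flag the templates for refinement at the 6-month mark but not at month 3.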