Wiring an SRE Agent into PagerDuty

Webhooks in. Acknowledgements out. The integration code, the auth pattern, the retry policy, and the bug that took us six weeks to find.

Webhooks in

PagerDuty fires a webhook on incident creation. The agent service receives it, verifies the signature, and extracts the incident details. Signature verification is non-negotiable: without it, anyone who finds the endpoint can post forged events or flood it, which makes it a denial-of-service surface. The agent extracts five fields (incident id, urgency, description, service id, escalation policy), which cover most agent use cases.

Webhook on incident creation. Agent service receives; verifies signature; extracts details.
Signature verification non-negotiable. Without it, endpoint is a DoS surface.
Five fields cover most cases. Incident id, urgency, description, service id, escalation policy.
Per-event idempotency. Webhook delivery may retry; the agent must handle duplicate deliveries without triaging twice.
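
The receive path above can be sketched as follows. This is a minimal illustration, not the production handler. It assumes the signature arrives in an `X-PagerDuty-Signature` header as one or more comma-separated `v1=<hex HMAC-SHA256>` values, and that the five fields sit under `event.data` in the payload. Both the header format and the field paths are assumptions to check against PagerDuty's webhook documentation. `WebhookHandler` and its in-memory set of seen event ids are hypothetical names for illustration.

```python
import hashlib
import hmac
import json

SIGNATURE_PREFIX = "v1="  # assumed scheme prefix in the signature header


def verify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    """Return True if any signature in the header matches the raw body bytes."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    candidates = [part.strip() for part in header_value.split(",")]
    return any(
        c.startswith(SIGNATURE_PREFIX)
        and hmac.compare_digest(c[len(SIGNATURE_PREFIX):].encode(), expected.encode())
        for c in candidates
    )


def extract_fields(event: dict) -> dict:
    """Pull the five fields; the payload paths used here are assumptions."""
    data = event["data"]
    return {
        "incident_id": data["id"],
        "urgency": data.get("urgency"),
        "description": data.get("title"),
        "service_id": data.get("service", {}).get("id"),
        "escalation_policy_id": data.get("escalation_policy", {}).get("id"),
    }


class WebhookHandler:
    """Hypothetical handler: verify first, then dedupe, then extract."""

    def __init__(self, secret: str):
        self.secret = secret
        self.seen_event_ids = set()  # in-memory only; a real service would persist this

    def handle(self, raw_body: bytes, signature_header: str):
        # Verify against the untouched bytes before parsing anything.
        if not verify_signature(raw_body, signature_header, self.secret):
            return ("rejected", None)
        event = json.loads(raw_body)["event"]
        if event["id"] in self.seen_event_ids:
            return ("duplicate", None)  # redelivery: do not triage twice
        self.seen_event_ids.add(event["id"])
        return ("accepted", extract_fields(event))
```

The handler parses JSON only after the signature check passes, so unauthenticated input never reaches the parser.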

Acknowledgements out

The agent acknowledges incidents and adds notes through the PagerDuty API. When the agent starts triaging, it acknowledges the incident in PagerDuty; this is silent to the team, and the agent’s name appears as the acknowledger. When the agent finishes, it adds a note with its hypothesis, which appears in the incident timeline. If the agent escalates, it does not auto-resolve; the human takes over.

Ack on triage start. Silent to team; agent named as acknowledger.
Note with hypothesis on finish. Appears in incident timeline; the agent’s output.
No auto-resolve on escalation. Human takes over and resolves; the agent yields.
Per-action audit log. Each PagerDuty mutation logged; supports investigation.
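
A sketch of the two outbound calls, with an audit line per mutation. The endpoints (`PUT /incidents/{id}`, `POST /incidents/{id}/notes`), the `Token token=` authorization scheme, and the `From` header reflect our understanding of the PagerDuty REST API and should be checked against the API reference before use. The function names and the logger name are hypothetical. The builders only construct requests; nothing is sent until `send` is called.

```python
import json
import logging
import urllib.request

API_BASE = "https://api.pagerduty.com"
audit_log = logging.getLogger("agent.pagerduty.audit")  # hypothetical logger name


def _request(method, path, payload, api_key, from_email):
    """Build (but do not send) an authenticated JSON request."""
    headers = {
        "Authorization": f"Token token={api_key}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        "From": from_email,  # the identity that shows up as the actor
    }
    body = json.dumps(payload).encode()
    return urllib.request.Request(API_BASE + path, data=body, headers=headers, method=method)


def build_ack(incident_id, api_key, from_email):
    """Acknowledge on triage start. Never sets status to 'resolved'."""
    payload = {"incident": {"type": "incident_reference", "status": "acknowledged"}}
    return _request("PUT", f"/incidents/{incident_id}", payload, api_key, from_email)


def build_note(incident_id, hypothesis, api_key, from_email):
    """Attach the agent's hypothesis to the incident timeline."""
    payload = {"note": {"content": hypothesis}}
    return _request("POST", f"/incidents/{incident_id}/notes", payload, api_key, from_email)


def send(request, opener=urllib.request.urlopen):
    """Log every mutation before sending it, then return the HTTP status."""
    audit_log.info("pagerduty %s %s", request.get_method(), request.full_url)
    with opener(request, timeout=10) as response:
        return response.status
```

There is deliberately no `build_resolve` here: on escalation the agent stops calling the API and the human resolves the incident.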

Auth pattern

The auth model is conservative. Each environment gets its own API key, rotated quarterly and never shared across environments. Scopes are the minimum required; “acknowledge incidents and add notes” is enough for most agents. The audit log stays enabled, because PagerDuty’s own audit log shows what the agent did and is useful for post-incident review.

Per-environment API key. Rotated quarterly; not shared across environments.
Minimum scope. “Acknowledge plus note” covers most agents.
Audit log enabled. Shows what the agent did; supports post-incident review.
Per-key rotation cadence. Quarterly; documented; supports compliance.
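
One way to enforce the per-environment rule in code is to make key lookup fail loudly instead of falling back to another environment's key. The sketch below assumes keys are supplied as environment variables named `PAGERDUTY_API_KEY_<ENV>` and approximates "quarterly" as 90 days. Both the naming scheme and the 90-day figure are illustrative assumptions, not something PagerDuty prescribes.

```python
import os
from datetime import date, timedelta

ROTATION_PERIOD = timedelta(days=90)  # "quarterly", approximated as 90 days


def load_api_key(environment, environ=os.environ):
    """Read the key for exactly one environment; never fall back to another."""
    name = f"PAGERDUTY_API_KEY_{environment.upper()}"  # hypothetical naming scheme
    key = environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; refusing to use another environment's key")
    return key


def rotation_overdue(last_rotated, today=None):
    """True when the key is older than the documented rotation period."""
    today = today or date.today()
    return today - last_rotated > ROTATION_PERIOD
```

A startup check that calls `rotation_overdue` and logs a warning is usually enough to keep the cadence honest.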

Retry policy

The retry policy distinguishes idempotent from non-idempotent operations. Idempotent operations retry up to 3 times with exponential backoff. Non-idempotent operations, such as creating notes, retry once with a deduplication token, which prevents double notes on transient failures. A hard cap of 5 retries total applies per webhook; beyond that, the agent logs the failure and stops, because its job is best-effort and PagerDuty holds its own source of truth.

Idempotent: 3 retries with backoff. Standard pattern; safe to repeat.
Non-idempotent: 1 retry with dedup token. Prevents double-notes on transient failures.
5-retry hard cap. Beyond that, log and stop; PagerDuty is source of truth.
Per-operation classification. Each operation tagged idempotent or not; supports correct retry.
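
The policy can be sketched as a small retry helper plus a per-webhook budget. All names here (`TransientError`, `RetryBudget`, `call_with_retry`) are hypothetical, and the 0.5 second base delay is an arbitrary choice. How the dedup token is honoured, whether by the remote API or by the agent checking existing notes before writing, is left open; the sketch only guarantees that every attempt of one operation carries the same token.

```python
import time
import uuid

MAX_RETRIES = {"idempotent": 3, "non_idempotent": 1}
WEBHOOK_RETRY_CAP = 5  # total retries allowed across one webhook's operations


class TransientError(Exception):
    """Raised by an operation when a failure is worth retrying."""


class RetryBudget:
    """Counts retries across all operations triggered by one webhook."""

    def __init__(self, cap=WEBHOOK_RETRY_CAP):
        self.remaining = cap

    def take(self):
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def call_with_retry(operation, kind, budget, sleep=time.sleep, base_delay=0.5):
    """Run operation(dedup_token), retrying per its classification.

    kind is "idempotent" or "non_idempotent". Re-raises once the
    per-operation limit or the per-webhook budget is exhausted, so the
    caller can log the failure and stop.
    """
    dedup_token = uuid.uuid4().hex if kind == "non_idempotent" else None
    retries = 0
    while True:
        try:
            return operation(dedup_token)
        except TransientError:
            if retries >= MAX_RETRIES[kind] or not budget.take():
                raise
            sleep(base_delay * 2 ** retries)  # exponential backoff
            retries += 1
```

One `RetryBudget` is created per incoming webhook and passed to every call made while handling it, so the 5-retry cap holds even when several operations fail.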

The bug we hunted for six weeks

The lesson came from a six-week debugging session. Webhook signatures occasionally appeared invalid: the error rate was 0.3%, not enough to alert but enough to lose runs. The root cause was that PagerDuty’s webhook payload contains a timestamp, and we were stripping trailing whitespace from the body before verifying, while the signature was computed on the original body. Lesson: when verifying webhooks, do not normalise the body; verify exactly the bytes you received.

0.3% silent error. Not enough to alert; enough to lose runs.
Whitespace strip broke signature. Signature was on original body; normalised body didn’t match.
Don’t normalise the body. Verify the exact bytes received; the discipline.
Per-webhook byte-perfect verification. Three-line fix; six-week diagnosis; the lesson compounds.
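
The failure mode is easy to reproduce in isolation. The sketch below uses a generic HMAC-SHA256 signature and a made-up secret and payload; it is not the original code, only a demonstration that removing a single trailing byte before verification invalidates an otherwise correct signature.

```python
import hashlib
import hmac


def sign(body: bytes, secret: bytes) -> str:
    """Hex HMAC-SHA256 of the body, as the sender would compute it."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(body: bytes, signature: str, secret: bytes) -> bool:
    """Constant-time comparison against the signature for these exact bytes."""
    return hmac.compare_digest(sign(body, secret), signature)


secret = b"example-secret"                  # made-up value for illustration
received = b'{"event": {"id": "01ABC"}}\n'  # the sender's bytes end in a newline
signature = sign(received, secret)          # computed over the original bytes

buggy_ok = verify(received.strip(), signature, secret)  # normalised body
fixed_ok = verify(received, signature, secret)          # exact bytes received

print("stripped body verifies:", buggy_ok)
print("raw body verifies:", fixed_ok)
```

In a web framework this means reading the raw request bytes for verification and parsing JSON only afterwards, never re-serialising or trimming first.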