Wiring an SRE Agent into PagerDuty

Webhooks in. Acknowledgements out. The integration code, the auth pattern, the retry policy, and the bug that took us six weeks to find.

Webhooks in

PagerDuty fires a webhook on incident creation. The agent service receives it, verifies the signature, and extracts the incident details. Signature verification is non-negotiable: without it, anyone who discovers the endpoint can post forged incidents, which makes it a denial-of-service surface. We extract five fields (incident id, urgency, description, service id, and escalation policy) because they cover most agent use cases.
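A minimal sketch of the receive path, using only the standard library. It assumes the signature arrives as a comma-separated list of `v1=<hex>` values (an HMAC-SHA256 of the raw body, sent in an `X-PagerDuty-Signature` header) and that the incident sits under `event.data` in the payload. The field paths, including the mapping of "description" to `title`, are assumptions to check against the payloads you actually receive, and the function names are illustrative.

```python
import hashlib
import hmac
import json


def verify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    """Return True if any v1 signature in the header matches the raw body."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    for candidate in header_value.split(","):
        version, _, digest = candidate.strip().partition("=")
        # Compare as bytes in constant time; bytes also tolerate odd header input.
        if version == "v1" and hmac.compare_digest(digest.encode(), expected.encode()):
            return True
    return False


def extract_incident(raw_body: bytes) -> dict:
    """Pull out the five fields the agent needs (paths are assumed, see above)."""
    data = json.loads(raw_body)["event"]["data"]
    return {
        "incident_id": data["id"],
        "urgency": data.get("urgency"),
        "description": data.get("title"),
        "service_id": (data.get("service") or {}).get("id"),
        "escalation_policy_id": (data.get("escalation_policy") or {}).get("id"),
    }
```

Call `verify_signature` on the bytes exactly as they came off the wire, and only parse the body with `extract_incident` once verification has passed.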

Acknowledgements out

The agent talks back through the PagerDuty API. When it starts triaging, it acknowledges the incident; the acknowledgement is silent to the team, and the agent’s name appears as the acknowledger. When it finishes, it adds a note with its hypothesis, which appears in the incident timeline. If the agent escalates, it does not auto-resolve; the human takes over.
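The two outbound calls can be sketched as request builders. This assumes the PagerDuty REST API v2 conventions (a `PUT` on the incident to set `status` to `acknowledged`, a `POST` to the incident's `notes` collection, `Token token=` authorization, and a `From` header naming the acting user). The helper names are hypothetical, and the builders only construct requests so they can be inspected without touching the network.

```python
import json
import urllib.request

API_BASE = "https://api.pagerduty.com"


def _request(method: str, path: str, payload: dict, api_key: str, from_email: str):
    """Build (but do not send) an authenticated JSON request."""
    return urllib.request.Request(
        API_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={
            "Authorization": f"Token token={api_key}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
            "From": from_email,  # the agent's own user, so it shows as acknowledger
        },
    )


def acknowledge_request(incident_id: str, api_key: str, from_email: str):
    """Request that marks the incident acknowledged when triage starts."""
    payload = {"incident": {"type": "incident_reference", "status": "acknowledged"}}
    return _request("PUT", f"/incidents/{incident_id}", payload, api_key, from_email)


def note_request(incident_id: str, hypothesis: str, api_key: str, from_email: str):
    """Request that records the agent's hypothesis on the incident timeline."""
    payload = {"note": {"content": hypothesis}}
    return _request("POST", f"/incidents/{incident_id}/notes", payload, api_key, from_email)
```

Sending is then a matter of `urllib.request.urlopen(req, timeout=10)`. There is deliberately no builder for resolving an incident, which keeps the "never auto-resolve" rule structural rather than a matter of discipline.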

Auth pattern

The auth model is conservative. Each environment gets its own API key, rotated quarterly and never shared across environments. Scopes are the minimum required; “acknowledge incidents and add notes” is enough for most agents. The audit log is enabled, because PagerDuty’s own audit log shows what the agent did and is useful for post-incident review.
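One way to enforce the "no sharing across environments" rule in code is to make the key lookup environment-specific and have it fail loudly instead of falling back. The sketch below assumes keys are supplied as environment variables; the variable naming scheme (`PAGERDUTY_API_KEY_<ENV>`) and the function name are hypothetical.

```python
import os


class MissingApiKey(RuntimeError):
    """Raised when the key for the requested environment is not configured."""


def load_api_key(environment: str, env=os.environ) -> str:
    """Return the key for exactly this environment, with no fallback."""
    name = f"PAGERDUTY_API_KEY_{environment.upper()}"
    key = env.get(name)
    if not key:
        # Failing here is the point: never borrow another environment's key.
        raise MissingApiKey(f"{name} is not set")
    return key
```

Because there is no default, a staging deployment that is missing its key cannot quietly end up acting on production incidents.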

Retry policy

The retry policy distinguishes idempotent from non-idempotent operations. Idempotent operations retry up to 3 times with exponential backoff. Non-idempotent operations, such as creating notes, retry once with a deduplication token, which prevents double notes on transient failures. There is a hard cap of 5 retries in total per webhook; beyond that, log the failure and stop, because the agent’s job is best-effort and PagerDuty holds its own source of truth.
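The policy above can be sketched as a small wrapper plus a per-webhook budget. All names here (`TransientError`, `RetryBudget`, `call_with_retry`) are hypothetical, the base delay is an arbitrary choice, and how the deduplication token is honoured (for example, by checking existing notes for it before posting) is left to the operation being wrapped.

```python
import time
import uuid
from dataclasses import dataclass


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx)."""


@dataclass
class RetryBudget:
    """Retries shared by every operation triggered by one webhook."""
    remaining: int = 5

    def take(self) -> bool:
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def call_with_retry(op, *, idempotent, budget, base_delay=0.5, sleep=time.sleep):
    """Run op(token), retrying transient failures within the policy limits."""
    max_retries = 3 if idempotent else 1
    # One token for all attempts, so a repeated note can be recognised as a duplicate.
    token = None if idempotent else str(uuid.uuid4())
    attempt = 0
    while True:
        try:
            return op(token)
        except TransientError:
            # Stop when this operation or the whole webhook is out of retries.
            if attempt >= max_retries or not budget.take():
                raise
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 0.5s, 1s, 2s
            attempt += 1
```

Create one `RetryBudget` per incoming webhook and pass it to every call made on that webhook's behalf. When `call_with_retry` finally re-raises, the caller logs the failure and stops.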

The bug we hunted for six weeks

The lesson came from a six-week debugging session. Webhook signatures occasionally appeared invalid: the error rate was 0.3%, not enough to alert but enough to lose runs. The root cause was that PagerDuty’s webhook payload contains a timestamp, and we were stripping trailing whitespace from the body before verifying, while the signature was computed on the original body. Lesson: when verifying webhooks, do not normalise the body; verify exactly the bytes you received.
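The failure mode is easy to reproduce in isolation. The sketch below uses a made-up secret and payload, and assumes an HMAC-SHA256 signature over the body; it signs a body that ends in a newline, then verifies it once as received and once after stripping trailing whitespace.

```python
import hashlib
import hmac


def sign(body: bytes, secret: bytes) -> str:
    """HMAC-SHA256 of the body, hex-encoded."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


secret = b"example-secret"
# Body as the sender produced it, including a trailing newline.
received = b'{"event": {"occurred_at": "2024-01-01T00:00:00Z"}}\n'
sender_signature = sign(received, secret)

# Verifying the exact bytes received succeeds.
matches_raw = hmac.compare_digest(sign(received, secret), sender_signature)
# Verifying a "tidied" body fails, even though the JSON is equivalent.
matches_stripped = hmac.compare_digest(sign(received.rstrip(), secret), sender_signature)

print(matches_raw, matches_stripped)  # prints: True False
```

In a web framework, the practical consequence is to read the raw request bytes first, verify them, and only then hand the body to a JSON parser or any middleware that might rewrite it.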