Webhook Reliability Patterns
Webhooks are simple to send but hard to make reliable. Four patterns form the standard answer: retries, idempotency keys, signatures, and a dead-letter queue.
Why webhooks fail
A webhook looks like a function call but runs over the public internet to a system you do not control. Every failure mode of the network, plus every failure mode of the receiver, applies.
- Network blips. TCP resets, DNS hiccups, TLS handshake timeouts; the wire is the first place to lose events.
- Receiver downtime. The receiver is in maintenance, deploying, or just slow; naive fire-and-forget drops the event.
- Processing errors. The receiver accepted the event but failed while handling it; if it had already returned a 2xx, the sender believes delivery succeeded and never retries.
- Without these patterns. Events are lost silently, there is no audit trail, and the customer reports the missing data weeks later.
Four patterns
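- 1. Retries with exponential backoff. Cover network blips and receiver downtime without hammering a receiver that is already struggling.
- 2. Idempotency keys per event. Let the receiver recognize a redelivered event and apply it once.
- 3. Signatures for authenticity. Let the receiver check that the payload came from the sender and was not altered in transit.
- 4. Dead-letter queue for unrecoverable failures. Holds events that exhausted their retries so they stay visible and can be replayed.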
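Patterns 1 and 4 can be sketched together on the sender side. This is a minimal illustration, not a reference implementation: the names `backoff_delays`, `deliver_with_retries`, and `dead_letters`, the six-attempt limit, the 1-second base, and the 300-second cap are all assumptions chosen for the example. `send` stands in for whatever HTTP call the sender makes and is assumed to return True on a 2xx response.

```python
import random
import time
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


def backoff_delays(max_attempts: int = 6, base: float = 1.0, cap: float = 300.0) -> List[float]:
    """Upper bound on the wait before each retry: base * 2**n, capped at `cap`."""
    return [min(cap, base * 2 ** n) for n in range(max_attempts - 1)]


def deliver_with_retries(
    send: Callable[[Event], bool],
    event: Event,
    dead_letters: List[Event],
    max_attempts: int = 6,
    base: float = 1.0,
    cap: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Call send(event) up to max_attempts times; park the event in dead_letters if all fail."""
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(max_attempts):
        try:
            if send(event):
                return True
        except Exception:
            # Assumption: any exception from send (timeout, reset, DNS) counts as a failed attempt.
            pass
        if attempt < len(delays):
            # Full jitter: wait a random amount up to the exponential bound,
            # so many failing deliveries do not retry in lockstep.
            sleep(rng() * delays[attempt])
    # Retries exhausted: keep the event somewhere visible instead of dropping it.
    dead_letters.append(event)
    return False
```

With the defaults, the bounds before retries are 1, 2, 4, 8, and 16 seconds. A production sender would more likely schedule retries from a persistent queue than sleep in-process, so that a restart does not lose pending attempts.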
Receiver responsibilities
The receiver's contract is short: respond fast, do work async, verify the sender, and deduplicate. Skip any one and the system frays.
- Respond fast. Acknowledge well inside the sender's timeout, ideally in under 5 seconds; senders treat slow responses as failures and retry.
- Queue the work. Long-running processing happens async on the receiver side, not in the request handler.
- Verify signature. Reject any payload whose HMAC does not match; never trust the source IP alone.
- Deduplicate. Track each event ID in a store with a TTL longer than the sender's retry window; the same event may arrive twice and must produce one effect.
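The four responsibilities above fit in a small handler. The sketch below assumes a hex-encoded HMAC-SHA256 over the raw request body, a 24-hour TTL, and an in-memory dictionary as the dedup store; none of these come from a specific provider, and `sign`, `verify`, `SeenEvents`, and `handle` are hypothetical names. The handler only enqueues the body and returns a status code, leaving the actual processing to a worker.

```python
import hashlib
import hmac
import time
from typing import Callable, Dict, List


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (the signature scheme is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Compare in constant time so the check does not leak how many characters matched."""
    return hmac.compare_digest(sign(secret, body).encode(), signature.encode())


class SeenEvents:
    """In-memory TTL store of event ids; a multi-instance receiver would need a shared store."""

    def __init__(self, ttl_seconds: float = 86400.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires: Dict[str, float] = {}

    def first_time(self, event_id: str) -> bool:
        """Return True and remember the id if it has not been seen within the TTL."""
        now = self._clock()
        # Drop ids whose TTL has passed.
        self._expires = {k: v for k, v in self._expires.items() if v > now}
        if event_id in self._expires:
            return False
        self._expires[event_id] = now + self._ttl
        return True


def handle(secret: bytes, body: bytes, signature: str, event_id: str,
           seen: SeenEvents, queue: List[bytes]) -> int:
    """Return the HTTP status the receiver would send back."""
    if not verify(secret, body, signature):
        return 401  # reject unsigned or tampered payloads
    if seen.first_time(event_id):
        queue.append(body)  # enqueue only; a worker processes it asynchronously
    # A duplicate is still acknowledged, so the sender stops retrying.
    return 200
```

Returning 200 for a duplicate is deliberate: the event was already accepted, and an error status would only trigger more retries.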
Sender discipline
The sender owns delivery state, redelivery, and the contract. Without these, debugging webhook drops becomes guesswork on both sides.
- Delivery state. Track every event's send status; expose it on a per-event dashboard for the customer.
- Redelivery API. Let customers replay specific events when their receiver was down; they will need it.
- Signature scheme. Document the HMAC algorithm and key rotation policy publicly; integrators write code against it.
- Published SLA. Concrete number, e.g. '99% delivered within 10 seconds'; track and report it.
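Delivery state and the published SLA are closely linked: once each event's creation and first-acknowledgement times are recorded, the SLA figure is a simple ratio. The sketch below assumes a hypothetical `Delivery` record with timestamps in seconds and counts undelivered events against the SLA; real systems would read these rows from a delivery log.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Delivery:
    event_id: str
    created_at: float              # when the event was generated, in seconds
    delivered_at: Optional[float]  # time of the first 2xx response, None if not yet delivered


def sla_fraction(deliveries: Iterable[Delivery], within_seconds: float = 10.0) -> float:
    """Fraction of events acknowledged within `within_seconds` of creation."""
    rows = list(deliveries)
    if not rows:
        return 1.0  # assumption: an empty window counts as meeting the SLA
    on_time = sum(
        1
        for d in rows
        if d.delivered_at is not None and d.delivered_at - d.created_at <= within_seconds
    )
    return on_time / len(rows)


rows = [
    Delivery("evt_a", 0.0, 2.0),
    Delivery("evt_b", 0.0, 10.0),
    Delivery("evt_c", 0.0, 45.0),   # delivered, but only after retries
    Delivery("evt_d", 0.0, None),   # still undelivered
]
print(f"{sla_fraction(rows):.0%} delivered within 10 seconds")
```

The same records can back the per-event dashboard and the redelivery API, since both need to know which events have no acknowledgement yet.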
Antipatterns
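- Webhook without signature. Spoofable; anyone who learns the URL can post fake events.
- Receiver doing work synchronously. The request times out, the sender retries, and the receiver handles the same event again.
- No dead-letter queue (DLQ). Events that exhaust their retries are permanently lost, with no visibility.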
What to do this week
Three moves. (1) Apply the four patterns to your highest-risk webhook path. (2) Measure the delivery failure rate before and after. (3) Document the change so the next incident responder inherits the knowledge.