Retry With Jitter

Avoid the thundering herd.

Overview

Retry with jitter adds randomness to retry timing to prevent the thundering-herd failure mode, in which every client retries at the same moment after a brief outage. Without jitter, exponential backoff still produces synchronized waves of retries that re-exhaust the recovering service, because clients that failed together compute identical delays. With full jitter, each client draws its delay uniformly at random from the whole backoff window, so retries spread out and the service can recover. AWS’s "Exponential Backoff and Jitter" essay is the canonical reference; full jitter (random across the full window) outperforms equal jitter (half the window fixed, half random) in most workloads.
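To make the difference between the strategies concrete, here is a minimal Python sketch of the three delay calculations. The function names, the `base` parameter (in seconds), and the `rng` argument are illustrative assumptions, not taken from any particular library; the equal-jitter formula follows the usual "half fixed, half random" definition.

```python
import random

def backoff_no_jitter(base, attempt):
    """Plain exponential backoff: every client computes the same delay."""
    return base * 2 ** attempt

def backoff_equal_jitter(base, attempt, rng=random):
    """Equal jitter: half the window is fixed, the other half is random."""
    window = base * 2 ** attempt
    return window / 2 + rng.uniform(0, window / 2)

def backoff_full_jitter(base, attempt, rng=random):
    """Full jitter: the delay is uniform over the whole window."""
    return rng.uniform(0, base * 2 ** attempt)

if __name__ == "__main__":
    # Five hypothetical clients that all failed together, on their third retry.
    for name, fn in [("equal", backoff_equal_jitter), ("full", backoff_full_jitter)]:
        delays = sorted(round(fn(0.1, 3), 3) for _ in range(5))
        print(name, delays)
    print("none ", [backoff_no_jitter(0.1, 3)] * 5)
```

With no jitter, all five clients land on the same instant; with equal jitter they fall in the upper half of the window; with full jitter they can land anywhere in it.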

The approach

The practical approach has five parts. Use exponential backoff with full jitter (delay = random(0, base * 2^attempt)). Cap retries at 3 to 5 per call so that a failing dependency cannot trigger unbounded retry loops. Monitor the per-service ratio of retries to original calls; a sudden spike indicates a downstream issue. Require idempotency for any retried operation, because retrying a non-idempotent operation produces duplicates. Document the per-client retry policy in the service repo so the rules are reviewable.
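The pieces above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: `call_with_retry`, `RetryStats`, the default `base` of 0.1 seconds, and the choice of `ConnectionError` and `TimeoutError` as the retryable errors are all hypothetical, and the wrapped `operation` is assumed to be idempotent.

```python
import random
import time

class RetryStats:
    """Counts original calls and retries so the ratio can be exported as a metric."""

    def __init__(self):
        self.original_calls = 0
        self.retries = 0

    def ratio(self):
        # Retries per original call; 0.0 when nothing has been called yet.
        return self.retries / self.original_calls if self.original_calls else 0.0

def call_with_retry(operation, *, max_retries=3, base=0.1,
                    retryable=(ConnectionError, TimeoutError),
                    stats=None, sleep=time.sleep, rng=random):
    """Call `operation`, retrying transient failures with full-jitter backoff.

    `operation` must be idempotent: it may run up to max_retries + 1 times.
    """
    if stats is not None:
        stats.original_calls += 1
    attempt = 0
    while True:
        try:
            return operation()
        except retryable:
            if attempt >= max_retries:
                raise  # retry budget exhausted; surface the last error
            # Full jitter: delay = random(0, base * 2^attempt)
            delay = rng.uniform(0, base * 2 ** attempt)
            if stats is not None:
                stats.retries += 1
            sleep(delay)
            attempt += 1
```

A caller would wrap a single idempotent request, for example `call_with_retry(lambda: client.get(key), stats=stats)` where `client` is whatever hypothetical client the service uses, and periodically export `stats.ratio()` to the monitoring system. Passing `sleep` and `rng` as parameters is a design choice that keeps the policy testable without real waiting.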

Why this compounds

Retry-with-jitter discipline compounds across services. Each correct retry policy preserves resilience without contributing to thundering-herd cascades; each documented policy survives team turnover; the team builds intuition for retry safety that pays off on every new client. Without the discipline, every transient outage becomes a sustained one because the retries themselves keep the service down.

Retry with jitter is a reliability practice that pays off over years. Nova AI Ops integrates with retry telemetry, surfaces retry patterns, and supports the team’s reliability discipline.