Retry With Jitter

Avoid thundering herd.

Overview

Retry with jitter adds randomness to retry timing to prevent the thundering-herd failure mode, in which every client retries at the same moment after a brief outage. Without jitter, exponential backoff still produces synchronized waves of retries that overload the recovering service again; with full jitter, retries spread across the whole backoff window and the service has room to recover. AWS’s "Exponential Backoff and Jitter" essay is the canonical reference; in its comparison, full jitter (a random delay anywhere in the window) does less total work than equal jitter, so it is a sound default for most workloads.

Avoid thundering herd. Per-retry random delay; without jitter, every client retries at the same moment and re-overloads the recovering service.
Exponential backoff. Per-retry delay doubles each attempt; the doubling lets the service breathe between waves.
Full jitter. Random delay across the full window: random(0, base * 2^attempt); matches AWS guidance for most workloads.
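Decorrelated jitter plus retry cap. Decorrelated jitter, which derives each delay from the previous delay rather than from the attempt number, is the alternative the AWS essay describes alongside full jitter; a per-call retry cap (3-5) prevents infinite loops.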
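
The delay strategies named above can be sketched as small functions. This is an illustrative sketch rather than a reference implementation: the function names, the 0.1-second base, and the 10-second cap are assumptions chosen for the example. The cap on the window follows the AWS essay's formulas and does not appear in the formula quoted above.

```python
import random


def backoff_ceiling(attempt, base=0.1, cap=10.0):
    """Upper bound of the backoff window: base * 2^attempt, clamped to cap (assumed values)."""
    return min(cap, base * (2 ** attempt))


def full_jitter(attempt, base=0.1, cap=10.0, rng=random):
    """Full jitter: a uniform delay anywhere between 0 and the window ceiling."""
    return rng.uniform(0.0, backoff_ceiling(attempt, base, cap))


def equal_jitter(attempt, base=0.1, cap=10.0, rng=random):
    """Equal jitter: half the window is fixed, the other half is random."""
    half = backoff_ceiling(attempt, base, cap) / 2.0
    return half + rng.uniform(0.0, half)


def decorrelated_jitter(previous_delay, base=0.1, cap=10.0, rng=random):
    """Decorrelated jitter: the next delay depends on the previous delay, not the attempt count."""
    return min(cap, rng.uniform(base, previous_delay * 3.0))


if __name__ == "__main__":
    # Print one sample delay per attempt for each strategy.
    previous = 0.1  # decorrelated jitter starts from the base delay
    for attempt in range(5):
        previous = decorrelated_jitter(previous)
        print(attempt, round(full_jitter(attempt), 3),
              round(equal_jitter(attempt), 3), round(previous, 3))
```

Passing a seeded random.Random instance as rng makes the delays reproducible in tests; production clients would normally leave the default.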

The approach

The practical approach has five parts: use exponential backoff with full jitter (delay = random(0, base * 2^attempt)); cap retries at 3 to 5 per call to prevent infinite loops; monitor the per-service retry-to-original-call ratio, because sudden spikes indicate downstream issues; require idempotency for any retried operation, because a retry without idempotency produces duplicates; and document the per-client retry policy in the service repo so the rules are reviewable.

Exponential with full jitter. Per-retry delay = random(0, base * 2^attempt); matches AWS guidance and prevents thundering herd.
Cap retries. Per-call 3-5 retries max; beyond that the call should fail and the system should handle the failure.
Monitor retry rate. Per-service retry-to-original ratio; sudden spikes indicate downstream issues before they become incidents.
Idempotency required plus documented policy. Per-call idempotency is mandatory for retry safety; per-client retry policy committed for operational review.
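
A minimal sketch of how these rules might fit together in one client helper follows. Everything named here is hypothetical: TransientError, RetryStats, call_with_retries, and the idea of passing an idempotency key to the operation are assumptions for illustration, not an API from any particular library. The sketch assumes the server deduplicates requests that carry the same key.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


class RetryStats:
    """Counts original calls and retries so a retry-to-original ratio can be exported."""

    def __init__(self):
        self.original_calls = 0
        self.retries = 0

    def ratio(self):
        return self.retries / self.original_calls if self.original_calls else 0.0


def call_with_retries(operation, stats, max_retries=4, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random):
    """Run operation(idempotency_key), retrying transient failures with full jitter."""
    # The same key is sent on every attempt so the server can drop duplicates.
    idempotency_key = str(uuid.uuid4())
    stats.original_calls += 1
    for attempt in range(max_retries + 1):
        try:
            return operation(idempotency_key)
        except TransientError:
            if attempt == max_retries:
                raise  # retry cap reached: surface the failure to the caller
            stats.retries += 1
            # Full jitter: uniform delay in [0, min(cap, base * 2^attempt)].
            sleep(rng.uniform(0.0, min(cap, base * 2 ** attempt)))
```

Only errors the caller has classified as transient are retried; anything else propagates immediately. The sleep and rng parameters exist so tests can run without real delays, and stats.ratio() is the kind of number that could feed the retry-rate monitoring described above.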

Why this compounds

Retry-with-jitter discipline compounds across services. Each correct retry policy preserves resilience without contributing to thundering-herd cascades; each documented policy survives team turnover; the team builds intuition for retry safety that pays off on every new client. Without the discipline, every transient outage becomes a sustained one because the retries themselves keep the service down.

Resilience. Right retry policy survives transient issues; the call succeeds on retry rather than failing entirely.
Incident response. Jitter prevents cascades; the recovering service is not killed by synchronized retry waves.
Operational fit. Right retry count for the workload; not so few that transient issues surface as failures, not so many that real failures hide behind retries.
Institutional knowledge. Each retry policy teaches client patterns; the team learns when retries help versus when they amplify the failure.

Retry with jitter is a reliability discipline that pays off over years. Nova AI Ops integrates with retry telemetry, surfaces retry patterns, and supports the team’s reliability practice.