Agentic SRE Advanced By Samson Tanimawo, PhD Published Apr 12, 2026 5 min read

Race Conditions Between Independent SRE Agents

Two agents fix the same thing, or one undoes the other's work. This post covers the locking and ordering primitives that prevent such races without bottlenecking incident response.

Examples of races

Two agents try to restart the same pod. The restarts overlap, the second can finish before the first, and the observed behaviour is incoherent.

One agent rolls back a deploy while another applies a hotfix. The hotfix is lost.

Two agents try to drain different replicas of the same service simultaneously. Both succeed; the service goes to zero capacity.

Locks for mutual exclusion

Lock by resource ARN. Each agent acquires the lock before acting on the resource and releases it only after verifying that the action took effect.

Lock TTL. Default 5 minutes; expires automatically so a crashed agent does not block forever.

Lock log. Every acquire and release is logged. Audit trail for any race-related incident.
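The three pieces above (lock by ARN, TTL, lock log) can be sketched together. This is a minimal in-memory illustration, not a production lock service: the `LockManager` class, its method names, and the event names in the log are all hypothetical, and a real deployment would need a shared store with an atomic compare-and-set so that agents in different processes see the same lock table.

```python
import time
from typing import Callable, Dict, List, Tuple


class LockManager:
    """Single-process lock table keyed by resource ARN (illustrative only)."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds  # 5-minute default from the text
        self.clock = clock              # injectable so time can be faked
        self._locks: Dict[str, Tuple[str, float]] = {}  # arn -> (owner, expiry)
        # Audit trail of (timestamp, event, arn, agent).
        self.log: List[Tuple[float, str, str, str]] = []

    def acquire(self, arn: str, agent: str) -> bool:
        now = self.clock()
        held = self._locks.get(arn)
        if held is not None and held[1] <= now:
            # The holder outlived its TTL (for example, it crashed).
            self.log.append((now, "expire", arn, held[0]))
            held = None
        if held is not None:
            self.log.append((now, "acquire_failed", arn, agent))
            return False
        self._locks[arn] = (agent, now + self.ttl_seconds)
        self.log.append((now, "acquire", arn, agent))
        return True

    def release(self, arn: str, agent: str) -> bool:
        now = self.clock()
        held = self._locks.get(arn)
        # Only the current, unexpired holder may release the lock.
        if held is None or held[0] != agent or held[1] <= now:
            return False
        del self._locks[arn]
        self.log.append((now, "release", arn, agent))
        return True
```

An agent would call `acquire(arn, agent_id)`, act only if it returns `True`, verify the result, and then call `release`. Logging failed acquisitions and expirations alongside acquires and releases is a design choice made here so the same log can later feed contention metrics.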

Ordering primitives

Some actions must happen in a specific order across agents. "Drain before terminate" is the canonical example.

Use a workflow engine for these (Temporal, Step Functions, or a custom one). The engine enforces the order; agents subscribe to the steps it dispatches.

Per-agent locks are insufficient for cross-agent ordering: a lock decides who may act, not which action goes first. The workflow engine is the missing layer.
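To make the division of labour concrete, here is a toy stand-in for the "custom" engine option. It does not use the Temporal or Step Functions APIs; `WorkflowEngine`, `subscribe`, and `run` are hypothetical names. The point it illustrates is that the engine owns the step order, so the order in which agents register makes no difference.

```python
from typing import Callable, Dict, List

Handler = Callable[[str], bool]  # receives a resource ID, returns True on success


class WorkflowEngine:
    """Toy engine: owns the step order; agents only register handlers."""

    def __init__(self, order: List[str]) -> None:
        self.order = list(order)
        self.handlers: Dict[str, Handler] = {}

    def subscribe(self, step: str, handler: Handler) -> None:
        if step not in self.order:
            raise ValueError(f"unknown step: {step}")
        self.handlers[step] = handler

    def run(self, resource: str) -> List[str]:
        """Run steps strictly in order; stop at the first missing or failed step."""
        completed: List[str] = []
        for step in self.order:
            handler = self.handlers.get(step)
            if handler is None or not handler(resource):
                break
            completed.append(step)
        return completed
```

With `WorkflowEngine(["drain", "terminate"])`, the terminating agent's handler can only run after the draining agent's handler has returned `True`, even if the terminating agent subscribed first.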

Detection in production

Lock acquisition failures: a metric. Spikes indicate contention; investigate which agents are racing.

TTL expirations: a metric. Frequent expirations mean an agent is holding locks longer than the TTL allows; break the work into smaller steps or shorten the time the lock is held.
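Both metrics can be derived from the lock log. The sketch below assumes a hypothetical log shape of `(event, arn, agent)` tuples with event names `acquire_failed` and `expire`; neither the shape nor the names come from a real system. It counts failures per resource (to see where agents are racing) and expirations per agent (to see which agent is running long).

```python
from collections import Counter
from typing import Iterable, Tuple

LockEvent = Tuple[str, str, str]  # (event, arn, agent): an assumed log shape


def contention_metrics(events: Iterable[LockEvent]) -> Tuple[Counter, Counter]:
    """Return (acquire failures per ARN, TTL expirations per agent)."""
    failures: Counter = Counter()
    expirations: Counter = Counter()
    for event, arn, agent in events:
        if event == "acquire_failed":
            failures[arn] += 1
        elif event == "expire":
            expirations[agent] += 1
    return failures, expirations
```

In practice these counts would be emitted to whatever metrics system is already in use, so that spikes can trigger an alert.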

Inconsistent observed state: the trickiest to detect. Sometimes both agents have committed conflicting changes and the system's state is now wrong. Audit-log review surfaces these.
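One way to start an audit-log review is to flag changes to the same resource made by different agents within a short window. The sketch below is a heuristic under stated assumptions: the `Change` record, its fields, and the 60-second default window are all hypothetical, and a flagged pair is a candidate for human review, not proof of a race.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Change(NamedTuple):
    ts: float    # seconds since some epoch
    arn: str     # resource that was changed
    agent: str   # agent that committed the change
    action: str  # e.g. "rollback", "hotfix"


def overlapping_changes(changes: Iterable[Change],
                        window_seconds: float = 60.0) -> List[Tuple[Change, Change]]:
    """Pair each change with the previous change to the same resource
    when a different agent made it within the window."""
    suspects: List[Tuple[Change, Change]] = []
    last_by_arn: Dict[str, Change] = {}
    for change in sorted(changes, key=lambda c: c.ts):
        prev = last_by_arn.get(change.arn)
        if (prev is not None and prev.agent != change.agent
                and change.ts - prev.ts <= window_seconds):
            suspects.append((prev, change))
        last_by_arn[change.arn] = change
    return suspects
```

A rollback by one agent followed 20 seconds later by a hotfix from another on the same ARN would be returned as a suspect pair; the same two actions eight minutes apart would not.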

Architectural avoidance

Single-writer pattern: only one agent type can write to a given resource type. Other agents request changes via the writer.
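A minimal sketch of the single-writer idea, assuming the writer exposes a request queue: other agents enqueue change requests and only the writer applies them, one at a time. `SingleWriter`, `request`, and `process_pending` are hypothetical names, and the `apply` callback stands in for whatever actually mutates the resource.

```python
import queue
from typing import Callable, Tuple

Request = Tuple[str, str, str]  # (arn, action, requested_by)


class SingleWriter:
    """The only component allowed to mutate a given resource type."""

    def __init__(self, apply: Callable[[str, str, str], None]) -> None:
        self._requests: "queue.Queue[Request]" = queue.Queue()
        self._apply = apply

    def request(self, arn: str, action: str, requested_by: str) -> None:
        """Called by other agents: enqueue a change, never apply it directly."""
        self._requests.put((arn, action, requested_by))

    def process_pending(self) -> int:
        """Apply queued requests serially, in arrival order; return the count."""
        applied = 0
        while True:
            try:
                arn, action, requested_by = self._requests.get_nowait()
            except queue.Empty:
                return applied
            self._apply(arn, action, requested_by)
            applied += 1
```

Because every write passes through one serial loop, two agents can disagree about what should happen, but their changes can never interleave on the resource itself.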

Topic partitioning: route work by resource ARN to a single agent worker. Same resource always handled by the same worker; no concurrency issue.
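The routing rule can be as small as a stable hash of the ARN taken modulo the worker count. The sketch below is one possible approach; `worker_for` is a hypothetical helper. It uses `hashlib` instead of Python's built-in `hash()` because string hashes from `hash()` are randomized per process by default, which would send the same ARN to different workers on different routers.

```python
import hashlib


def worker_for(arn: str, num_workers: int) -> int:
    """Map a resource ARN to a worker index that is stable across processes."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    digest = hashlib.sha256(arn.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_workers
```

Note that with plain modulo routing, changing `num_workers` remaps most ARNs to new workers, so in-flight work should be finished or handed over before resizing the pool.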

Coalesce: if two requests for the same resource arrive within a window, treat them as one. Avoids wasted action and races.
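A minimal coalescing sketch, assuming a fixed window measured from the first accepted request: later requests for the same resource inside that window are folded into the first. The `Coalescer` class and the 30-second default are illustrative assumptions, not values from a real system.

```python
import time
from typing import Callable, Dict


class Coalescer:
    """Accept the first request per resource; absorb repeats within the window."""

    def __init__(self, window_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so time can be faked
        self._last_accepted: Dict[str, float] = {}  # arn -> time of last accepted request

    def should_act(self, arn: str) -> bool:
        """Return True if this request should trigger an action."""
        now = self.clock()
        last = self._last_accepted.get(arn)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: treat as already handled
        self._last_accepted[arn] = now
        return True
```

The window here is anchored to the first accepted request and is not extended by duplicates, so a steady stream of requests still results in one action per window.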