Race Conditions Between Independent SRE Agents

Two agents fix the same thing, or one undoes the other. Locking and ordering primitives prevent these races without bottlenecking response.

Examples of races

Three race patterns recur. Two agents try to restart the same pod (the second restart finishes before the first, so neither agent can trust that the state it observes is the result of its own action); one agent rolls back a deploy while another applies a hotfix (the hotfix is lost); two agents try to drain different replicas of the same service simultaneously (both succeed, and the service goes to zero capacity).
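The third pattern is a check-then-act race, and a small deterministic replay makes it concrete. This is a minimal sketch: the replica names, the `min_serving` threshold, and the function names are assumptions made for illustration, not part of any real system.

```python
def can_drain(serving, replica, min_serving=1):
    """Check: is it safe to drain `replica` given what is serving right now?"""
    return replica in serving and len(serving) - 1 >= min_serving


def racy_interleaving():
    """Both agents check first, then both act on an answer that is now stale."""
    serving = {"replica-1", "replica-2"}      # hypothetical replica names
    ok_a = can_drain(serving, "replica-1")    # agent A checks
    ok_b = can_drain(serving, "replica-2")    # agent B checks before A has acted
    if ok_a:
        serving.discard("replica-1")
    if ok_b:
        serving.discard("replica-2")
    return len(serving)                       # replicas still serving


def serialized():
    """Each agent's check and act run as one unit, one agent after the other."""
    serving = {"replica-1", "replica-2"}
    for replica in ("replica-1", "replica-2"):
        if can_drain(serving, replica):
            serving.discard(replica)
    return len(serving)
```

In the racy version each drain is individually valid, yet nothing is left serving; in the serialized version the second agent's check sees the first agent's drain and declines.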

Locks for mutual exclusion

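Locks are the basic primitive. Lock by resource ARN: each agent acquires the lock before acting on the resource and releases it only after verifying the result. Pick the ARN at the level where the invariant lives; for the drain race above, that is the service, not the individual replica. Give each lock a TTL (5 minutes by default) so that it expires automatically and a crashed agent does not block the others forever. Log every acquire and release to provide an audit trail.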
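The lock rules above can be sketched as a small in-memory table. This is only an illustration under stated assumptions: `LockTable`, its method names, the event names in the log, and the injectable `clock` are all hypothetical, and a real deployment would need a shared store with an atomic conditional write rather than a Python dict.

```python
import time


class LockTable:
    """In-memory stand-in for a shared lock store (hypothetical, single process)."""

    def __init__(self, default_ttl=300.0, clock=time.monotonic):
        self.default_ttl = default_ttl   # 5 minutes, the default from the text
        self.clock = clock               # injectable so expiry can be tested
        self._held = {}                  # arn -> (agent, expires_at)
        self.log = []                    # audit trail: (timestamp, event, arn, agent)

    def _record(self, event, arn, agent):
        self.log.append((self.clock(), event, arn, agent))

    def acquire(self, arn, agent, ttl=None):
        """Take the lock if it is free or its TTL has passed; return success."""
        now = self.clock()
        current = self._held.get(arn)
        if current is not None and current[1] > now:
            self._record("acquire_failed", arn, agent)
            return False
        if current is not None:
            # The previous holder never released; its TTL ran out.
            self._record("expired", arn, current[0])
        self._held[arn] = (agent, now + (self.default_ttl if ttl is None else ttl))
        self._record("acquire", arn, agent)
        return True

    def release(self, arn, agent):
        """Release only a lock this agent still holds; return success."""
        now = self.clock()
        current = self._held.get(arn)
        if current is None or current[0] != agent or current[1] <= now:
            self._record("release_rejected", arn, agent)
            return False
        del self._held[arn]
        self._record("release", arn, agent)
        return True
```

An agent would wrap its work as acquire, act, verify, release, and skip the action entirely when `acquire` returns False. Rejecting a release from an agent whose TTL has already passed is a design choice in this sketch: it keeps a slow agent from releasing a lock that now belongs to someone else.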

Ordering primitives

Some actions need cross-agent ordering. “Drain before terminate” is the canonical example. Per-agent locks are insufficient here: a lock guarantees that only one agent acts at a time, not which agent acts first. A workflow engine (Temporal, Step Functions, or a custom one) is the missing layer; the engine enforces the order and agents subscribe to it for their steps.
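To show what "the engine enforces order" means, here is a toy in-process stand-in. It does not use the Temporal or Step Functions APIs; `OrderedWorkflow`, `OutOfOrderError`, and the step names are hypothetical, and a real engine would also persist state and handle retries.

```python
import threading


class OutOfOrderError(RuntimeError):
    """Raised when an agent attempts a step before its predecessors finish."""


class OrderedWorkflow:
    """Toy workflow engine: steps run strictly in the listed order."""

    def __init__(self, steps):
        self._steps = list(steps)
        self._next = 0                   # index of the only step allowed to run
        self._mutex = threading.Lock()   # protects _next and history
        self.history = []                # (step, agent) pairs in completion order

    def run(self, step, agent, action):
        """Run `action` for `step` only if every earlier step has completed."""
        with self._mutex:
            expected = self._steps[self._next] if self._next < len(self._steps) else None
            if step != expected:
                raise OutOfOrderError(
                    f"{agent} attempted {step!r}; expected {expected!r}"
                )
            action()                     # if this raises, the step stays pending
            self._next += 1
            self.history.append((step, agent))
```

With `OrderedWorkflow(["drain", "terminate"])`, an agent that calls `run("terminate", ...)` first gets an `OutOfOrderError` and its action never executes; the same call succeeds once another agent has completed `"drain"`.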

Detection in production

Three signals catch races in production. Lock acquisition failures (a metric; spikes indicate contention, so investigate which agents are racing); TTL expirations (a metric; frequent expirations mean an agent is holding the lock longer than the TTL, so budget the work or shorten the time the lock is held); inconsistent observed state (the trickiest; both agents may have committed conflicting changes, leaving the system in a wrong state, and audit-log review surfaces these).
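All three signals can be derived from one pass over an audit log. The sketch below assumes a hypothetical event shape, `(timestamp, kind, arn, agent)`, with kinds `acquire`, `acquire_failed`, `expired`, `release`, and `action`; none of these names come from a real tool. An `action` by an agent that does not hold the lock at that moment is flagged as a candidate for the third signal.

```python
from collections import Counter


def race_signals(events):
    """Summarize an audit log into contention, expiry, and unguarded-action signals."""
    failures = Counter()      # arn -> number of failed acquisitions
    expirations = Counter()   # arn -> number of TTL expirations
    holder = {}               # arn -> agent currently holding the lock
    unguarded = []            # actions taken without holding the lock
    # sorted() is stable, so events with equal timestamps keep their log order.
    for ts, kind, arn, agent in sorted(events, key=lambda e: e[0]):
        if kind == "acquire_failed":
            failures[arn] += 1
        elif kind == "expired":
            expirations[arn] += 1
            holder.pop(arn, None)
        elif kind == "acquire":
            holder[arn] = agent
        elif kind == "release":
            holder.pop(arn, None)
        elif kind == "action" and holder.get(arn) != agent:
            unguarded.append((ts, arn, agent))
    return failures, expirations, unguarded
```

The two counters map directly onto the first two metrics. The `unguarded` list is only a starting point for review: it catches the case where a slow agent keeps acting after its TTL has expired and another agent has taken over, but not conflicts that never touched the lock at all.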

Architectural avoidance

Three architectural patterns avoid races entirely. Single-writer pattern (only one agent type writes to a given resource type; other agents request changes via the writer); topic partitioning (route work by resource ARN to a single agent worker, so the same resource is always handled by the same worker); coalescing (if two requests for the same resource arrive within a window, treat them as one).
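Partitioning and coalescing are both small enough to sketch. The code below is illustrative only: `worker_for`, `Coalescer`, and the choice of keying the coalescing window on the ARN alone are assumptions. A stable hash from `hashlib` is used instead of Python's built-in `hash()`, which is randomized per process for strings and would route the same ARN differently on different workers.

```python
import hashlib


def worker_for(arn, num_workers):
    """Map an ARN to a worker index that is stable across processes and restarts."""
    digest = hashlib.sha256(arn.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


class Coalescer:
    """Treat requests for the same resource inside a time window as one request."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self._last_accepted = {}   # arn -> timestamp of the last accepted request

    def accept(self, arn, now):
        """Return True if this request should run, False if it is folded away."""
        last = self._last_accepted.get(arn)
        if last is not None and now - last < self.window:
            return False           # a request for this resource is already in flight
        self._last_accepted[arn] = now
        return True
```

Note that changing `num_workers` remaps most ARNs under this simple modulo scheme, so resizing the worker pool needs its own care. The window here is measured from the last accepted request rather than sliding with each duplicate; either choice could be reasonable.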