Race Conditions Between Independent SRE Agents
Two agents fix the same thing. Or one undoes the other. The locking and ordering primitives that prevent races without bottlenecking response.
Examples of races
Three race patterns recur. Two agents restart the same pod: the second restart finishes before the first, and the observed behaviour is incoherent. One agent rolls back a deploy while another applies a hotfix: the hotfix is lost. Two agents drain different replicas of the same service at the same time: both drains succeed, and the service drops to zero capacity.
- Concurrent pod restart. Second finishes before first; observed behaviour incoherent.
- Rollback vs hotfix. One rolls back; another applies hotfix; the hotfix is lost.
- Concurrent replica drains. Both succeed; service capacity goes to zero.
- Per-resource race surface. Each race is a conflict over one shared resource; coordination must protect that resource.
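The replica-drain race is a check-then-act problem: each agent checks capacity, then acts on an answer that is already stale. The minimal Python sketch below replays that interleaving deterministically. The replica names, the `MIN_HEALTHY` floor, and the helper functions are hypothetical stand-ins for whatever the real orchestrator exposes.

```python
from typing import Dict

MIN_HEALTHY = 1  # hypothetical capacity floor each agent tries to respect


def healthy_count(replicas: Dict[str, str]) -> int:
    """Count replicas currently marked healthy."""
    return sum(1 for state in replicas.values() if state == "healthy")


def safe_to_drain(observed_healthy: int) -> bool:
    """Safe only if one replica can leave and the floor still holds."""
    return observed_healthy - 1 >= MIN_HEALTHY


def simulate_interleaved_drains() -> int:
    """Replay the bad interleaving: both agents read before either writes."""
    replicas = {"web-1": "healthy", "web-2": "healthy"}
    seen_by_a = healthy_count(replicas)  # agent A observes 2 healthy
    seen_by_b = healthy_count(replicas)  # agent B also observes 2 healthy
    if safe_to_drain(seen_by_a):
        replicas["web-1"] = "draining"   # A acts on its snapshot
    if safe_to_drain(seen_by_b):
        replicas["web-2"] = "draining"   # B acts on a now-stale snapshot
    return healthy_count(replicas)


if __name__ == "__main__":
    print(simulate_interleaved_drains())  # prints 0
```

Each check was correct at the moment it ran, yet the combined result breaks the floor. The check and the action have to be protected together, not separately.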
Locks for mutual exclusion
Locks are the basic primitive. Lock by resource ARN: each agent acquires the lock before acting on the resource and releases it only after verifying the result. Give each lock a TTL, 5 minutes by default, so that it expires automatically and a crashed agent cannot block others forever. Record every acquire and release in a lock log to keep an audit trail.
- Lock by resource ARN. The natural lock key; each resource has its own lock.
- Acquire-act-verify-release. The standard cycle; the lock is held until the change is verified, not just applied.
- 5-minute TTL. Crashed agent doesn’t block forever; the safety net.
- Lock log. Every acquire and release; audit trail for race incidents.
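The acquire-act-verify-release cycle with a TTL and a lock log can be sketched as follows. This is an in-memory illustration only: `LockTable`, `Lease`, and `run_locked` are hypothetical names, and a real deployment would need a shared store whose acquire is an atomic compare-and-set. The clock is injectable so that expiry can be exercised without waiting five minutes.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Lease:
    owner: str         # agent that holds the lock
    token: str         # proves ownership at release time
    expires_at: float  # clock value after which the lock is free again


@dataclass
class LockTable:
    """In-memory stand-in for a shared lock store (illustrative only)."""
    ttl_seconds: float = 300.0                   # 5-minute default TTL
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    leases: Dict[str, Lease] = field(default_factory=dict)
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def acquire(self, resource_arn: str, agent_id: str) -> Optional[str]:
        now = self.clock()
        held = self.leases.get(resource_arn)
        if held is not None and held.expires_at > now:
            self.log.append(("acquire_failed", resource_arn, agent_id))
            return None
        if held is not None:
            # Previous holder never released; the TTL frees the resource.
            self.log.append(("expired", resource_arn, held.owner))
        token = uuid.uuid4().hex
        self.leases[resource_arn] = Lease(agent_id, token, now + self.ttl_seconds)
        self.log.append(("acquire", resource_arn, agent_id))
        return token

    def release(self, resource_arn: str, token: str) -> bool:
        held = self.leases.get(resource_arn)
        if held is None or held.token != token:
            return False  # lease expired and was taken over; leave it alone
        del self.leases[resource_arn]
        self.log.append(("release", resource_arn, held.owner))
        return True


def run_locked(table: LockTable, resource_arn: str, agent_id: str,
               act: Callable[[], None], verify: Callable[[], bool]) -> str:
    """Acquire, act, verify, release. Skips the action if the lock is held."""
    token = table.acquire(resource_arn, agent_id)
    if token is None:
        return "skipped"
    try:
        act()
        return "verified" if verify() else "unverified"
    finally:
        table.release(resource_arn, token)
```

The token check in `release` matters once TTLs exist: an agent whose lease expired mid-action must not delete the lock that another agent has since acquired.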
Ordering primitives
Some actions need cross-agent ordering; “drain before terminate” is the canonical case. Use a workflow engine for these (Temporal, Step Functions, or a custom one): the engine enforces the order, and agents subscribe to it for their steps. Per-agent locks are insufficient here, because a lock guarantees that only one agent acts at a time, not which agent acts first. The workflow engine is the missing layer.
- Drain before terminate. The canonical ordering case; locks insufficient.
- Workflow engine for order. Temporal, Step Functions, custom; enforces sequence.
- Per-agent locks insufficient. Cross-agent ordering needs a coordination layer.
- Per-workflow contract. Document each cross-agent flow and its step order, so every agent sequences against the same definition.
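To make the ordering guarantee concrete, here is a toy stand-in for the "custom" engine option. It does not use the Temporal or Step Functions APIs. `OrderedWorkflow` and `OutOfOrderError` are hypothetical names, and the sketch assumes agents report step completion to a single coordinator that owns the step list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class OutOfOrderError(Exception):
    """Raised when an agent reports a step before its predecessors are done."""


@dataclass
class OrderedWorkflow:
    resource_arn: str
    steps: List[str]                       # required order, e.g. drain, terminate
    completed: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the only step currently allowed, or None when finished."""
        done = len(self.completed)
        return self.steps[done] if done < len(self.steps) else None

    def complete(self, step: str, agent_id: str) -> None:
        """Record a step, rejecting anything other than the next one in order."""
        expected = self.next_step()
        if step != expected:
            raise OutOfOrderError(
                f"{agent_id} attempted {step!r} on {self.resource_arn}; "
                f"expected {expected!r}"
            )
        self.completed.append(step)


if __name__ == "__main__":
    wf = OrderedWorkflow("arn:aws:ec2:us-east-1:123456789012:instance/i-0abc",
                         ["drain", "terminate"])
    try:
        wf.complete("terminate", "agent-b")   # rejected: drain has not happened
    except OutOfOrderError as err:
        print("rejected:", err)
    wf.complete("drain", "agent-a")
    wf.complete("terminate", "agent-b")       # accepted now
```

The terminating agent never needs to know which agent drains. It only asks the coordinator whether its step is next, which is the property a lock alone cannot provide.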
Detection in production
Three signals catch races in production. Lock acquisition failures are a metric: spikes indicate contention, so investigate which agents are racing. TTL expirations are also a metric: frequent expirations mean an agent is taking too long, so budget the work or shorten the lock window. Inconsistent observed state is the trickiest: sometimes both agents have committed conflicting changes and the system’s state is now wrong. Audit-log review surfaces these cases.
- Lock acquisition failures. Metric; spikes indicate contention; investigate the racing agents.
- TTL expirations. Metric; frequent expirations mean an agent is taking too long.
- Inconsistent observed state. Both agents committed conflicting changes; audit-log review surfaces them.
- Per-symptom investigation playbook. Each symptom has a documented response; supports fast diagnosis.
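The first two signals can be derived directly from a lock log. The sketch below assumes log entries shaped as `(event, resource_arn, agent_id)` tuples with hypothetical event names `acquire_failed` and `expired`. The threshold of 3 is an arbitrary placeholder, not a recommendation. The third signal, inconsistent state, is not detectable this way and still needs audit-log review.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Event = Tuple[str, str, str]  # (event kind, resource ARN, agent id)


def summarize_lock_events(events: Iterable[Event],
                          failure_threshold: int = 3) -> dict:
    """Count acquisition failures and TTL expirations per resource."""
    failures: Counter = Counter()
    expirations: Counter = Counter()
    blocked: Dict[str, Set[str]] = defaultdict(set)
    for kind, arn, agent in events:
        if kind == "acquire_failed":
            failures[arn] += 1
            blocked[arn].add(agent)   # who lost the race for this resource
        elif kind == "expired":
            expirations[arn] += 1     # a holder never released in time
    contended: List[str] = sorted(
        arn for arn, count in failures.items() if count >= failure_threshold
    )
    return {
        "failures": dict(failures),
        "expirations": dict(expirations),
        "contended": contended,
        "blocked_agents": {arn: sorted(agents) for arn, agents in blocked.items()},
    }


if __name__ == "__main__":
    sample = [
        ("acquire", "arn:svc/web", "agent-a"),
        ("acquire_failed", "arn:svc/web", "agent-b"),
        ("acquire_failed", "arn:svc/web", "agent-c"),
        ("acquire_failed", "arn:svc/web", "agent-b"),
        ("expired", "arn:svc/db", "agent-a"),
    ]
    print(summarize_lock_events(sample))
```

Grouping by resource rather than by agent keeps the output aligned with the investigation question: which resource is contended, and which agents keep colliding on it.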
Architectural avoidance
Three architectural patterns avoid races entirely. Single-writer: only one agent type writes to a given resource type, and other agents request changes through that writer. Topic partitioning: route work by resource ARN to a single agent worker, so the same resource is always handled by the same worker. Coalescing: if two requests for the same resource arrive within a window, treat them as one.
- Single-writer pattern. One agent type writes; others request changes via the writer.
- Topic partitioning by ARN. Same resource always handled by same worker; no concurrency.
- Coalesce within window. Two requests in a window become one action.
- Per-pattern blast-radius reduction. Architecture removes the race instead of detecting it.
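Partitioning by ARN and coalescing within a window are both small enough to sketch. The names `worker_for` and `Coalescer`, the worker count, and the 30-second window are illustrative assumptions. The partitioner uses SHA-256 rather than Python's built-in `hash()`, because `hash()` on strings is randomized per process and would route the same ARN differently on different hosts.

```python
import hashlib
from typing import Dict


def worker_for(resource_arn: str, worker_count: int) -> int:
    """Map an ARN to a worker index that is stable across processes."""
    digest = hashlib.sha256(resource_arn.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % worker_count


class Coalescer:
    """Treat requests for the same resource within a window as one action."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._last_accepted: Dict[str, float] = {}

    def should_act(self, resource_arn: str, now: float) -> bool:
        last = self._last_accepted.get(resource_arn)
        if last is not None and now - last < self.window_seconds:
            return False  # folded into the request already accepted
        self._last_accepted[resource_arn] = now
        return True


if __name__ == "__main__":
    arn = "arn:aws:ecs:us-east-1:123456789012:service/prod/web"
    # Same ARN always lands on the same worker index.
    print(worker_for(arn, 8) == worker_for(arn, 8))
    dedupe = Coalescer(window_seconds=30.0)
    print(dedupe.should_act(arn, now=0.0))    # True: first request acts
    print(dedupe.should_act(arn, now=10.0))   # False: inside the window
    print(dedupe.should_act(arn, now=31.0))   # True: window has passed
```

Simple modulo routing reshuffles most ARNs when the worker count changes, so a resize needs a quiet period or a consistent-hashing scheme. The coalescer here keys on the resource alone; if different actions on one resource must not be merged, the key would need to include the action as well.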