The Acceptable-Loss Conversation Every SRE Team Must Have
Some failures cannot be prevented at acceptable cost. This is the conversation that surfaces which losses are acceptable, who has to agree, and how the agreement is documented.
The framing
Not every failure can be prevented. Some are too expensive to engineer against.
The conversation covers which failures are acceptable, at what loss budget, and with what mitigation.
Without this conversation, every failure feels unacceptable; the team chases impossible standards.
Examples
A 5-second blip during failover: acceptable up to once per quarter. Eliminating it would cost months of engineering effort.
A regional outage during a vendor incident: acceptable. The mitigation would be multi-cloud, and its cost may not be justified.
A specific class of customer-data exposure: never acceptable. The engineering effort spent preventing it has no upper bound.
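A frequency budget like "once per quarter" can be checked mechanically. The following is a minimal sketch, not a prescribed tool: the `LossBudget` record, the `within_budget` helper, and the calendar-quarter window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LossBudget:
    """Hypothetical record of how often a failure class may occur."""
    name: str
    # None = accepted with no agreed frequency cap; 0 = never acceptable.
    max_per_quarter: Optional[int]


def quarter_of(d: date) -> Tuple[int, int]:
    """Map a date to (year, calendar quarter 1..4)."""
    return (d.year, (d.month - 1) // 3 + 1)


def within_budget(budget: LossBudget,
                  occurrences: List[date],
                  quarter: Tuple[int, int]) -> bool:
    """True if the occurrences in the given quarter stay inside the budget."""
    if budget.max_per_quarter is None:
        return True
    count = sum(1 for d in occurrences if quarter_of(d) == quarter)
    return count <= budget.max_per_quarter


# The examples above, expressed as budgets (names are illustrative).
failover_blip = LossBudget("5-second failover blip", max_per_quarter=1)
regional_outage = LossBudget("regional outage during vendor incident", None)
data_exposure = LossBudget("customer-data exposure", max_per_quarter=0)
```

With this shape, a second failover blip in the same quarter puts the item over budget and signals that the agreement needs revisiting, while any single data-exposure event is over budget by definition.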
Document the agreement
Each acceptable-loss item records what it is, why it is acceptable, what the mitigation is, and what would change the answer.
Review it annually. Risk tolerances shift, and the document has to follow.
Keep it visible to the team. The on-call engineer should know what is in scope and what is out before the page fires.
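One way to keep the agreement reviewable is to store each item as a structured record. This is a sketch under stated assumptions: the `AcceptableLossItem` type, its field names, the `last_reviewed` date, and the 365-day interval are all hypothetical, chosen to mirror the four questions and the annual review described here.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class AcceptableLossItem:
    """Hypothetical record for one entry in the acceptable-loss document."""
    what: str
    why_acceptable: str
    mitigation: str
    would_change_answer: str
    last_reviewed: date  # assumption: tracked so the annual review can be enforced


# Assumption: "annually" is approximated as 365 days.
REVIEW_INTERVAL = timedelta(days=365)


def is_review_due(item: AcceptableLossItem, today: date) -> bool:
    """True once a full review interval has passed since the last review."""
    return today - item.last_reviewed >= REVIEW_INTERVAL


# Illustrative entry; the last two values are placeholders, not real data.
regional_outage = AcceptableLossItem(
    what="Regional outage during a vendor incident",
    why_acceptable="Multi-cloud mitigation may cost more than it is worth",
    mitigation="Multi-cloud (not adopted)",
    would_change_answer="PLACEHOLDER: fill in during the conversation",
    last_reviewed=date(2024, 1, 15),
)
```

A scheduled job could call `is_review_due` on every item and flag the stale ones, so the review happens because of the record rather than someone's memory.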