AWS S3 2017 Outage Postmortem

Lessons from the famous incident.

Overview

On 28 February 2017, an AWS engineer debugging S3 ran a command intended to remove a small number of S3 servers. A mistyped input removed far more capacity than intended. S3 in us-east-1 went down for roughly four hours, and the failure cascaded through the many services that depended on it. The lesson is not the typo; it is what happens when an ops command has an unbounded blast radius and runs without a guard.

The approach

The lesson translates into three habits that any cloud operations team can adopt: validate ops commands before execution, make the blast radius explicit, and require manual approval for operations that can take production down.
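
The three habits can be sketched as a small pre-execution guard. This is a minimal illustration, not a description of AWS's tooling: RemovalRequest, blast_radius, check, and the 5% APPROVAL_THRESHOLD are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed policy value (hypothetical): removals above this fraction of the
# fleet need a second person's sign-off.
APPROVAL_THRESHOLD = 0.05


@dataclass(frozen=True)
class RemovalRequest:
    """A hypothetical request to take servers out of a fleet."""
    fleet_size: int                    # servers currently in service
    to_remove: int                     # servers the operator asked to remove
    min_capacity: int                  # fewest servers the fleet can run on
    approved_by: Optional[str] = None  # reviewer who signed off, if any


def blast_radius(req: RemovalRequest) -> float:
    """Fraction of the fleet this command would remove."""
    return req.to_remove / req.fleet_size


def check(req: RemovalRequest) -> str:
    """Return 'allow', 'needs_approval', or 'reject' for a removal request."""
    # Habit 1: validate the inputs before anything executes.
    if req.fleet_size <= 0 or req.to_remove <= 0:
        return "reject"
    if req.to_remove > req.fleet_size:
        return "reject"
    # Hard floor: never drop below minimum capacity, approved or not.
    if req.fleet_size - req.to_remove < req.min_capacity:
        return "reject"
    # Habits 2 and 3: compute the blast radius explicitly and require
    # manual approval when it exceeds the policy threshold.
    if blast_radius(req) > APPROVAL_THRESHOLD and req.approved_by is None:
        return "needs_approval"
    return "allow"


if __name__ == "__main__":
    print(check(RemovalRequest(1000, 10, 800)))                       # small removal
    print(check(RemovalRequest(1000, 100, 800)))                      # large, unapproved
    print(check(RemovalRequest(1000, 100, 800, approved_by="alice"))) # large, approved
    print(check(RemovalRequest(1000, 300, 800, approved_by="alice"))) # below the floor
```

The hard floor is deliberately checked before approval, so a sign-off cannot override it. That ordering is a design choice in this sketch; it loosely follows the kind of safeguard AWS described after the incident, where the tool was changed to refuse removals that would take a subsystem below its minimum required capacity.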

Why this compounds

The lessons of the 2017 S3 outage reshaped operations playbooks across the industry. Each architecture review that applies them removes a little more ops-command risk, and over years those small reductions add up.