GitHub 2018 Database Incident
Learnings.
Overview
The October 2018 GitHub database incident was a 24-hour data inconsistency event triggered by a 43-second network partition that caused MySQL primaries in different regions to fail over to different replicas. Recovery took longer than the partition itself by four orders of magnitude because the team chose data integrity over fast restoration. The lesson generalises: brief connectivity failures can produce extended impact when consensus protocols disagree.
- Brief partition, long impact. 43 seconds of network trouble produced 24+ hours of degraded service. The amplification factor is what makes the case study load-bearing.
- Failover divergence. East and West coast clusters elected different primaries during the partition. State diverged the moment the partition healed.
- Integrity over speed. Reconciling diverged writes correctly took precedence over restoring service. The trade-off matched the stakes (a code-hosting platform).
- Public communication plus published postmortem. Status-page updates ran throughout; the detailed postmortem followed days later and became required reading across the industry.
The approach
Five disciplines turn the GitHub lesson into operational practice: detect split-brain quickly, prefer integrity to speed, communicate transparently, publish the postmortem, test failover before production does. None are exotic; the discipline is doing all of them, every time.
- Split-brain detection. Monitoring that surfaces divergent primaries within seconds. The faster the detection, the smaller the divergence to reconcile.
- Integrity over speed in recovery. Reconcile correctly even if it takes hours. Premature restoration produces silent data loss.
- Status-page transparency. Public updates throughout long incidents. Customers tolerate downtime; they do not tolerate silence.
- Postmortem publication plus failover game-days. Public PM teaches the industry; quarterly game-days validate the design before a real partition does.
Why this compounds
Every architecture review that applies the lesson reduces the team's split-brain exposure. Failover game-days build the muscle that makes a real partition feel routine. By year two, integrity-first recovery is the team's default and customers see fewer silent-loss incidents than competitors.
- Reduced split-brain risk. Detection and prevention patterns close the obvious failure modes. Real protection rather than theoretical resilience.
- Better incident response. Integrity-first recovery preserves data. Customers keep their work even when the platform is degraded.
- Operational maturity. Failover game-days build confidence. Real partitions feel routine.
- Year-one investment, year-two habit. First review is investment; by year two, integrity-first recovery is the team's default.