GitHub 2018 Database Incident

Lessons learned

Overview

The October 2018 GitHub database incident was a roughly 24-hour period of degraded service and data inconsistency, triggered by a 43-second network partition. During the partition, automated failover promoted MySQL primaries in a second region, and once connectivity returned each region held writes the other had never replicated. Recovery took about 2,000 times longer than the partition itself, roughly three orders of magnitude, because the team chose data integrity over fast restoration. The lesson generalises: a brief connectivity failure can produce extended impact when automated failover leaves two sites with divergent writes.
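
The gap between the partition and the recovery can be checked with simple arithmetic. The sketch below uses only the two durations quoted above (43 seconds and about 24 hours); the constant names are illustrative.

```python
import math

PARTITION_SECONDS = 43          # length of the network partition
RECOVERY_SECONDS = 24 * 3600    # approximate length of the degraded period

# How many times longer the recovery was than the partition.
ratio = RECOVERY_SECONDS / PARTITION_SECONDS

# Orders of magnitude are the base-10 logarithm of that ratio.
orders = math.log10(ratio)

print(f"ratio = {ratio:.0f}x, about {orders:.1f} orders of magnitude")
# prints: ratio = 2009x, about 3.3 orders of magnitude
```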

The approach

Five disciplines turn the GitHub lesson into operational practice: detect split-brain quickly, prefer integrity to speed, communicate transparently, publish the postmortem, and test failover before production does. None is exotic; the hard part is doing all of them, every time.
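
As one way to picture the first discipline, here is a minimal sketch of a split-brain check. It is not GitHub's tooling: NodeStatus, find_split_brain, diverged, and the host names are all hypothetical. It assumes a monitoring job can already tell whether each node accepts writes (for MySQL, something like the read_only setting) and can list the transactions each side has applied (plain Python sets stand in for that here).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    """One health-check sample for a database node (hypothetical shape)."""
    cluster: str
    host: str
    writable: bool  # assumed to come from the node itself, e.g. read_only = OFF


def find_split_brain(samples):
    """Return {cluster: [hosts]} for every cluster with more than one writable node."""
    writers = defaultdict(list)
    for sample in samples:
        if sample.writable:
            writers[sample.cluster].append(sample.host)
    return {c: sorted(hosts) for c, hosts in writers.items() if len(hosts) > 1}


def diverged(applied_a, applied_b):
    """True when each side has applied transactions the other has not.

    Plain replication lag (one set is a subset of the other) is not divergence.
    """
    return bool(applied_a - applied_b) and bool(applied_b - applied_a)


# Hypothetical samples: the "users" cluster has a writable node in two regions.
samples = [
    NodeStatus("users", "east-db-1", True),
    NodeStatus("users", "west-db-1", True),
    NodeStatus("users", "west-db-2", False),
    NodeStatus("repos", "east-db-7", True),
    NodeStatus("repos", "west-db-7", False),
]

alerts = find_split_brain(samples)
print(alerts)

# Each side holds a transaction the other lacks: diverged.
print(diverged({1, 2, 3, 4}, {1, 2, 3, 5}))
# One side is simply behind the other: lag, not divergence.
print(diverged({1, 2, 3}, {1, 2, 3, 4}))
```

The two checks answer different questions. The first flags that two nodes are accepting writes at once; the second tells you whether reconciliation is needed, which is what forces the integrity-versus-speed decision.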

Why this compounds

Every architecture review that applies the lesson reduces the team's split-brain exposure. Failover game-days build the muscle that makes a real partition feel routine. Over time, integrity-first recovery becomes the team's default, and customers should see fewer silent data-loss incidents.