Database Replicas: Read Replicas vs Failover Replicas
Read replicas and failover replicas look similar but serve different purposes. Conflating them creates surprises.
Why distinguish
The two kinds of replica look the same in the documentation, which is why they get conflated. What differs is the operational role each one is built for, and the mix-up surfaces at the worst possible moment.
- Read replica. Serves read traffic; replication lag tolerated; not promotion-ready.
- Failover replica. Standby; minimal lag; ready to take over on primary failure.
- Same shape, different intent. Both are replicas; the operational role is what differs.
- Conflation cost. The cost of treating one as the other shows up during an incident, not on a quiet day.
Four criteria
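- 1. Lag tolerance. Read replicas can lag by minutes; failover replicas must stay within seconds of the primary.
- 2. Promotion-ready. A failover replica must be able to take over at any time; a read replica is not expected to.
- 3. Resource sizing. A failover replica matches the primary's size; a read replica can be smaller.
- 4. Application connection. Read traffic uses a separate read endpoint; failover happens behind the primary endpoint.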
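The four criteria can be written down as a small data structure, which makes the difference between the two roles explicit and checkable. This is a minimal sketch: the names `Role`, `ReplicaProfile`, and `differing_criteria`, the endpoint labels, and the numeric lag values are all illustrative assumptions, not settings from any particular database.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    READ = "read"
    FAILOVER = "failover"


@dataclass(frozen=True)
class ReplicaProfile:
    role: Role
    max_lag_seconds: float      # criterion 1: lag tolerance
    promotion_ready: bool       # criterion 2: can it take over?
    matches_primary_size: bool  # criterion 3: resource sizing
    endpoint: str               # criterion 4: how the application connects


# Illustrative values only; thresholds and endpoint names are assumptions.
READ_PROFILE = ReplicaProfile(Role.READ, 300.0, False, False, "read-endpoint")
FAILOVER_PROFILE = ReplicaProfile(Role.FAILOVER, 5.0, True, True, "primary-endpoint")

CRITERIA = ("max_lag_seconds", "promotion_ready", "matches_primary_size", "endpoint")


def differing_criteria(a: ReplicaProfile, b: ReplicaProfile) -> list:
    """Return the names of the criteria on which two profiles disagree."""
    return [name for name in CRITERIA if getattr(a, name) != getattr(b, name)]
```

With these example profiles, `differing_criteria(READ_PROFILE, FAILOVER_PROFILE)` returns all four criterion names: the two roles share a shape but agree on none of the operational settings.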
Configuration differences
The two roles need different configuration. Replication mode, instance sizing, and application connection paths all diverge.
- Read replica. Async replication is fine; smaller instance acceptable; app connects via separate read endpoint.
- Failover replica. Synchronous if possible; same size as primary; app discovers via DNS or endpoint switch.
- Endpoint shape. Read traffic goes to a load-balanced read endpoint; failover happens at the primary endpoint level.
- Lag monitoring. Both monitored, with different thresholds; failover lag matters in seconds, read lag in minutes.
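The role-specific lag thresholds can be sketched as a single check that looks up the limit by role. The numbers below (5 seconds for failover, 300 seconds for read) are assumptions chosen only to reflect the "seconds versus minutes" distinction; `LAG_THRESHOLD_SECONDS` and `lag_alert` are hypothetical names, and the lag value would come from whatever your database exposes.

```python
# Assumed thresholds: failover lag is judged in seconds, read lag in minutes.
LAG_THRESHOLD_SECONDS = {
    "failover": 5.0,
    "read": 300.0,
}


def lag_alert(role: str, lag_seconds: float) -> bool:
    """Return True when replication lag exceeds the threshold for the role."""
    if role not in LAG_THRESHOLD_SECONDS:
        # An undocumented role is itself a problem worth surfacing.
        raise ValueError(f"unknown replica role: {role!r}")
    return lag_seconds > LAG_THRESHOLD_SECONDS[role]
```

The same 30 seconds of lag gives different answers: `lag_alert("failover", 30.0)` is `True`, while `lag_alert("read", 30.0)` is `False`.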
Conflation mistake
The two failure modes from conflation are mirror images. Each one ruins a different day; both are avoidable with explicit role assignment.
- Read replica as failover. Primary dies and you fail over to a replica that lags by minutes; every write in that window is lost.
- Failover replica as read source. Read load slows replication; lag grows; promotion-readiness compromised.
- Mixed role. One replica trying to be both; serves neither role well; expect either data loss or slow failover.
- Documented role. Each replica's role written down; no ambiguity at 3am when the primary is down.
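One way to make the documented role enforceable is to have the failover procedure consult it before promoting anything. The sketch below is hypothetical: `Replica`, `pick_promotion_candidate`, and the 5-second default are illustrative assumptions, and the lag figures would come from your own monitoring.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    role: str           # "read" or "failover", as written down for this replica
    lag_seconds: float  # most recent replication lag measurement


def pick_promotion_candidate(
    replicas: List[Replica], max_lag_seconds: float = 5.0
) -> Optional[Replica]:
    """Pick the least-lagged failover replica, or None if none qualifies.

    Read replicas are never considered, however fresh they happen to be.
    """
    candidates = [
        r for r in replicas
        if r.role == "failover" and r.lag_seconds <= max_lag_seconds
    ]
    return min(candidates, key=lambda r: r.lag_seconds, default=None)
```

Returning `None` rather than falling back to a read replica is deliberate: it forces a human decision instead of silently accepting the data loss described above.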
Antipatterns
- Read replica as failover. Data loss risk.
- Failover replica handling read load. Promotion delayed.
- One replica for both. Confused responsibilities.
What to do this week
Three moves. (1) Assign an explicit role, read or failover, to every replica of your most-loaded database. (2) Measure replication lag on each replica and compare it against the threshold for its role. (3) Document the roles and their constraints so the next person on call inherits the knowledge.