Postgres Replication Lag
Detect; mitigate.
Overview
Postgres replication lag is the time difference between primary writes and replica visibility. Lag affects read consistency, failover safety, and analytics correctness. Catching lag, identifying cause, and mitigating quickly is the discipline; ignoring lag is how data loss happens at failover.
- Detection.
pg_stat_replicationexposes per-replica lag;pg_last_xact_replay_timestamp()shows the freshness of the latest replayed transaction. - Mitigation. WAL tuning, network bandwidth, replica resources. Each cause class has its own fix.
- Streaming vs logical. Streaming replication has different lag profile than logical. Mode selects what tuning matters.
- Read-after-write plus failover impact. Lag breaks read-from-replica patterns; lag at failover means data loss in proportion to the lag duration.
The approach
Three habits keep replication lag tractable: monitor pg_stat_replication continuously, alert on per-severity thresholds, and mitigate by identified cause rather than blanket scaling.
- Monitor pg_stat_replication.
write_lag,flush_lag,replay_lagall on the database dashboard. Three numbers tell the story. - Alert on thresholds. 30-second replay-lag warning; 5-minute critical. Per-severity threshold matches the actual data-loss tolerance.
- Mitigate by cause. Network bottleneck, replica CPU, large transactions all have different fixes. Diagnose before scaling.
- Tune WAL settings plus documented SLO.
max_wal_sendersandwal_keep_sizematch scale; per-replica lag SLO documented.
Why this compounds
Each lag investigation teaches the team a little more about Postgres replication. Compounded across the year, replicas stay healthy and failovers preserve data.
- Data consistency. Healthy replicas serve fresh data. Stale-read incidents shrink.
- Safer failover. Low lag at failover means low data loss. RPO improves measurably.
- Resource planning. Lag investigation reveals capacity issues. The replica tier sizes correctly.
- Year-one investment, year-two habit. The first investigation is heavy lift. By year two, the team diagnoses lag in minutes.