Database Monitoring: The Five Numbers That Matter

Monitor too few and you miss issues; too many and they drown in noise. Five is the right number.

Why five

One metric misses cases. Twenty metrics drown signal. Five is the sweet spot that catches most production database problems without overwhelming the dashboard.

Single metric. Misses entire failure classes; never enough on its own.
Twenty metrics. Drown signal; on-call ignores most; the discipline collapses.
Five metrics. Cover roughly 90% of operational problems; the cost-benefit sweet spot.
Per-engine variants. The five differ slightly between Postgres, MySQL, MongoDB; the principle holds.

The five metrics

1. Connection count vs max.
2. Slow query rate.
3. Replication lag.
4. Lock wait time.
5. Disk usage growth.

Dashboard pattern

One dashboard per database, structured the same way every time. Standardisation lets engineers parse the page in seconds.

One panel per metric. Five panels, same order on every dashboard; predictable layout.
Trend over 24 hours. Default time range; longer windows surface drift, shorter windows surface incidents.
Threshold lines. Visible on each panel so operators see distance from the alarm threshold at a glance.
Bookmarked URL. Single canonical URL; on-call's first move on a database alert is to open it.

Alert thresholds

Each metric needs an explicit threshold. Without one, the metric is a chart; with one, it is an alert that fires before the user notices.

Connection count. Alert at 80% of max_connections; gives time to investigate before the cap.
Slow query rate. Alert if doubled versus a 7-day baseline; absolute numbers are noisy.
Replication lag. Alert at 30 seconds; failover safety depends on small lag.
Lock wait + disk growth. Lock wait p99 above 5 seconds; disk-projection above 80% at 7-day horizon.

Antipatterns

Monitoring without baseline. Cannot detect anomaly.
Twenty metrics. Drown signal.
No alerts on five. Discovery via incident.

What to do this week

Three moves. (1) Apply this pattern to your most-loaded table. (2) Measure query latency / write throughput before/after. (3) Document the win and the constraint so the next refactor inherits the knowledge.