Network Monitoring: The Five Numbers
Network metrics are often visible only to the network team. SREs benefit when these five numbers appear on their own dashboards.
Why five
Network failures rarely look like network failures from the application's point of view. Five wire-level metrics catch most of what app metrics miss.
- App metrics blur. Latency spikes look like 'something is slow' without distinguishing app, DNS, or TCP causes.
- Wire visibility. Packet loss, RTT, connection errors expose the layer the app cannot see.
- Five is enough. A small fixed set keeps the dashboard readable and the runbook short.
- Owner. Network team owns the metrics; SREs consume them on dashboards, not by paging the network team.
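To make the "app metrics blur" point concrete, here is a minimal sketch that splits one outbound connection into its DNS, TCP, and TLS phases using only the Python standard library. The function name `time_phases` and the returned key names are illustrative assumptions, not part of any existing tool. A production probe would also need retries, error counting, and awareness of resolver caches.

```python
import socket
import ssl
import time


def time_phases(host, port=443, use_tls=True, timeout=5.0):
    """Return seconds spent in DNS resolution, TCP connect, and TLS handshake.

    Hypothetical helper for illustration; key names are assumptions.
    """
    timings = {}

    # Phase 1: DNS resolution (may be answered from a local cache).
    start = time.perf_counter()
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    timings["dns_s"] = time.perf_counter() - start

    # Phase 2: TCP connect to the first resolved address.
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        start = time.perf_counter()
        sock.connect(sockaddr)
        timings["tcp_connect_s"] = time.perf_counter() - start

        # Phase 3: TLS handshake on the already-connected socket.
        if use_tls:
            context = ssl.create_default_context()
            start = time.perf_counter()
            sock = context.wrap_socket(sock, server_hostname=host)
            timings["tls_handshake_s"] = time.perf_counter() - start
    finally:
        sock.close()
    return timings
```

An application-level timer would report only the sum of these phases. Recording them separately is what lets a dashboard distinguish a slow resolver from a slow handshake.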
The five metrics
- 1. Packet loss rate.
- 2. Latency p99 to dependencies.
- 3. Connection-error rate.
- 4. DNS resolution time p99.
- 5. TLS handshake time p99.
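As a rough illustration of how the five numbers could be derived from one window of raw samples, the sketch below uses a nearest-rank p99 and simple ratios. The input field names (`packets_sent`, `latency_ms`, and so on) are assumptions made for this example. Real collectors usually export histograms or counters rather than raw sample lists.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rate(events, total):
    """Fraction of events over total; 0.0 when nothing was counted."""
    return events / total if total else 0.0


def five_numbers(window):
    """Summarize one measurement window (hypothetical field names)."""
    return {
        "packet_loss_rate": rate(window["packets_lost"], window["packets_sent"]),
        "latency_p99_ms": percentile(window["latency_ms"], 99),
        "connection_error_rate": rate(
            window["connect_errors"], window["connect_attempts"]
        ),
        "dns_p99_ms": percentile(window["dns_ms"], 99),
        "tls_p99_ms": percentile(window["tls_ms"], 99),
    }
```

Nearest-rank is chosen here because it always returns an observed sample. Interpolating percentile definitions give slightly different values on small windows.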
Dashboard pattern
One dashboard per service tells the network story at a glance. Five panels for the five metrics, plus drill-downs for the per-dependency view.
- Five panels. One per metric, trend over 24 hours; the dashboard is readable in 10 seconds.
- Per-dependency drill-down. Latency, errors, DNS, TLS broken out by destination service or external dependency.
- Annotations. Deploy and network-change markers on the time axis; correlation between a change and a metric shift becomes obvious.
- Linked from runbook. Every related alert links to this dashboard; on-call lands here automatically.
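The layout above can be sketched as a tool-neutral data structure. The schema below (keys such as `panels`, `drilldown_by`, `annotations`) is purely hypothetical and does not match any particular dashboard product's format. It would need translating into whatever your dashboard tool accepts.

```python
# Metric key, panel title, and whether a per-dependency drill-down applies.
# Packet loss has no drill-down here because the per-dependency view covers
# latency, errors, DNS, and TLS only.
PANELS = [
    ("packet_loss_rate", "Packet loss rate", False),
    ("latency_p99_ms", "Latency p99 to dependencies", True),
    ("connection_error_rate", "Connection-error rate", True),
    ("dns_p99_ms", "DNS resolution time p99", True),
    ("tls_p99_ms", "TLS handshake time p99", True),
]


def build_dashboard(service, dependencies):
    """Return a hypothetical dashboard spec for one service."""
    return {
        "title": f"{service}: network",
        "time_range_hours": 24,        # trend over 24 hours
        "annotations": ["deploys"],    # deploy markers on the time axis
        "panels": [
            {
                "metric": metric,
                "title": title,
                "drilldown_by": "dependency" if drilldown else None,
            }
            for metric, title, drilldown in PANELS
        ],
        "dependencies": sorted(dependencies),
    }
```

Generating the spec from one function per service keeps every dashboard identical in shape, which is what makes the ten-second read possible.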
Alert thresholds
Thresholds depend on baseline. Hard-coded numbers are starting points; tune them once you have a week of data.
- Packet loss. Alert above 0.1% sustained for 5 minutes; the sustain window filters out bursty loss, which often clears on its own.
- Latency p99. Alert on a 50% increase over the rolling one-week baseline; absolute thresholds are too noisy for latency.
- Connection errors. Alert on 2x baseline; this is usually the earliest sign of network or backend trouble.
- DNS p99 / TLS p99. Alert above 100 ms and 200 ms respectively; either usually signals a slow upstream, such as an overloaded resolver or TLS endpoint.
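The four rules above can be sketched as a single evaluation function. This is a minimal illustration under stated assumptions: packet loss arrives as one sample per minute, the baseline is a precomputed rolling-week value, and the dictionary keys are hypothetical names chosen for this example. The numeric thresholds are the starting points listed above, to be tuned against your own data.

```python
def sustained_above(samples, threshold, min_points):
    """True if the most recent min_points samples all exceed threshold."""
    recent = samples[-min_points:]
    return len(recent) >= min_points and all(s > threshold for s in recent)


def evaluate(current, baseline, loss_per_minute):
    """Return the names of the alerts that should fire (hypothetical keys)."""
    alerts = []
    # Packet loss: above 0.1% for 5 consecutive one-minute samples.
    if sustained_above(loss_per_minute, 0.001, 5):
        alerts.append("packet_loss")
    # Latency p99: more than 50% above the rolling-week baseline.
    if current["latency_p99_ms"] > 1.5 * baseline["latency_p99_ms"]:
        alerts.append("latency_p99")
    # Connection errors: more than 2x baseline. With a zero baseline,
    # any error at all fires, which may be too sensitive.
    if current["connection_error_rate"] > 2 * baseline["connection_error_rate"]:
        alerts.append("connection_errors")
    # DNS and TLS p99: absolute starting points of 100 ms and 200 ms.
    if current["dns_p99_ms"] > 100:
        alerts.append("dns_p99")
    if current["tls_p99_ms"] > 200:
        alerts.append("tls_p99")
    return alerts
```

For example, with a baseline latency p99 of 40 ms, a current latency p99 of 70 ms crosses the 60 ms line and fires, while a connection-error rate of 1.5% against a 1% baseline does not.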
Antipatterns
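- App-only monitoring. Misses root causes that sit in the network layer.
- One global metric. Hides per-dependency issues behind an aggregate.
- Threshold without baseline. Alerts fire too often or too rarely because the number is not tied to normal behavior.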
What to do this week
Three moves. (1) Apply this pattern to your highest-risk network path. (2) Measure how often the targeted failure mode occurs before and after the change. (3) Document the change so the next incident responder inherits the knowledge.