Network Monitoring: The Five Numbers

Network monitoring is often left to the network team alone. SREs benefit when these five numbers are visible to them as well.

Why five

Network failures rarely look like network failures from the application's point of view. Five wire-level metrics catch most of what app metrics miss.

App metrics blur. A latency spike reads as 'something is slow' and does not say whether the app, DNS, or TCP is responsible.
Wire visibility. Packet loss, RTT, and connection errors expose the layer the app cannot see.
Five is enough. A small fixed set keeps the dashboard readable and the runbook short.
Owner. The network team owns the metrics; SREs read them on dashboards instead of paging the network team to ask.
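
To make the "app, DNS, or TCP" distinction concrete, here is a minimal sketch that times the phases of one outbound connection separately. It uses only the Python standard library. The names PhaseTimings, probe, and slowest_phase are hypothetical and chosen for illustration; this is one possible probe, not a prescribed implementation.

```python
import socket
import ssl
import time
from dataclasses import dataclass


@dataclass
class PhaseTimings:
    """Milliseconds spent in each connection phase (hypothetical container)."""
    dns_ms: float
    tcp_ms: float
    tls_ms: float

    def slowest_phase(self) -> str:
        """Return 'dns', 'tcp', or 'tls', whichever took longest."""
        phases = {"dns": self.dns_ms, "tcp": self.tcp_ms, "tls": self.tls_ms}
        return max(phases, key=phases.get)


def probe(host: str, port: int = 443, timeout: float = 5.0) -> PhaseTimings:
    """Time DNS resolution, TCP connect, and TLS handshake one after another."""
    dns_start = time.perf_counter()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    dns_ms = (time.perf_counter() - dns_start) * 1000

    # Assumption: use the first resolved address only.
    family, socktype, proto, _canon, sockaddr = infos[0]
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        tcp_start = time.perf_counter()
        sock.connect(sockaddr)
        tcp_ms = (time.perf_counter() - tcp_start) * 1000

        context = ssl.create_default_context()
        tls_start = time.perf_counter()
        # On an already-connected socket, wrap_socket completes the
        # handshake before it returns.
        with context.wrap_socket(sock, server_hostname=host):
            tls_ms = (time.perf_counter() - tls_start) * 1000
    finally:
        sock.close()

    return PhaseTimings(dns_ms=dns_ms, tcp_ms=tcp_ms, tls_ms=tls_ms)
```

Calling probe("example.com").slowest_phase() names the phase that dominated a single connection. The DNS figure may reflect a local resolver cache, so a probe like this is better at spotting regressions than at measuring cold lookups.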

The five metrics

1. Packet loss rate.
2. Latency p99 to dependencies.
3. Connection-error rate.
4. DNS resolution time p99.
5. TLS handshake time p99.
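
As a rough illustration of how the five numbers could be derived from raw samples, the sketch below uses the nearest-rank method for p99 and simple ratios for the two rates. The input shape (counters plus lists of millisecond samples) and the names p99, rate, FiveNumbers, and summarize are assumptions made for this example; a real collector will expose its own format.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def p99(samples: Sequence[float]) -> float:
    """99th percentile by the nearest-rank method."""
    if not samples:
        raise ValueError("p99 needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def rate(events: int, total: int) -> float:
    """Fraction of events over total; 0.0 when nothing was observed."""
    return events / total if total else 0.0


@dataclass
class FiveNumbers:
    packet_loss_rate: float
    latency_p99_ms: float
    connection_error_rate: float
    dns_p99_ms: float
    tls_p99_ms: float


def summarize(
    packets_sent: int,
    packets_lost: int,
    connections_attempted: int,
    connections_failed: int,
    latency_ms: Sequence[float],
    dns_ms: Sequence[float],
    tls_ms: Sequence[float],
) -> FiveNumbers:
    """Collapse one window of raw observations into the five metrics."""
    return FiveNumbers(
        packet_loss_rate=rate(packets_lost, packets_sent),
        latency_p99_ms=p99(latency_ms),
        connection_error_rate=rate(connections_failed, connections_attempted),
        dns_p99_ms=p99(dns_ms),
        tls_p99_ms=p99(tls_ms),
    )
```

Nearest-rank always returns an observed sample, which keeps the figure easy to explain during an incident. Monitoring backends often interpolate or use histograms instead, so their p99 can differ slightly from this one.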

Dashboard pattern

One dashboard per service tells the network story at a glance. Five panels for the five metrics, plus drill-downs for the per-dependency view.

Five panels. One per metric, trend over 24 hours; the dashboard is readable in 10 seconds.
Per-dependency drill-down. Latency, connection errors, DNS time, and TLS time broken out by destination service or external dependency.
Annotations. Deploy markers on the time axis; a shift in any network metric that lines up with a deploy becomes obvious.
Linked from runbook. Every related alert links to this dashboard; on-call lands here automatically.
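
The layout described above can be written down as data so that every service gets the same dashboard. The sketch below builds a plain dictionary. The schema (keys such as panels, drill_downs, and annotations) is hypothetical and not tied to any particular dashboard tool, so it would need translating into whatever format your tooling expects.

```python
from typing import Dict, List

FIVE_METRICS = [
    "packet_loss_rate",
    "latency_p99_ms",
    "connection_error_rate",
    "dns_p99_ms",
    "tls_p99_ms",
]

# The drill-down covers latency, errors, DNS, and TLS per destination.
DRILL_DOWN_METRICS = FIVE_METRICS[1:]


def build_dashboard(service: str, dependencies: List[str]) -> Dict:
    """Return a tool-agnostic description of one service's network dashboard."""
    panels = [{"metric": metric, "window_hours": 24} for metric in FIVE_METRICS]
    drill_downs = [
        {"dependency": dep, "metrics": list(DRILL_DOWN_METRICS)}
        for dep in dependencies
    ]
    return {
        "title": f"{service} network",
        "panels": panels,
        "drill_downs": drill_downs,
        "annotations": ["deploy"],  # deploy markers on the time axis
    }
```

For example, build_dashboard("checkout", ["payments-api", "auth"]) yields five panels and two drill-down entries; the service and dependency names here are made up.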

Alert thresholds

Thresholds depend on baseline. Hard-coded numbers are starting points; tune them once you have a week of data.

Packet loss. Alert above 0.1% sustained for 5 minutes; the sustained window filters out bursty loss, which often clears on its own.
Latency p99. Alert on 50% increase from rolling-week baseline; absolute thresholds are too noisy.
Connection errors. Alert on 2x baseline; this is usually the earliest sign of network or backend trouble.
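DNS p99 / TLS p99. Alert above 100 ms for DNS and 200 ms for TLS; either signals a slow upstream, such as a struggling resolver or an overloaded TLS endpoint. An expiring cert does not slow the handshake, so watch certificate expiry separately.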
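
The starting thresholds above translate into a handful of small checks. This is a sketch under stated assumptions: packet loss arrives as one sample per minute, baselines are computed elsewhere (for example from the rolling week), and all function names are hypothetical. The zero-baseline guard on connection errors is an added assumption, not something the thresholds themselves specify.

```python
from typing import Sequence

PACKET_LOSS_LIMIT = 0.001        # 0.1%
SUSTAINED_MINUTES = 5
LATENCY_INCREASE_FACTOR = 1.5    # 50% above the rolling-week baseline
CONNECTION_ERROR_FACTOR = 2.0    # 2x baseline
DNS_P99_LIMIT_MS = 100.0
TLS_P99_LIMIT_MS = 200.0


def packet_loss_alert(per_minute_loss: Sequence[float]) -> bool:
    """True when each of the last five one-minute samples exceeds 0.1%."""
    recent = list(per_minute_loss)[-SUSTAINED_MINUTES:]
    return len(recent) == SUSTAINED_MINUTES and all(
        loss > PACKET_LOSS_LIMIT for loss in recent
    )


def latency_alert(current_p99_ms: float, baseline_p99_ms: float) -> bool:
    """True when p99 latency is at least 50% above its baseline."""
    return current_p99_ms >= baseline_p99_ms * LATENCY_INCREASE_FACTOR


def connection_error_alert(current_rate: float, baseline_rate: float) -> bool:
    """True when the error rate is at least twice its baseline."""
    # Assumption: with a zero baseline, any nonzero error rate alerts.
    return current_rate > 0 and current_rate >= baseline_rate * CONNECTION_ERROR_FACTOR


def dns_alert(dns_p99_ms: float) -> bool:
    """True when DNS resolution p99 is above 100 ms."""
    return dns_p99_ms > DNS_P99_LIMIT_MS


def tls_alert(tls_p99_ms: float) -> bool:
    """True when TLS handshake p99 is above 200 ms."""
    return tls_p99_ms > TLS_P99_LIMIT_MS
```

Keeping the limits as named constants makes the later tuning step a one-line change per metric once a week of baseline data is available.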

Antipatterns

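App-only monitoring. Misses root causes that sit in the network.
One global metric. An aggregate hides a problem confined to a single dependency.
Threshold without baseline. The alert fires too often or too rarely because the number was never compared with normal traffic.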
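
A short worked example shows how one global metric can hide a per-dependency problem. All traffic figures below are invented for illustration: ten dependencies, one of which loses 0.5% of its packets while the rest lose 0.01%.

```python
LOSS_LIMIT = 0.001  # the 0.1% packet-loss threshold

# (packets_sent, packets_lost) per dependency; every figure is made up.
traffic = {f"dep-{i}": (100_000, 10) for i in range(1, 10)}
traffic["dep-10"] = (100_000, 500)


def loss_rate(sent: int, lost: int) -> float:
    """Fraction of packets lost; 0.0 when nothing was sent."""
    return lost / sent if sent else 0.0


total_sent = sum(sent for sent, _ in traffic.values())
total_lost = sum(lost for _, lost in traffic.values())
global_rate = loss_rate(total_sent, total_lost)

# Dependencies whose own loss rate is over the threshold.
breaching = [
    name for name, (sent, lost) in traffic.items()
    if loss_rate(sent, lost) > LOSS_LIMIT
]

print(f"global loss: {global_rate:.4%}, over limit: {global_rate > LOSS_LIMIT}")
print(f"dependencies over limit: {breaching}")
```

The global rate works out to 590 lost out of 1,000,000 sent, or 0.059%, which stays under the 0.1% threshold even though dep-10 sits at 0.5%. Only the per-dependency view surfaces it.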

What to do this week

Three moves. (1) Apply this pattern to your highest-risk network path. (2) Measure how often that path fails or alerts before and after. (3) Document the change so the next incident responder inherits the knowledge.