Reliability Engineering

Network problems are service problems, caught at the link layer first

Network Monitoring is the network-layer slice of observability: per-link bandwidth and packet loss, per-flow retransmits, per-VPC traffic splits, and per-DNS-zone failure rates. Network issues often manifest first as confusing service degradations; this page surfaces them as the network problems they are.

Get Started · Talk to Sales
  • 4 network signal types
  • eBPF or flow-log source
  • Auto-correlates with services
  • < 60s detection latency
Four Signal Types

Bandwidth, packet loss, DNS, NAT

Four primary signals: per-link bandwidth (and saturation), per-flow retransmits and packet loss, per-DNS-zone failure rates, and per-NAT-gateway port-allocation saturation. Each signal is correlated with the services that use it, so a network issue shows up as "payments-api is degraded because its NAT gateway is full." A minimal sketch of the signal model follows the list below.

  • Bandwidth + saturation: per-link and per-VPC bandwidth with saturation thresholds tied to alerts
  • Packet loss + retransmits: per-flow detection of TCP-level network distress before app errors appear
  • DNS failures: per-zone NXDOMAIN, SERVFAIL, slow-response detection
  • NAT saturation: per-NAT-gateway port allocation; saturation here breaks outbound everywhere
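
Here's a minimal sketch of how the four signal types and their alert thresholds might be modeled. Every name (NetworkSignal, SIGNAL_THRESHOLDS, evaluate) and every threshold value is an illustrative assumption, not the product's actual API or defaults:

```python
# Hypothetical sketch: four signal types, each checked against a
# saturation threshold and correlated to the services on its scope.
from dataclasses import dataclass

# Illustrative thresholds, normalized to 0..1; not real product defaults.
SIGNAL_THRESHOLDS = {
    "bandwidth_utilization": 0.90,  # fraction of link capacity in use
    "packet_loss":           0.01,  # 1% loss on a flow
    "dns_failure_rate":      0.05,  # 5% of a zone's queries failing
    "nat_port_utilization":  0.85,  # fraction of NAT ports allocated
}

@dataclass
class NetworkSignal:
    kind: str            # one of the four types above
    scope: str           # e.g. "link:us-east-1a/peer" or "natgw:prod-egress-1"
    value: float         # current measurement, normalized to 0..1
    services: list[str]  # services correlated with this scope

def evaluate(signal: NetworkSignal) -> str | None:
    """Return an alert line naming the affected services, or None."""
    if signal.value < SIGNAL_THRESHOLDS[signal.kind]:
        return None
    impacted = ", ".join(signal.services) or "no mapped services"
    return (f"{signal.kind} at {signal.value:.0%} on {signal.scope}; "
            f"affects: {impacted}")

# A saturated NAT gateway surfaces as a payments-api problem:
print(evaluate(NetworkSignal("nat_port_utilization", "natgw:prod-egress-1",
                             0.97, ["payments-api", "billing-worker"])))
```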
Service Correlation

Network signals tied to services

Every network signal is automatically tied to the services it affects. A packet-loss spike on a peer link surfaces on the affected services' incident pages as "network: 2.4% loss on this peer." The agent fleet sees the network signal alongside service signals, so runbooks consider both. A sketch of the mapping follows the list below.

  • Tied to services: network signals appear on service-specific views, not just on a generic network page
  • Visible in incidents: incident pages show network signals if relevant; agents read both layers
  • Cross-signal correlation: feeds into Cross-Signal Correlation as a first-class signal type
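
A minimal sketch of that mapping, assuming a scope-to-services topology table; the TOPOLOGY map and annotate_services helper are hypothetical names, not the product's API:

```python
# Hypothetical sketch: fan a network annotation out to every service
# that uses the affected network scope.

# Which services use which network scope (in practice, derived from flows).
TOPOLOGY = {
    "peer:us-east-1<->us-west-2": ["payments-api", "ledger"],
    "natgw:prod-egress-1": ["payments-api"],
}

def annotate_services(incidents: dict[str, list[str]],
                      scope: str, annotation: str) -> None:
    """Surface a network annotation on each affected service's incident view."""
    for service in TOPOLOGY.get(scope, []):
        incidents.setdefault(service, []).append(annotation)

incidents: dict[str, list[str]] = {}
annotate_services(incidents, "peer:us-east-1<->us-west-2",
                  "network: 2.4% loss on this peer")
print(incidents)
# {'payments-api': ['network: 2.4% loss on this peer'],
#  'ledger': ['network: 2.4% loss on this peer']}
```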
eBPF or Flow Logs

Two source paths, same data

There are two source options. eBPF: a kernel probe on each host captures every flow, with the best fidelity. Flow logs: AWS / GCP / Azure flow-log ingestion, less precise but workable without host agents. Pick one or both; when both are present, reconciliation catches gaps in either (a sketch follows the list below).

  • eBPF: kernel-level capture; full fidelity; requires host agent (already deployed)
  • Flow logs: cloud-native; no host agent; coarser granularity
  • Reconciled when both: gaps in one source surface as missing edges; catches collection failures
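
A minimal sketch of the reconciliation idea, reducing each flow to a (src, dst, port) edge; the reconcile function and edge shape are illustrative assumptions, not the actual pipeline:

```python
# Hypothetical sketch: edges seen by one source but not the other
# indicate gaps: a missing flow-log export or a broken host probe.

def reconcile(ebpf_edges: set[tuple], flowlog_edges: set[tuple]) -> dict:
    """Diff the two views of the network; either difference is a gap."""
    return {
        "missing_from_flow_logs": ebpf_edges - flowlog_edges,
        "missing_from_ebpf": flowlog_edges - ebpf_edges,
    }

ebpf = {("payments-api", "10.0.3.7", 5432), ("payments-api", "10.0.9.2", 443)}
logs = {("payments-api", "10.0.3.7", 5432)}

gaps = reconcile(ebpf, logs)
print(gaps["missing_from_flow_logs"])  # the edge flow logs never exported
```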
DNS-Specific View

A whole subtab for DNS

DNS gets its own subtab because DNS issues are uniquely painful. It tracks per-zone failure rates, per-resolver latency, and recent NXDOMAIN and SERVFAIL spikes. When DNS misbehaves, this view tells you which zone, which resolver, and which downstream service is feeling it. A sketch of the spike detection follows the list below.

  • Per-zone failure rate: baseline and spikes per DNS zone; sub-minute resolution
  • Per-resolver latency: tracking p50/p95/p99 per resolver; useful for split-horizon issues
  • Downstream impact: each DNS issue lists the services that depend on the affected zone
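
A minimal sketch of per-zone spike detection against a rolling baseline; the window length and the 3x spike factor are illustrative assumptions, not the product's tuning:

```python
# Hypothetical sketch: track one failure-rate sample per interval per
# zone and flag intervals well above the rolling baseline.
from collections import deque

class ZoneFailureTracker:
    def __init__(self, window: int = 60, spike_factor: float = 3.0):
        self.samples: deque[float] = deque(maxlen=window)  # rolling baseline
        self.spike_factor = spike_factor

    def observe(self, failed: int, total: int) -> bool:
        """Record one interval; return True if it spikes above baseline."""
        rate = failed / total if total else 0.0
        baseline = sum(self.samples) / len(self.samples) if self.samples else 0.0
        self.samples.append(rate)
        # Require an established baseline so a zone's first samples don't alert.
        return baseline > 0 and rate > self.spike_factor * baseline

tracker = ZoneFailureTracker()
for failed in [2, 3, 2, 3, 2]:                 # steady NXDOMAIN background
    tracker.observe(failed, total=1000)
print(tracker.observe(failed=40, total=1000))  # True: a SERVFAIL spike
```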
Video walkthrough coming soon

Subscribe to Nova AI Ops on YouTube for demos, tutorials, and feature deep-dives.

Catch the network before the service

Network monitoring stops the "we spent two hours debugging the app, it was DNS" pattern.

Get Started · Request a Demo