The VPC Flow Logs Discipline
VPC flow logs are powerful and underused. This piece covers the discipline of capturing, storing, and querying them productively.
Capture
VPC flow logs are the network observability layer for AWS environments. Each log record captures source and destination addresses, source and destination ports, protocol, packet count, byte count, and action (ACCEPT or REJECT) for a network flow. The data is invaluable for security investigation, capacity planning, and cost analysis. The discipline starts with capturing everything; gaps in capture become gaps in visibility.
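To make the record structure concrete, here is a minimal parsing sketch. It assumes the default (version 2) space-separated format with its standard field order; `FlowRecord` and `parse_record` are hypothetical names chosen for illustration, and a custom log format would need a different field list.

```python
from dataclasses import dataclass
from typing import Optional

# Field order of the default (version 2) flow log format.
DEFAULT_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical container for the flow fields used in later analysis."""
    srcaddr: str
    dstaddr: str
    srcport: int
    dstport: int
    protocol: int
    packets: int
    bytes: int
    start: int
    end: int
    action: str


def parse_record(line: str) -> Optional[FlowRecord]:
    """Parse one default-format line; return None when it carries no flow data."""
    parts = line.split()
    if len(parts) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(parts)}")
    raw = dict(zip(DEFAULT_FIELDS, parts))
    # NODATA and SKIPDATA records have "-" in the flow fields, so skip them.
    if raw["log_status"] != "OK":
        return None
    return FlowRecord(
        srcaddr=raw["srcaddr"],
        dstaddr=raw["dstaddr"],
        srcport=int(raw["srcport"]),
        dstport=int(raw["dstport"]),
        protocol=int(raw["protocol"]),
        packets=int(raw["packets"]),
        bytes=int(raw["bytes"]),
        start=int(raw["start"]),
        end=int(raw["end"]),
        action=raw["action"],
    )


sample = ("2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
          "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK")
record = parse_record(sample)
```

The sample line describes accepted SSH traffic (TCP, destination port 22). Returning None for records without flow data keeps downstream aggregation from tripping over placeholder values.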
What good capture looks like:
- All VPCs, all subnets.: Flow logs are enabled on every VPC, which covers every subnet and network interface inside it. Selective capture (only some VPCs) produces blind spots that always seem to be the spots that matter during investigations.
- Per-flow records.: The default record structure captures per-flow metadata, including the start and end timestamps of each capture window. The granularity is the right balance: enough detail to investigate, not so much that storage costs are excessive. Custom record formats can add fields (for example TCP flags, flow direction, or traffic path) where the data justifies the cost.
- Cost: real but not large.: Flow logs cost real money: storage of the data, ingestion to wherever it is consumed, query costs. The total is typically a small fraction of the AWS bill; significantly less than the cost of one investigation that runs blind because flow logs were not captured.
- Skipping VPCs is false economy.: The temptation to skip flow logs on "low-importance" VPCs is real and produces predictable regret. The investigation that needs the data tends to land in the VPC where flow logs were not captured. The cost of universal coverage is small relative to this risk.
- Capture failures are alerts.: If flow log capture stops (delivery role failure, S3 bucket issues, ingestion problems), the team is alerted. Silent loss of capture is the worst failure mode; alerting prevents it.
Capture is the foundation. Without comprehensive capture, every other discipline that builds on flow logs has gaps.
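The coverage and alerting points above can be sketched as a small check. This is an illustration under stated assumptions, not a definitive implementation: `find_capture_gaps` is a hypothetical helper, and the dictionary keys (`ResourceId`, `FlowLogStatus`, `DeliverLogsStatus`) are assumed to mirror what the EC2 DescribeFlowLogs API returns, which should be verified against the API reference. It only considers flow logs attached at the VPC level.

```python
def find_capture_gaps(vpc_ids, flow_logs):
    """Return {vpc_id: reason} for VPCs without a healthy VPC-level flow log.

    vpc_ids:   iterable of VPC ID strings that should be covered.
    flow_logs: iterable of dicts shaped like DescribeFlowLogs entries
               (key names are an assumption; verify against the API).
    """
    healthy = set()
    failing = set()
    for flow_log in flow_logs:
        resource_id = flow_log.get("ResourceId")
        active = flow_log.get("FlowLogStatus") == "ACTIVE"
        delivering = flow_log.get("DeliverLogsStatus") == "SUCCESS"
        if active and delivering:
            healthy.add(resource_id)
        else:
            failing.add(resource_id)

    gaps = {}
    for vpc_id in vpc_ids:
        if vpc_id in healthy:
            continue
        # A failing flow log is worse than it looks: it appears configured
        # but silently delivers nothing, so report it separately.
        gaps[vpc_id] = "delivery failing" if vpc_id in failing else "no flow log"
    return gaps


example_gaps = find_capture_gaps(
    ["vpc-a", "vpc-b", "vpc-c"],
    [
        {"ResourceId": "vpc-a", "FlowLogStatus": "ACTIVE", "DeliverLogsStatus": "SUCCESS"},
        {"ResourceId": "vpc-b", "FlowLogStatus": "ACTIVE", "DeliverLogsStatus": "FAILED"},
    ],
)
```

Run on a schedule, a non-empty result is the alert condition: one entry per VPC that is either uncovered or covered by a flow log that is not delivering.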
Storage
The storage strategy determines how long data is queryable, at what cost, and through what tooling. A multi-tier approach matches the access pattern: recent data is hot for investigation; older data moves to cheaper tiers.
- Hot tier (7 days).: The most recent 7 days are stored in a queryable hot tier (Athena over S3, Elasticsearch, or similar). Incident response queries return in seconds; the team works directly against the hot data.
- Queryable for incident response.: The hot tier supports interactive queries. Investigators ask questions; results return fast enough to support iterative investigation. Without this responsiveness, investigation slows to a crawl.
- Warm tier (90 days).: The 7 to 90 day window is in a warm tier with somewhat slower queries. Trend analysis, periodic security reviews, and historical incident reconstruction use this tier. Queries return in minutes rather than seconds; the cost is significantly lower than hot.
- Cold tier (1 year).: The longer historical archive, out to 1 year, is in cold storage. It serves compliance retention and very old historical lookups. Queries are slow but possible; the storage cost is minimal. Most queries never reach this tier.
- Lifecycle automated.: The transition between tiers is automatic via S3 lifecycle policies or equivalent. Data moves from hot to warm to cold without manual intervention. The team configures once; the storage cost optimizes itself.
The tiering matches access patterns. Recent data is accessed often and warrants the cost; older data is accessed rarely and benefits from cheaper storage.
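The tier boundaries above reduce to a small age-based rule. The sketch below is illustrative only: `tier_for_age` and `age_in_days` are hypothetical helpers, the exact boundary handling (day 7 counts as warm, day 90 as cold) is an assumption, and the "expired" label for data older than a year is also an assumption since the text does not say what happens after the cold tier.

```python
from datetime import date

# Tier boundaries in days, taken from the tiers described above.
HOT_DAYS = 7
WARM_DAYS = 90
COLD_DAYS = 365


def age_in_days(log_date: date, today: date) -> int:
    """Whole days between the day a log was written and today."""
    return (today - log_date).days


def tier_for_age(age_days: int) -> str:
    """Map a log's age to the tier it should live in."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days < HOT_DAYS:
        return "hot"
    if age_days < WARM_DAYS:
        return "warm"
    if age_days < COLD_DAYS:
        return "cold"
    # Assumption: data past the cold window is eligible for deletion.
    return "expired"


example_tier = tier_for_age(age_in_days(date(2024, 1, 1), date(2024, 1, 8)))
```

In practice the same thresholds would be expressed as lifecycle rules on the storage side; a function like this is mainly useful for routing a query to the right tier or for auditing that objects sit where the policy says they should.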
Query patterns
The value of flow logs comes from the queries that run against them. Common patterns produce both routine operational visibility and security signal.
- Top sources by bytes.: Which workloads are sending the most data? The query identifies bandwidth hogs. New entries are sometimes legitimate (a new feature with heavy traffic) and sometimes anomalous (data exfiltration); the pattern is worth surfacing.
- Top destinations by connections.: Which destinations receive the most connection attempts? Frequent destinations are usually internal services and well-known external services. Unexpected destinations warrant investigation.
- Anomaly: traffic to or from new external IPs.: A workload that suddenly starts communicating with external IPs it has never used before is a strong signal. The signal could be a legitimate new integration or could be compromise. The investigation determines which.
- Sometimes a security event.: Some new external traffic patterns indicate compromise: command-and-control beacons, data exfiltration, connections to attacker-controlled infrastructure. Catching these patterns early is high-value security work.
- Cost analysis queries.: These queries measure egress traffic to specific destinations, traffic between regions, and traffic between services. The cost dimension is often the second-most-valuable use of flow logs after security.
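The patterns above can be sketched in a few lines over already-parsed records. This is a minimal illustration, assuming each record is a dict with `srcaddr`, `dstaddr`, and `bytes` keys; the function names are hypothetical, "internal" is defined by a caller-supplied list of VPC CIDR ranges, and counting records only approximates connection attempts because each record aggregates a flow over a capture window.

```python
from collections import Counter
from ipaddress import ip_address, ip_network


def top_sources_by_bytes(records, n=10):
    """Sum bytes per source address and return the n largest senders."""
    totals = Counter()
    for record in records:
        totals[record["srcaddr"]] += record["bytes"]
    return totals.most_common(n)


def top_destinations_by_connections(records, n=10):
    """Count flow records per destination address (a proxy for connections)."""
    return Counter(record["dstaddr"] for record in records).most_common(n)


def new_external_peers(baseline, current, internal_networks):
    """Return {workload_ip: {external_ip, ...}} for pairs absent from baseline."""

    def is_internal(addr):
        parsed = ip_address(addr)
        return any(parsed in network for network in internal_networks)

    def external_pairs(records):
        pairs = set()
        for record in records:
            src, dst = record["srcaddr"], record["dstaddr"]
            if is_internal(src) and not is_internal(dst):
                pairs.add((src, dst))   # outbound to an external peer
            elif is_internal(dst) and not is_internal(src):
                pairs.add((dst, src))   # inbound from an external peer
        return pairs

    fresh = {}
    for workload, peer in external_pairs(current) - external_pairs(baseline):
        fresh.setdefault(workload, set()).add(peer)
    return fresh


vpc_ranges = [ip_network("10.0.0.0/16")]
last_week = [{"srcaddr": "10.0.1.5", "dstaddr": "10.0.2.9", "bytes": 500}]
today = last_week + [
    {"srcaddr": "10.0.1.5", "dstaddr": "198.51.100.7", "bytes": 1500},
    {"srcaddr": "10.0.2.9", "dstaddr": "10.0.1.5", "bytes": 300},
    {"srcaddr": "10.0.3.3", "dstaddr": "10.0.2.9", "bytes": 100},
]
suspects = new_external_peers(last_week, today, vpc_ranges)
```

Here `suspects` flags the workload at 10.0.1.5 for talking to an external address it had not contacted in the baseline window. Whether that is a new integration or a compromise is still a question for the investigation; the query only surfaces it.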
VPC flow logs discipline is one of those AWS observability practices that pays off in proportion to the rigor applied. Nova AI Ops integrates with flow log feeds, surfaces routine and anomalous patterns, and produces the operational and security visibility that the cloud team uses across capacity planning, incident response, and cost reviews.