The Trace Storage Tier Strategy
Traces are bulky. Hot, warm, and cold tiers cut cost without losing debugging value. This article covers the transitions between tiers and the queries each tier serves.
Hot: 24 hours
Trace storage tier strategy is the discipline of keeping recent traces fast-access and progressively moving older traces to cheaper tiers. Without tiering, all traces stay at the same cost forever; with tiering, the cost matches the access pattern. The strategy follows the principle that recent data is queried often and old data rarely.
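As a minimal sketch of how trace age maps to a tier, assuming the 24-hour and 7-day boundaries used in this article. The function and constant names are illustrative and do not come from any particular storage backend.

```python
from datetime import datetime, timedelta, timezone

# Boundaries assumed from this article: hot < 24h, warm 24h to 7d, cold 7d+.
HOT_WINDOW = timedelta(hours=24)
WARM_WINDOW = timedelta(days=7)


def tier_for(trace_end: datetime, now: datetime) -> str:
    """Return the tier a trace belongs to, based only on its age."""
    age = now - trace_end
    if age < HOT_WINDOW:
        return "hot"
    if age < WARM_WINDOW:
        return "warm"
    return "cold"


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
    for age in (timedelta(hours=3), timedelta(days=3), timedelta(days=90)):
        print(age, "->", tier_for(now - age, now))
```

A real backend would run the transition as a scheduled job or lifecycle policy instead of classifying on read. The age comparison is the part that carries over.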
What the hot tier provides:
- Hot: 24 hours. The most recent 24 hours of traces are in the hot tier. Every trace, full detail, indexed for fast query. The hot tier is the working set for the team's live operations.
- Full retention. No sampling, no aggregation, no truncation. Every trace produced in the last 24 hours is queryable in full. The data supports any investigation the team needs.
- Fast queries. The query latency is sub-second to a few seconds. Engineers iterate quickly; the investigation flow is preserved; time-to-understanding is short.
- Used for live debugging and incident response. Active incidents query the hot tier. The just-happened-now investigation hits the hot tier. The fast access matches the urgency of these queries.
- Most expensive tier. The hot tier costs the most per byte stored. Storage is fast; indexes are comprehensive; the operational characteristics are premium.
- Right-size to the team's active debug window. The 24-hour window is conventional; some teams use 12 hours; others use 48 hours. The right size matches when the team typically reaches for trace data; trim aggressively to reduce cost.
The hot tier is for the queries that matter most. Optimize it for the workflow; pay for what you use.
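One way to right-size the hot window is to look at how old the traces are when engineers actually query them. The sketch below assumes a hypothetical query log that records, for each query, the age in hours of the trace it fetched. It picks the smallest candidate window that covers a target share of those queries. The candidate windows (12, 24, 48 hours) come from the text above; the 95% default target is an assumption.

```python
def smallest_window_covering(query_ages_hours, candidates_hours=(12, 24, 48), target=0.95):
    """Return the smallest candidate window (in hours) that covers at least
    `target` of the observed queries, or None if no candidate does."""
    if not query_ages_hours:
        return None
    total = len(query_ages_hours)
    for window in sorted(candidates_hours):
        covered = sum(1 for age in query_ages_hours if age < window)
        if covered / total >= target:
            return window
    return None


if __name__ == "__main__":
    # Hypothetical sample: age of the queried trace, in hours, per query.
    ages = [0.5, 1, 2, 2, 3, 5, 8, 10, 11, 30]
    print(smallest_window_covering(ages, target=0.90))
    print(smallest_window_covering(ages, target=0.95))
```

If no candidate reaches the target, the function returns None. In that case the access pattern probably needs a closer look before the window is changed.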
Warm: 24h-7d
The warm tier holds traces from 24 hours to 7 days old. The access pattern is less frequent; the cost can be lower. Sampling reduces volume without losing the high-value traces.
- Warm: 24 hours to 7 days. Traces in this age range are accessed for postmortems, weekly reviews, and follow-up investigations. The access frequency is much lower than for hot data; the cost can be lower.
- Sampled retention. Not every trace is preserved. Errors and slow traces are kept (the high-value ones); healthy traces are sampled. The sampling reduces volume by 80 to 95% while preserving the traces that matter most.
- The error and slow traces stay. Postmortem investigation primarily looks at errors and slow traces. The sampling preserves these; the investigation has the data it needs.
- Healthy ones get pruned. Healthy traces are pruned to a representative sample. Statistical analysis still works; per-trace lookup of healthy traces is rarely useful at this age.
- Useful for postmortems within the week. The warm tier covers the typical postmortem horizon. An incident from a few hours ago is hot; one from 3 days ago is warm; both are accessible.
- 90% cheaper than hot. The cost reduction is dramatic. The combination of sampling (less volume) and slower storage (cheaper per byte) produces order-of-magnitude savings.
The warm tier balances access frequency with cost. The team can still find what they need; the cost is much lower than hot.
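The warm-tier retention rule described above (keep errors and slow traces, sample the healthy ones) can be sketched as a single predicate. The `TraceSummary` type, the 1000 ms slow threshold, and the 10% keep rate are assumptions for illustration; the keep rate is chosen to fall inside the 80 to 95% volume reduction mentioned above.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:  # hypothetical shape of a trace's metadata
    trace_id: str
    duration_ms: float
    has_error: bool


SLOW_THRESHOLD_MS = 1000.0  # assumption: what counts as "slow"
HEALTHY_KEEP_RATE = 0.10    # assumption: keep 10% of healthy traces


def keep_in_warm(trace, slow_ms=SLOW_THRESHOLD_MS, keep_rate=HEALTHY_KEEP_RATE):
    """Decide whether a trace survives the hot-to-warm transition."""
    # Errors and slow traces are always kept.
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    # Healthy traces: deterministic sampling on a hash of the trace ID,
    # so the same trace always gets the same decision.
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(keep_rate * 2**64)


if __name__ == "__main__":
    healthy = [TraceSummary(f"trace-{i}", 120.0, False) for i in range(10_000)]
    kept = sum(keep_in_warm(t) for t in healthy)
    print(f"kept {kept} of {len(healthy)} healthy traces")
```

Hashing the trace ID rather than drawing a random number keeps the decision repeatable across workers and reruns of the transition job.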
Cold: 7d+
The cold tier is for very old traces. Per-trace lookup is rarely needed; aggregates and exemplars cover the use cases that remain. The cost drops further.
- Cold: 7 days plus. Traces older than 7 days are in the cold tier. The access pattern is occasional: compliance, long-term trend analysis, very old incident retrospectives.
- Aggregated only. Apart from exemplars, individual traces are not preserved. What is kept are aggregates: total trace counts per service per day, error rate trends, and latency percentile trends. The aggregates support the queries that the cold tier serves.
- Per-trace not retrievable. Looking up an arbitrary trace from 3 months ago is not possible. The team's expectations match this; requests for old per-trace lookups are extremely rare.
- Aggregates and exemplars are retrievable. Daily aggregates and exemplar traces (one representative trace per day per service) are preserved. The exemplars cover the rare cases where a concrete old trace is needed.
- Compliance and trend analysis. The cold tier serves compliance retention requirements and long-term trend dashboards. Aggregates are enough for both use cases; per-trace detail is not needed.
- Sampling artifacts are fine at this level. The aggregates lose some detail to sampling; that loss is acceptable for the queries the cold tier serves. At this age, the cost savings outweigh the lost detail.
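A minimal sketch of the cold-tier rollup described above: one aggregate row per service per day, plus one exemplar trace ID. The `TraceRecord` shape, the nearest-rank percentile method, and the choice of "closest to the median duration" as the exemplar are all assumptions; the text only says the exemplar is representative.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class TraceRecord:  # hypothetical shape of a stored trace's metadata
    trace_id: str
    service: str
    day: date
    duration_ms: float
    has_error: bool


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of a non-empty, ascending list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def rollup(traces):
    """Aggregate traces into one summary per (service, day)."""
    groups = defaultdict(list)
    for t in traces:
        groups[(t.service, t.day)].append(t)

    summaries = {}
    for key, items in groups.items():
        durations = sorted(t.duration_ms for t in items)
        p50 = nearest_rank(durations, 50)
        # Assumed definition of "representative": closest to the median.
        exemplar = min(items, key=lambda t: abs(t.duration_ms - p50))
        summaries[key] = {
            "trace_count": len(items),
            "error_rate": sum(t.has_error for t in items) / len(items),
            "p50_ms": p50,
            "p95_ms": nearest_rank(durations, 95),
            "p99_ms": nearest_rank(durations, 99),
            "exemplar_trace_id": exemplar.trace_id,
        }
    return summaries


if __name__ == "__main__":
    d = date(2024, 1, 1)
    sample = [
        TraceRecord("t1", "checkout", d, 100.0, False),
        TraceRecord("t2", "checkout", d, 200.0, False),
        TraceRecord("t3", "checkout", d, 300.0, True),
        TraceRecord("t4", "checkout", d, 400.0, False),
    ]
    print(rollup(sample)[("checkout", d)])
```

If the rollup reads from the sampled warm tier, its counts and error rates describe the sample and not the full traffic. That is the sampling artifact the last bullet accepts.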
Trace storage tiering is an observability cost optimization whose savings grow with trace volume. Nova AI Ops integrates with trace storage backends, surfaces tier transition patterns, and helps teams calibrate their tier sizes to match actual access patterns.