AI & ML · Advanced · By Samson Tanimawo, PhD · Published May 26, 2026 · 5 min read

Long Context Windows: 1M+ Tokens

A million tokens fits an entire codebase or a quarter's worth of meeting notes. The capability is real; the cost and recall tradeoffs are subtler.

How long context got cheap

Three innovations stacked. FlashAttention cut attention's memory footprint from quadratic to linear in sequence length. Grouped-query attention (GQA) shrank the KV cache that dominates memory at decode time. RoPE scaling methods (position interpolation and its variants) extended rotary position encodings to longer sequences with little or no retraining.
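To see why the GQA point matters at 1M tokens, it helps to put numbers on the KV cache. The sketch below uses illustrative hyperparameters for a large model (80 layers, head dimension 128, fp16 cache); these are assumptions for the arithmetic, not figures from any specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2 for keys + values; fp16 = 2 bytes per element.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 70B-class config: 80 layers, head_dim 128.
mha = kv_cache_bytes(1_000_000, 80, 64, 128)  # 64 KV heads (full multi-head)
gqa = kv_cache_bytes(1_000_000, 80, 8, 128)   # 8 KV heads (grouped-query)

print(f"MHA: {mha / 2**30:.0f} GiB, GQA: {gqa / 2**30:.0f} GiB ({mha // gqa}x smaller)")
```

With these assumed numbers, a full multi-head cache for a 1M-token sequence runs to terabytes, while grouping KV heads 8:1 cuts it by the same 8x factor — the difference between impossible and merely expensive.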

Effective recall

"Needle in haystack" tests measure whether a model can retrieve a specific fact from a long context. Most frontier models score 90%+ at ≤128K tokens and degrade above that. Synthesis tasks (combining facts from multiple locations in a long context) drop more sharply, often to 60-70% accuracy even when single-fact retrieval sits at 95%.
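The test itself is simple to sketch: plant a distinctive sentence at a chosen depth in filler text, ask the model about it, and check the answer by containment. The harness below is a minimal version; `ask_model` stands in for whatever client you use and is not defined here.

```python
def build_haystack_prompt(needle, depth_pct, n_filler=2000):
    """Plant a needle sentence at a given percentage depth within filler text."""
    filler = ["The sky was a uniform shade of grey that afternoon."] * n_filler
    pos = int(len(filler) * depth_pct / 100)
    doc = filler[:pos] + [needle] + filler[pos:]
    return " ".join(doc) + "\n\nQuestion: What is the secret passphrase?"

def score(answer, expected="blue-harvest-42"):
    """Case-insensitive containment check, as most published harnesses use."""
    return expected.lower() in answer.lower()

needle = "The secret passphrase is blue-harvest-42."
prompt = build_haystack_prompt(needle, depth_pct=50)
# Send `prompt` to the model under test (ask_model is your own client):
# assert score(ask_model(prompt))
```

Sweeping `depth_pct` from 0 to 100 at several context lengths is what produces the familiar recall heatmaps; synthesis versions plant several needles and require the answer to combine them.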

The lesson: long context exists; effective long context is shorter than the headline number.

Cost math

Pricing is roughly linear in tokens. A 1M-token query might cost $5-30 depending on provider. Latency: prefill of 1M tokens takes 5-30 seconds even on optimised servers, and streaming can't hide it — the time to first token includes the full prefill.
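Because pricing is linear, the estimate is one line of arithmetic. The rates below ($5/M input, $15/M output) are illustrative assumptions, not any provider's actual price sheet.

```python
def query_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Linear token pricing: cost = tokens x per-million rate, input and output priced separately."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# Illustrative rates (assumed): $5/M input, $15/M output.
cost = query_cost(1_000_000, 2_000, 5.0, 15.0)
print(f"${cost:.2f}")  # → $5.03
```

Note the asymmetry: at these rates a full-context query is dominated entirely by the input side, so caching or trimming the prompt matters far more than shortening the answer.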

Long context vs RAG

Long context wins when the task needs broad reasoning across many documents. RAG wins when the corpus is large, queries are narrow, or freshness matters. Most production systems use both: RAG for first-pass retrieval, long context for the final synthesis stage when the relevant subset still won’t fit in 32K.
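The hybrid pattern described above can be sketched as a small routing function. The `retrieve` and `synthesize` callables here are hypothetical stand-ins (a real system would use an embedding index and an LLM client); the word-count token estimate is a deliberate simplification.

```python
def hybrid_answer(query, corpus, retrieve, synthesize, budget_tokens=32_000):
    """First-pass retrieval narrows the corpus; if the surviving subset
    still exceeds a small-context budget, route to a long-context call."""
    docs = retrieve(query, corpus)                    # RAG stage
    total = sum(len(d.split()) for d in docs)         # crude token estimate
    mode = "long-context" if total > budget_tokens else "standard"
    return synthesize(query, docs, mode)              # final synthesis stage

# Toy stand-ins to show the control flow:
corpus = ["Q3 revenue grew 12% on cloud demand.", "The cafeteria menu changed."]
retrieve = lambda q, c: [d for d in c if "revenue" in d]
synthesize = lambda q, docs, mode: f"[{mode}] " + " ".join(docs)

print(hybrid_answer("How did revenue grow?", corpus, retrieve, synthesize))
```

The design choice worth noting: retrieval decides *what* the model sees, while the budget check decides *how* it sees it — only spilling into the expensive long-context path when the relevant subset genuinely won't fit.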