Tokens, Embeddings, and Context Windows, Explained
Three concepts you need to understand before you can reason about any LLM cost, latency, or capability question. Each is simpler than the terminology suggests.
Tokens: the unit of language for language models
An LLM doesn’t see characters and it doesn’t see words. It sees tokens. A token is a sub-word fragment that the model’s tokeniser treats as atomic.
Tokenisers are built once per model family. They learn a vocabulary (typically 50,000-250,000 tokens) that covers common words as single tokens and splits rare words into multiple sub-word pieces. Common words like “the”, “of”, and “because” are each one token. Unusual words like “prestidigitation” might be three or four.
Every token gets an integer ID. The sentence “Hello, Nova.” might become [15496, 11, 19015, 13] depending on the tokeniser. The model only ever works with these integers.
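To make the sub-word idea concrete, here is a toy tokeniser that does greedy longest-match lookup against a tiny hand-written vocabulary. The vocabulary and the IDs are invented for illustration; real tokenisers (BPE, WordPiece) learn their pieces and merge rules from data.

```python
# Invented vocabulary: maps text pieces to integer IDs.
# Note " Nova" includes its leading space, as real tokenisers often do.
VOCAB = {"Hello": 0, ",": 1, " Nova": 2, ".": 3, " ": 4,
         "N": 5, "o": 6, "v": 7, "a": 8}

def tokenise(text):
    ids = []
    i = 0
    while i < len(text):
        # Greedy: take the longest vocabulary entry matching at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += length
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenise("Hello, Nova."))  # [0, 1, 2, 3]
```

Because “ Nova” is in the vocabulary it comes out as one token; delete it from `VOCAB` and the same string falls back to the single-character pieces, which is exactly how rare words get split.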
Why 1 word ≠ 1 token
Two rules of thumb for English:
- 1 word ≈ 1.3 tokens on average
- 1 token ≈ 4 English characters
For other languages the ratio is different. Chinese and Japanese tokenisers often get 1-2 characters per token. Agglutinative languages like Finnish or Turkish tokenise into more pieces per word than English, because a single word can pack in morphemes that English would spell as separate words. Code tokenises densely when it’s “common” (`for`, `if`, `=`) and sparsely for unusual identifiers.
Why it matters: billing, context limits, and latency are all measured in tokens, not words. A “128k context window” is 128,000 tokens, which is roughly 96,000 English words, or around 380 double-spaced pages at ~250 words per page. Your 10,000-word document is about 13,000 tokens.
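The two rules of thumb above can be turned into a quick back-of-envelope estimator. This is a heuristic only; for billing or context-limit decisions, count with the model’s actual tokeniser.

```python
def estimate_tokens(text: str) -> int:
    # English rules of thumb: ~1.3 tokens per word, ~4 characters per token.
    # Averaging the two estimates smooths out short-word/long-word extremes.
    by_words = len(text.split()) * 1.3
    by_chars = len(text) / 4
    return round((by_words + by_chars) / 2)

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 11
```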
Embeddings: words as vectors
The first thing the model does with an input token is look up its embedding: a dense vector of floating-point numbers (usually 2,048-12,288 dimensions in modern models) that represents the token’s “meaning.”
Embeddings aren’t hand-assigned. They’re learned during training. Over the course of seeing billions of sentences, the model adjusts each token’s embedding so that tokens used in similar contexts end up with similar vectors.
The embedding layer is just a giant lookup table: one row per token in the vocabulary (50,000-250,000 rows), each row a vector. Look up the token ID, read the vector, feed it into the next layer.
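The lookup really is that simple. A sketch with a random stand-in table (in a trained model these values are learned weights, and the dimension is in the thousands rather than 8):

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 50_000, 8  # DIM kept tiny for illustration

# Random stand-ins for learned weights: one vector per vocabulary entry.
embedding_table = [[random.gauss(0, 1) for _ in range(DIM)]
                   for _ in range(VOCAB_SIZE)]

token_ids = [15496, 11, 19015, 13]                 # IDs from the tokeniser
vectors = [embedding_table[i] for i in token_ids]  # plain row lookup
print(len(vectors), len(vectors[0]))  # 4 8
```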
The embedding vectors flow up through the transformer layers, being transformed and mixed with other positions’ vectors at every step. The last layer’s output is projected back onto the vocabulary to give a probability distribution over the next token.
The geometry of meaning
Because embeddings are just vectors, you can do math on them. The famous early result:
embedding("king") − embedding("man") + embedding("woman") ≈ embedding("queen")
Concepts form directions in the embedding space. The “gender” direction is roughly (woman − man). The “royalty” direction is something like (king − man) or (queen − woman). Cities cluster near cities, foods near foods, and so on.
This is the foundation of semantic search. You compute an embedding for a query, compute embeddings for a million documents, and find the documents whose embeddings are nearest (by cosine similarity). The search is “semantic” because the embedding captures meaning, not just keywords.
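A minimal sketch of that nearest-neighbour search. The document vectors here are random stand-ins (in practice they come from an embedding model), and the query is constructed to sit near document 42 so the search has something to find:

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

random.seed(1)
DIM = 64
# Toy corpus: random stand-ins for pre-computed document embeddings.
docs = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]
# A query embedding that lies close to document 42.
query = [x + 0.1 * random.gauss(0, 1) for x in docs[42]]

# Rank every document by cosine similarity to the query.
best = max(range(len(docs)), key=lambda i: cosine(docs[i], query))
print(best)  # 42
```

At scale you wouldn’t brute-force all million documents; approximate nearest-neighbour indexes (HNSW, IVF) make the lookup sub-linear, but the similarity measure is the same.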
Context windows: what they really mean
The context window is the maximum number of tokens the model can process at once. It includes both the input (your prompt, system prompt, RAG retrievals, chat history) and the output (what the model generates).
Context windows in 2025:
- 2K-8K: legacy models.
- 32K-128K: most workhorse APIs.
- 200K-2M: frontier offerings (Claude, Gemini).
What actually matters isn’t just the total size but the effective use. “Needle in a haystack” tests measure whether a model reliably finds a single piece of information buried in a long context. Modern models are good to roughly 80% of their advertised context; above that, recall drops.
The practical math you need
Three calculations come up constantly.
1. Cost per request. APIs price input tokens and output tokens separately. Your cost is (input tokens × input rate) + (output tokens × output rate). Input is cheaper than output, often by 5×.
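The cost formula in one function. The default rates are illustrative placeholders (per million tokens), not any provider’s actual pricing:

```python
def request_cost(input_tokens, output_tokens,
                 input_rate_per_mtok=3.00, output_rate_per_mtok=15.00):
    """Cost in dollars. Rates are per million tokens; the defaults are
    made-up placeholders showing a 5x input/output gap."""
    return (input_tokens * input_rate_per_mtok
            + output_tokens * output_rate_per_mtok) / 1_000_000

# A 10k-token prompt with a 1k-token answer, at the placeholder rates:
print(request_cost(10_000, 1_000))  # 0.045
```

Note that the short output contributes a third of the cost despite being a tenth of the tokens; that asymmetry is why verbose responses dominate bills.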
2. Time per request. The input is processed once (prefill). The output generates one token per pass (decode). Prefill is fast and parallel. Decode is sequential, typically 50-150 tokens/second on a single API call. A 4,000-token response takes 25-80 seconds.
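The latency math, with the decode rate and prefill time as explicit assumptions you would measure for your own deployment:

```python
def estimate_latency(output_tokens, decode_tps=100, prefill_seconds=1.0):
    # Decode dominates: one sequential pass per generated token.
    # 100 tokens/s and a 1 s prefill are illustrative assumptions, not
    # measurements of any particular API.
    return prefill_seconds + output_tokens / decode_tps

print(estimate_latency(4_000))  # 41.0 seconds
```

This is also the arithmetic behind the practical advice to cap `max_tokens`: halving the output length roughly halves the wall-clock time.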
3. Context packing. If your prompt is 200,000 tokens long, not all of it helps. Every token is a latency cost and sometimes a confusion cost (the model has to decide what to attend to). Compress. Summarise old turns. Retrieve only relevant documents. A lean 10k-token prompt often outperforms a 200k-token one for the same task.
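One concrete packing strategy, sketched under assumptions: the function name is hypothetical, and `count_tokens` stands in for the model’s own tokeniser (the test below fakes it with a word count). Keep the system prompt, then admit recent turns newest-first until the budget runs out:

```python
def pack_context(system_prompt, history, budget_tokens, count_tokens):
    # Walk history newest-first, keeping turns while they fit the budget.
    kept = []
    remaining = budget_tokens - count_tokens(system_prompt)
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    # Restore chronological order for the final prompt.
    return [system_prompt] + list(reversed(kept))
```

Swapping the break for a summarisation call on the overflowing turns is the usual refinement: old turns get compressed instead of dropped.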
Internalising these three calculations (cost, latency, context packing) makes LLM engineering feel less like magic and more like the capacity-planning problem it actually is.