AI & ML · Intermediate · By Samson Tanimawo, PhD · Published Jul 22, 2025 · 10 min read

The Attention Mechanism, Decoded

Attention is the operation that lets a token decide which other tokens matter to it. Once you see the math of query, key, and value, transformers stop being mysterious.

The intuition

Earlier neural networks (RNNs, CNNs) processed sequences with a fixed local view. A word looked at its neighbours; long-range relationships were learned slowly or not at all.

Attention turns that around. For every token, the network computes a weighted average over all other tokens, where the weights say “how much does each other token matter to me right now.” Long-range dependencies become first-class.

For the sentence “The cat that the dog chased ran away,” understanding which animal ran requires connecting “ran” with “cat” across several other words. Attention does this in one operation, regardless of distance.

Query, key, value

For each token, the network produces three vectors:

  1. Query (Q): what this token is looking for in the rest of the sequence.
  2. Key (K): what this token offers for other tokens to match against.
  3. Value (V): the information this token contributes once it is attended to.

All three are linear projections of the token’s embedding; the projection matrices are learned during training. Each token gets all three, and each plays a different role in the operation.
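A minimal sketch of the projections in NumPy (names, dimensions, and the random stand-in weights are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

N, d_model, d_k = 4, 8, 8               # sequence length, embedding dim, head dim
X = rng.standard_normal((N, d_model))   # one embedding per token

# Learned projection matrices (random stand-ins here)
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

# Each token gets all three vectors via a linear projection
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)        # (4, 8) (4, 8) (4, 8)
```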

Token A’s attention to token B is computed as the dot product of A’s query and B’s key, scaled and softmaxed. The result is a weight; A’s output is the weighted sum of all tokens’ values.

The single formula that powers everything

The whole attention operation is:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Read it like this:

  1. Q K^T: dot every query with every key. Result: an N×N matrix of compatibility scores (where N is sequence length).
  2. / sqrt(d_k): divide by sqrt of the dimension. Prevents the softmax from saturating when d_k is large.
  3. softmax: turn each row into a probability distribution. For each token, “how much weight do I put on each other token?”
  4. V: weight-sum the value vectors using those weights. Result: an N×d_v output (d_v is the value dimension), one updated representation per token.
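The four steps map almost line-for-line onto code. A minimal NumPy sketch (function name and shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # steps 1-2: N x N scores, scaled
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # step 3: each row sums to 1
    return weights @ V                              # step 4: weighted sum of values

rng = np.random.default_rng(0)
N, d_k = 5, 16
Q, K, V = (rng.standard_normal((N, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)   # (5, 16)
```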

That’s the entire mechanism. Every transformer is a stack of variations on this core operation.

Multi-head attention

One attention operation captures one kind of relationship. Multi-head attention runs H attention operations in parallel (typically H = 8 to 96), each with smaller Q/K/V dimensions; the H outputs are concatenated and projected back to the model dimension.

Each head learns to attend to different patterns: one head might track grammatical agreement, another might track topical similarity, another might track positional patterns. Empirically, multi-head outperforms single-head at the same parameter count.

Modern frontier models use 96-128 attention heads in their largest layers. Adding heads multiplies the number of attention operations, but each head works in a proportionally smaller dimension, so total compute stays close to that of a single full-width head.
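A sketch of the split-concatenate pattern, assuming the common design where d_model is divided evenly across heads (all names and the random weights are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, H):
    N, d_model = X.shape
    d_head = d_model // H

    # Project once, then split the last dimension into H heads: (H, N, d_head)
    def split(W):
        return (X @ W).reshape(N, H, d_head).transpose(1, 0, 2)

    Q, K, V = split(W_q), split(W_k), split(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, N, N), one per head
    out = softmax(scores) @ V                            # (H, N, d_head)

    # Concatenate head outputs and project back to d_model
    out = out.transpose(1, 0, 2).reshape(N, d_model)
    return out @ W_o

rng = np.random.default_rng(0)
N, d_model, H = 6, 32, 8
X = rng.standard_normal((N, d_model))
Ws = [rng.standard_normal((d_model, d_model)) for _ in range(4)]
Y = multi_head_attention(X, *Ws, H)
print(Y.shape)   # (6, 32)
```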

Masking for autoregression

For language models, a token at position i shouldn’t see tokens at positions i+1, i+2, etc. (because it’s trying to predict them).

The fix is a causal mask: before the softmax, set the upper triangle of the attention scores to -infinity. The softmax sends those to zero. Each token sees only itself and earlier tokens.
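A sketch of the mask in NumPy (illustrative; production implementations apply it inside the attention kernel):

```python
import numpy as np

N = 4
scores = np.random.default_rng(0).standard_normal((N, N))

# Causal mask: -inf above the diagonal, so softmax zeroes out future tokens
mask = np.triu(np.full((N, N), -np.inf), k=1)
masked = scores + mask

e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(np.triu(weights, k=1).max())   # 0.0: no weight on future positions
```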

This is what makes a transformer autoregressive. Decoder transformers (like GPT) use causal masking; encoder transformers (like BERT) don’t, so they can see the whole context bidirectionally.

How attention scaled from 2017 to today

The 2017 “Attention Is All You Need” paper worked with sequence lengths of around 512 tokens. Attention’s compute and memory are O(N^2) in sequence length: doubling N quadruples cost.
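The quadratic growth is easy to make concrete: the score matrix alone has N^2 entries, per head, per layer.

```python
# Size of one float32 N x N attention-score matrix (per head, per layer)
for N in (512, 8_192, 131_072):
    size = N * N * 4  # 4 bytes per float32
    print(f"N={N:>7}: {size / 2**20:,.0f} MiB")
# N=    512: 1 MiB
# N=  8,192: 256 MiB
# N=131,072: 65,536 MiB  (64 GiB)
```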

Three innovations made longer contexts tractable:

  1. Sparse and windowed attention: each token attends to a subset of positions, cutting into the O(N^2) term.
  2. Hardware-aware exact attention (e.g. FlashAttention): computing the same result without materializing the full N×N score matrix in memory.
  3. Better positional encodings (e.g. RoPE, ALiBi): letting models generalize to sequence lengths beyond those seen in training.

Modern frontier models combine all three. Million-token context windows (Claude, Gemini) are possible because the underlying attention has been re-engineered, not because someone just made the model bigger.