The Transformer Architecture, Explained
Every modern LLM is a transformer. The architecture is simpler than it looks: an embedding layer, a stack of identical blocks, and an output head. Once you can draw it, you can read any model paper.
The architectural shape
A transformer is three things in series:
- Token embedding + positional encoding: turn token IDs into vectors that carry both meaning and position information.
- A stack of N identical transformer blocks: each block updates each token’s representation, mixing information across positions.
- An output head: a linear projection from the final layer’s representations to the output (next-token logits, classification scores, etc.).
That’s the whole architecture. The depth N varies (12 in BERT-base, 24 in BERT-large, 96+ in modern frontier models), but the shape is the same.
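The three stages can be sketched in a few lines of numpy. Everything here is a toy: random weights, tiny dimensions, and a stand-in block (a real block is attention plus a feedforward network, covered below). The point is only the shape of the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_layers = 100, 16, 2          # toy sizes, not real model dims

embed = rng.normal(size=(vocab, d_model))      # token embedding table
w_out = rng.normal(size=(d_model, vocab))      # output head (next-token logits)

def block(x):
    # stand-in for a real transformer block (attention + feedforward)
    return x + 0.1 * np.tanh(x)

token_ids = np.array([5, 42, 7])
x = embed[token_ids]                 # (seq, d_model): look up each token's vector
for _ in range(n_layers):            # stack of N identical blocks
    x = block(x)
logits = x @ w_out                   # (seq, vocab): a score per vocab item, per position
print(logits.shape)                  # (3, 100)
```

Note the invariant: the representation stays shape `(seq, d_model)` through the entire stack, which is what lets you add or remove blocks without changing anything else.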
The transformer block
One block chains four operations:
- Layer normalisation.
- Multi-head self-attention.
- Layer normalisation.
- Feedforward network (two linear layers with a non-linearity).
The attention and feedforward sub-layers each sit inside a residual (skip) connection: the sub-layer’s input is added to its output. Without residuals, deep transformers don’t train. With them, the layers can refine the representation incrementally.
The feedforward network is sometimes called “the MLP block.” It’s a per-token transformation: each token’s representation gets passed through a wider hidden layer (typically 4x the model dimension) and back. Most of the model’s parameters live here.
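Here is a minimal single-head, pre-norm sketch of one block in numpy. It is illustrative only: weights are random, the learned norm gains are omitted, and real models use many attention heads. The residual additions in `transformer_block` are the skip connections described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32   # toy sizes; d_ff is typically 4x d_model

def layer_norm(x, eps=1e-5):
    # normalise each token's vector to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# single-head self-attention (real models split d_model across several heads)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
def attention(x):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d_model)     # every token scores every other token
    return softmax(scores) @ v              # weighted mix across positions

# feedforward: widen to d_ff, non-linearity, project back (per token)
w1 = rng.normal(size=(d_model, d_ff))
w2 = rng.normal(size=(d_ff, d_model))
def feedforward(x):
    return np.maximum(x @ w1, 0) @ w2       # ReLU here; modern models use SwiGLU

def transformer_block(x):
    x = x + attention(layer_norm(x))        # residual around attention
    x = x + feedforward(layer_norm(x))      # residual around feedforward
    return x

x = rng.normal(size=(5, d_model))           # 5 tokens in, 5 tokens out
y = transformer_block(x)
print(y.shape)  # (5, 8)
```

Counting parameters makes the “most of the model lives in the feedforward” claim concrete: attention here costs 3 × d_model² for Q/K/V, while the feedforward costs 2 × d_model × d_ff = 8 × d_model² when d_ff = 4 × d_model.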
Encoder vs decoder
The original transformer paper paired an encoder (bidirectional attention) with a decoder (causal attention plus cross-attention to the encoder output); the combination was designed for machine translation.
Modern descendants split into:
- Encoder-only (BERT, RoBERTa): bidirectional attention. Used for embedding, classification, and other tasks where the whole sequence is available.
- Decoder-only (GPT, Claude, Llama, Mistral): causal attention. Predicts next tokens. The dominant architecture for LLMs.
- Encoder-decoder (T5, BART, original Transformer): both. Translation, summarisation. Less common today.
From 2018 to 2024 the trend was decoder-only winning everywhere, because predicting the next token turns out to be a sufficient training objective for almost everything.
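The difference between bidirectional and causal attention comes down to a mask on the attention scores. A small numpy illustration (in a real model, masked-out scores are set to negative infinity before the softmax so they get zero weight):

```python
import numpy as np

seq = 4
# bidirectional (encoder-only): every position attends to every other
bidirectional = np.ones((seq, seq), dtype=bool)
# causal (decoder-only): position i attends only to positions 0..i,
# so it can never peek at future tokens during next-token prediction
causal = np.tril(np.ones((seq, seq), dtype=bool))
print(causal.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

The causal mask is also why decoder-only training is so efficient: every position in a sequence is a training example, computed in one parallel pass.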
Positional encoding
Self-attention is permutation-invariant: shuffle the tokens and the operation gives the same result for each (just reordered). To preserve word order, the model needs to know where each token sits.
Four approaches you’ll see:
- Sinusoidal (original 2017): fixed sin/cos encodings added to token embeddings. Simple, and in principle extends beyond trained context lengths (though extrapolation is weak in practice).
- Learned absolute (BERT, GPT-2): a learned embedding per position. Doesn’t generalise past training context length.
- Rotary (RoPE): applies rotation matrices to query and key vectors based on position. Better long-context behaviour. Used in Llama, Mistral, GPT-NeoX.
- ALiBi: applies a linear bias to attention scores based on relative position. Generalises well to long contexts.
RoPE has won. Almost every open-weight model in 2024-2025 uses RoPE or a close variant.
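The idea behind RoPE can be shown in a few lines. This is a simplified sketch using the half-split pairing convention; real implementations differ in layout details. Each pair of dimensions in a query or key vector is rotated by an angle proportional to its position, and the useful property falls out of rotation algebra: dot products between rotated vectors depend only on the *relative* offset between positions, not the absolute positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]          # pair dim i with dim i + half
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# key property: q.k after rotation depends only on the position offset
q = np.array([1.0, 0.5, -0.3, 0.8])
k = np.array([0.2, -1.0, 0.7, 0.4])
a = rope(q, pos=3) @ rope(k, pos=1)    # offset 2, at positions 3 and 1
b = rope(q, pos=10) @ rope(k, pos=8)   # offset 2, at positions 10 and 8
print(np.isclose(a, b))  # True
```

Because attention only ever sees these dot products, the model reasons about relative distances between tokens, which is part of why RoPE behaves better at long context.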
Modern variants
The 2017 architecture has been refined:
- Pre-norm vs post-norm: layer normalisation moved before the attention/feedforward instead of after. Trains more stably for deep models.
- RMS norm: a faster, simpler variant of layer norm. Marginal accuracy impact, real speedup.
- SwiGLU activation: replaces ReLU in the feedforward block. Slightly better empirically.
- Grouped-query attention: shares each K/V head across several query heads. Shrinks the KV cache and cuts memory bandwidth substantially at decode time.
- Mixture of Experts: replaces dense feedforward with sparse expert routing. Discussed in our MoE post.
A 2025 frontier model is a transformer with all of these refinements layered on. The high-level architecture is identical to 2017; the details reflect a decade of engineering refinement.
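The RMS norm refinement is small enough to show side by side with standard layer norm. This sketch omits the learned gain (and bias) parameters that real implementations carry; the structural difference is just that RMS norm skips the mean subtraction.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # subtract the mean, divide by the standard deviation
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-5):
    # skip the mean subtraction: divide by the root-mean-square only,
    # saving one reduction pass over the vector
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x))
print(rms_norm(x))
```

One fewer statistic to compute per token per layer is a negligible change on paper, but across 96 layers and trillions of training tokens it is a real throughput win, which is why Llama-family models adopted it.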
Why it won
Three reasons the transformer beat RNNs and CNNs:
- Parallelism: attention computes all tokens simultaneously. RNNs are inherently sequential. Transformers train much faster on GPUs.
- Long-range dependencies: every token sees every other token directly. RNNs propagate information one step at a time and forget over long distances.
- Scaling: transformer performance keeps improving with more parameters and data. RNNs and CNNs plateau earlier.
The Bitter Lesson applied: simple architectures that scale beat clever architectures that don’t. Transformers happened to be both simple and scalable. Eight years later, every model you’ve heard of is one.