AI & ML · Intermediate · By Samson Tanimawo, PhD · Published Jul 29, 2025 · 10 min read

The Transformer Architecture, Explained

Every modern LLM is a transformer. The architecture is simpler than it looks: an embedding layer, a stack of identical blocks, and an output head. Once you can draw it, you can read any model paper.

The architectural shape

A transformer is three things in series:

  1. Token embedding + positional encoding: turn token IDs into vectors that carry both meaning and position information.
  2. A stack of N identical transformer blocks: each block updates each token’s representation, mixing information across positions.
  3. An output head: a linear projection from the final layer’s representations to the output (next-token logits, classification scores, etc.).

That’s the whole architecture. The depth N varies (12 in BERT-base, 24 in BERT-large, 96+ in modern frontier models), but the shape is the same.
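
The three stages can be sketched in a few lines of NumPy. Everything here (the sizes, the random weights, the stand-in block) is illustrative, not a real model; the point is the data flow: IDs in, logits out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, not from any real model.
vocab, d_model, n_layers, seq_len = 100, 16, 2, 5

# 1. Token embedding + (learned) positional embedding.
tok_emb = rng.normal(size=(vocab, d_model)) * 0.02
pos_emb = rng.normal(size=(seq_len, d_model)) * 0.02

def block(x):
    # Stand-in for a full transformer block: just a residual
    # per-token transform here, to show the shape of the data flow.
    w = rng.normal(size=(d_model, d_model)) * 0.02
    return x + x @ w

# 3. Output head: project back to vocabulary logits.
w_out = rng.normal(size=(d_model, vocab)) * 0.02

token_ids = np.array([3, 14, 15, 9, 2])
x = tok_emb[token_ids] + pos_emb      # (seq_len, d_model)
for _ in range(n_layers):             # 2. the stack of N blocks
    x = block(x)
logits = x @ w_out                    # (seq_len, vocab)
print(logits.shape)                   # (5, 100)
```

Note that the representation keeps the same shape, (seq_len, d_model), all the way through the stack; only the output head changes it.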

The transformer block

One block contains four operations, applied with skip connections:

  1. Layer normalisation.
  2. Multi-head self-attention.
  3. Layer normalisation.
  4. Feedforward network (two linear layers with a non-linearity).

Each sublayer (the attention and the feedforward) sits inside a residual (skip) connection: the sublayer’s input is added to its output, with the normalisation applied inside the branch. Without residuals, deep transformers don’t train. With them, each layer only has to refine the representation incrementally.

The feedforward network is sometimes called “the MLP block.” It’s a per-token transformation: each token’s representation gets passed through a wider hidden layer (typically 4x the model dimension) and back. Most of the model’s parameters live here.
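
A minimal NumPy sketch of one pre-norm block, matching the four-operation list above. For brevity it uses a single attention head and a plain ReLU feedforward (real models use several heads in parallel and gated activations); all sizes are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single head for clarity; real blocks run several in parallel.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def transformer_block(x, params):
    # 1-2. Normalise, attend, add back (residual).
    x = x + self_attention(layer_norm(x), *params["attn"])
    # 3-4. Normalise, feedforward (4x wider hidden layer), add back.
    w1, w2 = params["mlp"]
    h = np.maximum(layer_norm(x) @ w1, 0)   # ReLU for simplicity
    return x + h @ w2

rng = np.random.default_rng(0)
d = 8
params = {
    "attn": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
    "mlp": [rng.normal(size=(d, 4 * d)) * 0.1,
            rng.normal(size=(4 * d, d)) * 0.1],
}
x = rng.normal(size=(5, d))
y = transformer_block(x, params)
print(y.shape)  # (5, 8)
```

Counting parameters confirms the claim in the text: the two MLP matrices hold 2 · 4d² weights against 3d² for the attention projections.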

Encoder vs decoder

The original transformer paper (“Attention Is All You Need”, 2017) had both an encoder (bidirectional attention) and a decoder (causal attention plus cross-attention to the encoder output). It was designed for machine translation.

Modern descendants split into:

  1. Encoder-only (the BERT family): bidirectional attention; good for classification and embeddings.
  2. Decoder-only (the GPT family): causal attention, trained on next-token prediction.
  3. Encoder-decoder (the T5 family): both halves, for sequence-to-sequence tasks.

The 2018-2024 trend was decoder-only winning everywhere, because predicting the next token turns out to be a sufficient training objective for almost everything.

Positional encoding

Self-attention is permutation-equivariant: shuffle the tokens and each token gets the same result, just reordered to match. To preserve word order, the model has to be told where each token sits.
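
That claim can be checked numerically: permuting the input rows of a (toy-sized, single-head) self-attention just permutes the output rows the same way.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
d = 4
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(6, d))     # 6 tokens
perm = rng.permutation(6)

# Shuffling the tokens just shuffles the outputs the same way:
out_then_perm = self_attention(x, wq, wk, wv)[perm]
perm_then_out = self_attention(x[perm], wq, wk, wv)
print(np.allclose(out_then_perm, perm_then_out))  # True
```

Nothing in the attention operation itself distinguishes position 1 from position 6, which is exactly why positional information has to be injected separately.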

Three approaches you’ll see:

  1. Sinusoidal encodings (the original paper): fixed sine/cosine waves added to the embeddings.
  2. Learned absolute embeddings (BERT, GPT-2): one trained vector per position.
  3. Rotary position embeddings (RoPE): rotate the query and key vectors by a position-dependent angle, so attention scores depend on relative offsets.

RoPE has won. Almost every open-weight model in 2024-2025 uses RoPE or a close variant.
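
The property that makes RoPE attractive can also be verified numerically: after rotating queries and keys by position-dependent angles, the q·k attention score depends only on the relative offset between the two positions, not on their absolute values. A toy sketch (the base frequency of 10,000 is the conventional choice; everything else is illustrative):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive pairs of dimensions by position-dependent angles.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 8))

# The score depends only on the *relative* offset between positions:
s1 = rope(q, 3) @ rope(k, 7)    # positions 3 and 7  (offset 4)
s2 = rope(q, 10) @ rope(k, 14)  # positions 10 and 14 (offset 4)
print(np.allclose(s1, s2))      # True
```

Because only the relative offset matters, RoPE generalises more gracefully to positions beyond those seen in training than absolute embeddings do.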

Modern variants

The 2017 architecture has been refined:

  1. Pre-norm: layer normalisation before each sublayer rather than after, which stabilises deep training.
  2. RMSNorm instead of LayerNorm: drop the mean-centering, keep the rescaling.
  3. Gated activations (SwiGLU) in the feedforward network instead of plain ReLU.
  4. RoPE instead of absolute positional embeddings.
  5. Grouped-query attention: share key/value heads across query heads to shrink the inference-time cache.

A 2025 frontier model is a transformer with all of these refinements layered on. The high-level architecture is identical to 2017; the details are eight years of engineering refinement.
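
As one concrete example of these refinements, RMSNorm drops LayerNorm’s mean-subtraction and bias and simply rescales each token by its root mean square. A minimal sketch (the `gain` parameter stands in for the learned per-dimension scale; the input values are illustrative):

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: divide by the root-mean-square of the features only;
    # no mean subtraction, no bias. Cheaper than LayerNorm.
    rms = np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
    return x / rms * gain

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = rms_norm(x, gain=np.ones(4))
print(np.round(y, 3))
```

After normalisation the features have unit root mean square, which is the only statistic RMSNorm controls; LayerNorm additionally forces a zero mean.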

Why it won

Three reasons the transformer beat RNNs and CNNs:

  1. Parallel training: every position is processed at once, so accelerators stay busy; an RNN must step through the sequence token by token.
  2. Short paths between distant tokens: any token can attend directly to any other, instead of information passing through many intermediate steps.
  3. It scales: more data, more parameters, and more compute keep improving it predictably.

The Bitter Lesson applied: simple architectures that scale beat clever architectures that don’t. Transformers happened to be both simple and scalable. Eight years later, every model you’ve heard of is one.