Distributed Training: Data, Tensor, Pipeline Parallelism
A model with 100B parameters does not fit on one GPU: in bf16 the weights alone take roughly 200 GB, before gradients and optimiser state, against 80 GB of memory on an H100. The training cluster has to split the work three ways: across data, across tensors, and across pipeline stages. Each split has its own constraints.
The three dimensions
Distributed training splits work three ways. Each addresses a different bottleneck:
- Data parallelism: same model on each GPU, different data per GPU. Splits batch.
- Tensor parallelism: split each layer’s weight matrices across GPUs. Splits parameters.
- Pipeline parallelism: split the model into sequential stages, one stage per GPU group. Splits layers.
Modern frontier training combines all three (“3D parallelism”) because no single dimension scales to thousands of GPUs.
Data parallelism
Each GPU has a full copy of the model and processes a different microbatch. After backward pass, gradients are averaged across GPUs (all-reduce) before the optimiser step.
Strengths: simplest to set up, easiest to debug, near-linear scaling up to roughly 64-128 GPUs. Weakness: every GPU must hold the full model plus optimiser state, so it doesn’t help when the model itself doesn’t fit.
ZeRO sharding (DeepSpeed) is data parallelism with the optimiser state (stage 1), gradients (stage 2), and parameters (stage 3) progressively sharded across GPUs. Each GPU still does data-parallel work but holds only a fraction of the state. For medium-sized clusters this is often the single biggest memory win.
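To make the mechanism concrete, here is a minimal NumPy sketch of plain data parallelism, with four simulated GPUs and the all-reduce replaced by an in-process mean (all names and sizes are illustrative, not a real framework API):

```python
import numpy as np

# Sketch of data-parallel gradient averaging across 4 simulated GPUs.
# Each "GPU" holds an identical copy of the weights and computes gradients
# on its own microbatch; an all-reduce (here: a mean across replicas)
# synchronises the gradients before the optimiser step.
rng = np.random.default_rng(0)
n_gpus, dim = 4, 8
weights = rng.standard_normal(dim)          # identical copy on every GPU

# Each replica sees different data, so local gradients differ.
local_grads = [rng.standard_normal(dim) for _ in range(n_gpus)]

# All-reduce: average the gradients across replicas.
avg_grad = np.mean(local_grads, axis=0)

# Every GPU applies the same averaged gradient, so the copies stay in sync.
lr = 0.1
updated = [weights - lr * avg_grad for _ in range(n_gpus)]
assert all(np.allclose(u, updated[0]) for u in updated)
```

In a real run the mean is a `torch.distributed.all_reduce` over NCCL; the synchronisation logic is the same.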
Tensor parallelism
Within a single layer (e.g., a feedforward layer with weight matrix W of shape [4096, 16384]), the matrix is split across GPUs. With a column split, each GPU computes a slice of the output and the slices are concatenated (all-gather); with a row split, each GPU computes a partial sum and the partials are combined with an all-reduce. Megatron-LM pairs a column split of the first FFN matrix with a row split of the second, so the whole block needs only one all-reduce in the forward pass.
Strengths: works when individual layers are too big for one GPU. Typically necessary above roughly 30B parameters on 80 GB-class hardware such as the H100.
Weaknesses: communication-heavy; every layer requires at least one collective in both the forward and backward pass, so it is bandwidth-bound. Typically capped at 8 GPUs (within an NVLink-connected node).
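The column/row pairing can be checked numerically. Below is a toy NumPy sketch (4 simulated GPUs, toy dimensions rather than the [4096, 16384] above, nonlinearity omitted) in which a single final sum stands in for the all-reduce:

```python
import numpy as np

# Megatron-style tensor parallelism for a two-matrix FFN block.
rng = np.random.default_rng(1)
n_gpus, d_model, d_ff = 4, 16, 64
x = rng.standard_normal((2, d_model))        # activations, replicated on all GPUs
W1 = rng.standard_normal((d_model, d_ff))    # first FFN matrix: column-split
W2 = rng.standard_normal((d_ff, d_model))    # second FFN matrix: row-split

# Column-split W1: each GPU holds d_ff // n_gpus output columns.
W1_shards = np.split(W1, n_gpus, axis=1)
# Row-split W2 to match: each GPU consumes its own slice of the hidden dim.
W2_shards = np.split(W2, n_gpus, axis=0)

# Each GPU computes its partial result with no communication in between.
partials = [(x @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
# One all-reduce (here: a sum) at the end recovers the full FFN output.
y_parallel = np.sum(partials, axis=0)

assert np.allclose(y_parallel, (x @ W1) @ W2)
```

The real FFN applies a nonlinearity between the two matrices; because the column split keeps whole output columns on each GPU, that nonlinearity can be applied per shard without extra communication.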
Pipeline parallelism
Layers are split across GPU groups. Group 1 handles layers 0-19, group 2 handles layers 20-39, and so on. Microbatches flow through the pipeline like a factory line.
Strengths: scales well across nodes (lower bandwidth requirements than tensor parallelism). Necessary for very large models that span entire racks.
Weaknesses: pipeline bubbles, i.e. idle time while the pipeline fills at the start of each batch and drains at the end. Mitigated by 1F1B and interleaved schedules but never eliminated.
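The bubble cost is easy to quantify. For a GPipe/1F1B-style schedule with p stages and m microbatches per batch, the idle fraction is (p − 1) / (m + p − 1); a small sketch:

```python
# Pipeline bubble fraction for a GPipe/1F1B-style schedule:
# with p stages and m microbatches, idle fraction = (p - 1) / (m + p - 1).
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

# More microbatches shrink the bubble, but never eliminate it.
assert bubble_fraction(4, 4) == 3 / 7
assert bubble_fraction(4, 64) < bubble_fraction(4, 8)
assert bubble_fraction(1, 8) == 0.0   # a single stage has no bubble
```

This is why pipeline-parallel runs push the microbatch count well above the stage count: at m = 64 and p = 4 the bubble is under 5%.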
3D parallelism
Frontier models combine all three. A typical 70B-parameter training run might use:
- Tensor parallelism = 8 (within a node).
- Pipeline parallelism = 4 (across nodes).
- Data parallelism = 16 (across the cluster).
Total: 8 × 4 × 16 = 512 GPUs. Each dimension solves a different scaling problem; together they unlock training that no single dimension can.
The choice of split is governed by communication cost (TP needs NVLink-fast intra-node links), memory cost (PP works best when all stages have similar memory footprints), and compute cost (DP scales the global batch size). Frameworks like DeepSpeed and Megatron-LM automate the choice given a hardware topology.
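The two hard constraints above, TP inside one NVLink domain and the three degrees multiplying out to the cluster size, can be captured in a few lines. A hypothetical helper (illustrative only, not a real framework API):

```python
# Hypothetical layout check for 3D parallelism (names are illustrative).
def validate_layout(tp: int, pp: int, dp: int,
                    cluster_gpus: int, gpus_per_node: int = 8) -> None:
    # TP traffic is too heavy for the inter-node fabric: keep it on NVLink.
    if tp > gpus_per_node:
        raise ValueError("tensor parallelism must stay within one node")
    # Every GPU belongs to exactly one (tp, pp, dp) coordinate.
    if tp * pp * dp != cluster_gpus:
        raise ValueError("tp * pp * dp must equal the cluster size")

# The 70B example from the text: 8 x 4 x 16 = 512 GPUs.
validate_layout(tp=8, pp=4, dp=16, cluster_gpus=512)
```

A layout like tp=16 on 8-GPU nodes would be rejected even though the product still comes out to 512.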