AI & ML Advanced · By Samson Tanimawo, PhD · Published Oct 14, 2025 · 8 min read

Distributed Training: Data, Tensor, Pipeline Parallelism

A model with 100B parameters does not fit on one GPU. The training cluster has to split the work three ways: across data, across tensors, across pipeline stages. Each split has its own constraints.

The three dimensions

Distributed training splits work three ways. Each addresses a different bottleneck:

- Data parallelism splits the batch: each GPU sees different examples.
- Tensor parallelism splits individual layers: each GPU holds a slice of the weight matrices.
- Pipeline parallelism splits the layer stack: each GPU group owns a contiguous range of layers.

Modern frontier training combines all three (“3D parallelism”) because no single dimension scales to thousands of GPUs.

Data parallelism

Each GPU has a full copy of the model and processes a different microbatch. After the backward pass, gradients are averaged across GPUs (all-reduce) before the optimiser step.

Strengths: simplest to implement, easiest to debug, near-linear scaling up to roughly 64–128 GPUs. Weakness: every GPU must hold the full model plus optimiser state, so it doesn’t help when the model itself doesn’t fit.
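The gradient-averaging step is easy to see in miniature. A minimal sketch, simulated in plain Python with lists standing in for GPUs; all names here (`all_reduce_mean`, `data_parallel_step`, the toy MSE gradient) are illustrative, not a real framework API:

```python
def all_reduce_mean(grads_per_replica):
    """Average each gradient component across replicas (the all-reduce step)."""
    n = len(grads_per_replica)
    dim = len(grads_per_replica[0])
    return [sum(g[i] for g in grads_per_replica) / n for i in range(dim)]

def local_gradient(weights, microbatch):
    """Gradient of mean squared error for y = w.x on one microbatch."""
    dim = len(weights)
    grad = [0.0] * dim
    for x, y in microbatch:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i in range(dim):
            grad[i] += 2 * err * x[i] / len(microbatch)
    return grad

def data_parallel_step(weights, microbatches, lr=0.1):
    # 1. each replica computes gradients on its own shard of the batch
    local_grads = [local_gradient(weights, mb) for mb in microbatches]
    # 2. all-reduce: average gradients so every replica sees the same update
    grad = all_reduce_mean(local_grads)
    # 3. an identical optimiser step on every replica keeps weights in sync
    return [w - lr * g for w, g in zip(weights, grad)]
```

Because every replica applies the same averaged gradient, the weight copies never diverge, which is why debugging data parallelism is comparatively painless.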

ZeRO sharding (DeepSpeed) is data parallelism with the redundant state partitioned across GPUs: stage 1 shards the optimiser state, stage 2 adds gradients, and stage 3 adds the parameters themselves. Each GPU still effectively does data-parallel work but holds only a fraction of the state. The single biggest scaling win for medium-sized clusters.
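The memory win is easy to quantify with the standard mixed-precision accounting from the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 Adam state). A back-of-envelope sketch; the function name and stage semantics here are a simplification, not DeepSpeed's API:

```python
# Bytes per parameter under mixed-precision Adam training.
PARAM_BYTES, GRAD_BYTES, OPTIM_BYTES = 2, 2, 12

def zero_memory_gb(n_params, n_gpus, stage):
    """Model-state memory per GPU (activations excluded) for ZeRO stage 0-3."""
    params = PARAM_BYTES * n_params
    grads = GRAD_BYTES * n_params
    optim = OPTIM_BYTES * n_params
    if stage >= 1:          # stage 1: shard optimiser state
        optim /= n_gpus
    if stage >= 2:          # stage 2: also shard gradients
        grads /= n_gpus
    if stage >= 3:          # stage 3: also shard parameters
        params /= n_gpus
    return (params + grads + optim) / 1e9

# A 7B-parameter model on 64 GPUs:
# stage 0 -> 112 GB per GPU (doesn't fit on an 80 GB H100)
# stage 3 -> 1.75 GB per GPU
```

The 64x reduction at stage 3 is what makes "data parallelism, but the model doesn't have to fit on one GPU" possible.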

Tensor parallelism

Within a single layer (e.g., a feedforward layer with weight matrix W of shape [4096, 16384]), the matrix is split across GPUs along the column dimension. Each GPU computes its slice of the output; the slices are then combined with a collective (an all-gather for a column split; a row split would instead produce partial sums combined by all-reduce).

Strengths: works when the model layers are too big for one GPU. Necessary for > 30B parameter models on H100 hardware.

Weaknesses: communication-heavy; every layer needs a collective operation, so throughput is bandwidth-bound. Typically capped at 8 GPUs (the width of an NVLink-connected node).
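A column split can be simulated in a few lines of plain Python: each "GPU" holds a vertical slice of W, computes the matching slice of the output, and the slices are gathered back together. A minimal sketch with illustrative names (a real implementation would use NCCL collectives, not list concatenation):

```python
def matvec(x, W):
    """x (length d_in) times W (d_in x d_out), as a list of length d_out."""
    d_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]

def split_columns(W, n_gpus):
    """Give each GPU a contiguous block of W's columns."""
    per = len(W[0]) // n_gpus  # assumes d_out is divisible by n_gpus
    return [[row[g * per:(g + 1) * per] for row in W] for g in range(n_gpus)]

def tensor_parallel_matvec(x, W, n_gpus):
    shards = split_columns(W, n_gpus)
    # each GPU computes its slice of the output independently...
    partials = [matvec(x, shard) for shard in shards]
    # ...then the slices are gathered (all-gather) into the full output
    out = []
    for p in partials:
        out.extend(p)
    return out
```

The gathered result is bit-identical to the unsplit matvec; the cost moved from memory (each GPU holds 1/n of W) to communication (a collective per layer), which is exactly the trade-off described above.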

Pipeline parallelism

Layers are split across GPU groups. Group 1 handles layers 0-19, group 2 handles 20-39, etc. Microbatches flow through the pipeline like a factory line.

Strengths: scales well across nodes (lower bandwidth requirements than tensor parallelism). Necessary for very large models that span entire racks.

Weaknesses: pipeline bubbles, i.e. idle time at the start and end of each batch while the pipeline fills and drains. Mitigated by 1F1B and interleaved schedules, but never fully eliminated.
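For the simplest GPipe-style schedule, the bubble has a closed form: with p stages and m microbatches, (p − 1) slots at the fill and drain of each batch are idle, so the wasted fraction is (p − 1) / (m + p − 1). A one-liner makes the "mitigated but never eliminated" claim concrete:

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of a GPipe-style pipeline schedule."""
    return (stages - 1) / (microbatches + stages - 1)

# More microbatches shrink the bubble but never remove it:
# bubble_fraction(4, 4)  -> ~0.43 (43% of the pipeline idle)
# bubble_fraction(4, 32) -> ~0.086
```

1F1B and interleaved schedules reduce activation memory and the interleaving factor, but as long as stages > 1 the fraction stays strictly positive.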

3D parallelism

Frontier models combine all three. A typical 70B-parameter training run might use:

- 8-way tensor parallelism within each NVLink-connected node
- 4-way pipeline parallelism across groups of nodes
- 16-way data parallelism across pipeline replicas

Total: 8 × 4 × 16 = 512 GPUs. Each dimension solves a different scaling problem; together they unlock training that no single dimension can.
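In practice, each of the 512 GPUs gets a flat rank that the launcher decomposes into one coordinate per dimension, with tensor parallelism innermost so TP peers land on the same NVLink node. A minimal sketch of one such layout; the function name and ordering convention are illustrative, not any framework's actual API:

```python
def rank_to_coords(rank, tp=8, pp=4, dp=16):
    """Map a flat GPU rank in [0, tp*pp*dp) to (dp_rank, pp_rank, tp_rank)."""
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp            # fastest-varying: adjacent ranks share a node
    pp_rank = (rank // tp) % pp    # next: pipeline stage
    dp_rank = rank // (tp * pp)    # slowest: data-parallel replica
    return dp_rank, pp_rank, tp_rank

# rank 0   -> (0, 0, 0)
# rank 7   -> (0, 0, 7)   same TP group, so same NVLink node
# rank 511 -> (15, 3, 7)  last GPU in the 512-GPU job
```

Keeping the bandwidth-hungry dimension innermost is the whole game: TP traffic stays on NVLink, while the slower PP and DP collectives cross the node fabric.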

The choice of split is governed by communication cost (TP needs NVLink-speed intra-node links), memory cost (PP works best when all stages have similar memory footprints), and compute cost (DP scales the global batch size). Frameworks like DeepSpeed and Megatron-LM automate the choice given a hardware topology.