Parallelism in LLM Training
Every form of parallelism used to train large language models — data, tensor, pipeline, sequence, expert, context, and 3D — explained with animated diagrams, memory formulas, and the trade-offs that determine which to use.
Why Parallelism — The Scale Problem
A single A100 GPU has 80 GB of HBM memory. A 70B parameter model in BF16 needs ~140 GB just for weights — before gradients, optimizer states, or activations. Training GPT-4 scale models requires petaflops sustained across thousands of GPUs. Parallelism is not an optimization; it is a prerequisite.
The three bottlenecks that parallelism addresses:
Model weights + gradients + optimizer states + activations exceed what a single device holds. Model parallelism (tensor, pipeline, ZeRO) distributes memory across devices.
Even if the model fits in memory, training on a small batch is slow. Data parallelism multiplies throughput by processing more data simultaneously across many devices.
Every parallelism strategy requires synchronizing information between devices. The bandwidth and latency of inter-device links (NVLink, InfiniBand, PCIe) determines which parallelism strategies are practical at what scale.
The Complete Map
| Parallelism Type | What it splits | Memory saving | Comm. pattern | Primary use case |
|---|---|---|---|---|
| Data Parallel (DP) | Training data (mini-batches) | None (replicates model) | All-Reduce gradients | Batch size scaling; baseline throughput |
| ZeRO-1 | Optimizer states | 4× opt states | All-Gather + Reduce-Scatter | Reduce optimizer memory redundancy |
| ZeRO-2 | + Gradients | 8× grads+opt | All-Gather + Reduce-Scatter | Large models on DP nodes |
| ZeRO-3 / FSDP | + Parameters | 64× all states | All-Gather params (3× per step) | Largest models; 70B+ on commodity |
| Tensor Parallel (TP) | Individual layer weight matrices | 1/N params per device | All-Reduce per layer | FFN and Attention within a node |
| Pipeline Parallel (PP) | Layers across stages | 1/N layers per device | Point-to-point activations | Very deep models; across nodes |
| Sequence Parallel (SP) | Sequence dimension | 1/N activation mem | All-Gather + Reduce-Scatter | Long context; activation memory |
| Context Parallel (CP) | Sequence (K,V distributed) | 1/N KV cache | All-Reduce attn outputs | Extremely long sequences (100k+) |
| Expert Parallel (EP) | MoE experts across devices | 1/N expert params | All-to-All token dispatch | MoE models; expert capacity |
| 3D Parallel | DP × TP × PP combined | Multiplicative | All three patterns | Production frontier model training |
Data Parallelism
The simplest parallelism strategy: replicate the full model on every GPU, partition the mini-batch across GPUs, synchronize gradients after each backward pass. No architectural changes needed. The communication pattern is a single All-Reduce of gradients at the end of backward.
DDP vs DP in PyTorch
DataParallel (DP) uses a single process, multiple threads. The parameter server pattern: GPU 0 scatters data, collects gradients, broadcasts updates. GPU 0 becomes a bottleneck. Deprecated for serious use.
DistributedDataParallel (DDP) uses one process per GPU with NCCL all-reduce. All-reduce is symmetric — no parameter server bottleneck. Standard for data parallelism today. Gradient synchronization is overlapped with backward computation using gradient bucketing.
For a 7B model in mixed precision: weights ~14 GB, gradients ~14 GB, Adam optimizer states (fp32 weights + m + v) ~84 GB. Total per GPU: ~112 GB. Every GPU holds all of this identically. No memory reduction at all. ZeRO was invented to fix exactly this.
ZeRO — Zero Redundancy Optimizer
ZeRO keeps the data-parallel training pattern but eliminates the massive memory redundancy. In standard DDP every GPU stores a full identical copy of everything — parameters, gradients, and optimizer states. ZeRO asks: why do all 4 GPUs hold the same copy of the momentum buffer? They don't have to.
What Is Sharding?
Sharding means partitioning a tensor across N GPUs so each GPU holds exactly 1/N of it. No GPU can use its shard alone — but together they hold the complete tensor without redundancy.
Memory Math — Where Every Byte Comes From
📖 Beginner guide: Why 2Ψ for parameters? Why 12Ψ for optimizer states? How does 7B become 112 GB?
Every number in the memory formula comes from one fact: different data types use different numbers of bytes per value. A "parameter" is just a floating-point number. How many bytes it takes depends on its precision.
fp16 = "16-bit float". Each number is stored in 16 binary digits = 2 bytes. It has less precision than fp32 but is fast on modern GPU Tensor Cores and uses half the memory.
7B parameters × 2 bytes = 14,000,000,000 bytes ≈ 14 GB
Same for gradients: each gradient has the same shape as its parameter, so also 2Ψ bytes.
fp32 = "32-bit float" (standard Python float). The Adam optimizer must use fp32 for numerical stability — the tiny gradient corrections would round to zero in fp16. So even though the model runs in fp16, the optimizer bookkeeping is in fp32.
7B × 4 bytes = 28 GB per Adam buffer
Ⓐ fp32 master copy — the high-precision "true" weights. fp16 params are a lossy rounded snapshot of these.
Ⓑ momentum mₜ — running average of past gradients (where am I moving?).
Ⓒ variance vₜ — running average of squared gradients (how noisy is my gradient?).
3 buffers × 4 bytes × 7B = 84 GB
This is why optimizer states dominate memory: 84 GB out of 112 GB total.
The Memory Math — Where the Numbers Come From (N=4, 7B model)
Each GPU is "responsible" for a 1/N slice of the parameters. That means it must hold the optimizer states (fp32 master weight + momentum + variance) only for its slice. This is where the savings come from.
All three ZeRO stages shard the optimizer states. ZeRO-1 is the minimum that does this. GPU 0 stores the fp32 master weight, momentum, and variance for only its 1.75B parameter slice. That's 12 × 1.75B bytes = 21 GB — and it never grows. Stages 2 and 3 add gradient and parameter sharding on top, but the optimizer is already at 21 GB from ZeRO-1 onward.
Tensor Parallelism
Tensor Parallelism (TP) splits individual weight matrices across GPUs. Each GPU holds one slice of the weights, computes a partial result, then one All-Reduce combines the pieces. No pipeline waiting, no layer boundaries needed — it happens inside a single layer.
📖 Beginner’s guide: How are weights stored in a neural network?
To understand tensor parallelism you first need to see exactly what a weight matrix is in memory — and why it’s a rectangle of numbers.
Column j of W holds all the weights connecting every input to neuron j. There are d_out such columns — one per output neuron. The computation is simply:
output = activation( input × W ) where input is [1×d_in], W is [d_in×d_out], output is [1×d_out]
Memory cost: d_in × d_out floating-point values. A layer with 4096 inputs and 16384 hidden units: 4096×16384 = 67M values = 128 MB in fp16.
h = GeLU( x · W1 ) expands from d to 4d (e.g. 4096 → 16384)
y = h · W2 projects back from 4d to d (16384 → 4096)
For a 7B model (d=4096): W1 = [4096×16384] = 64M values = 128 MB. W2 = same. Both together = 256 MB — just for one layer’s FFN. A 32-layer model: 32 × 256 MB = 8 GB just for FFN weights.
For N-way TP with L layers: communication = 2 × L × 2 × b × s × d bytes per forward pass (2 messages per layer: one after Attention output projection, one after FFN W2). A 70B model with L=80, d=8192, s=2048, b=1: ~5.4 TB of data moved per forward pass with N=8. This is why TP requires NVLink (600 GB/s), not InfiniBand (200 GB/s). For cross-node TP, the communication cost dominates — keep TP within a single node.
Attention Tensor Parallelism
Multi-Head Attention is also split: attention heads are distributed across GPUs. With 32 heads and 8-way TP, each GPU handles 4 heads. The Q, K, V projection matrices and the output projection follow the same column/row-parallel pattern as FFN. An All-Reduce is needed only after the output projection — not between Q, K, V and the attention computation.
Pipeline Parallelism
Pipeline Parallelism (PP) assigns consecutive groups of layers to different GPUs (stages). The model's depth is split across devices; activations flow forward stage-by-stage, and gradients flow backward. Unlike TP, PP communication is only between adjacent stages, which makes it suitable for cross-node (InfiniBand) connections.
The Bubble Problem
In a naive pipeline with N stages, the pipeline fill phase wastes N-1 time slots. The drain phase (after the last stage finishes forward) wastes another N-1 slots. Bubble fraction = (N-1) / (N-1+M) where M is the number of micro-batches.
GPipe requires O(M×N) activation memory (holds all M micro-batch activations for all N stages until backward). 1F1B requires only O(N) activations because each stage immediately does its backward pass once the subsequent stage no longer needs its output. For N=8, M=8: GPipe needs 8× more activation memory than 1F1B.
Sequence Parallelism & Ring Attention
Standard TP leaves the LayerNorm and Dropout activations un-parallelized — they still require the full sequence on one GPU. Sequence Parallelism (SP) extends TP along the sequence dimension: these non-tensor-parallel operations are now split across GPUs along seq_len, completely removing per-GPU activation memory dependence on sequence length.
For very long contexts (128k+ tokens), even the attention computation itself is too large for one GPU. Ring Attention distributes the sequence across a ring of GPUs: each GPU holds a segment of Q, K, V, then passes K and V around the ring, computing its Q against every K, V shard it receives.
📖 Beginner’s guide: What does attention actually need? Why do long sequences break one GPU?
Ring Attention only makes sense once you see exactly what the attention operation requires. Three facts:
That alone exceeds an 80 GB GPU once you add weights and activations. One GPU physically cannot hold the K,V for a 128k-token sequence.
Megatron-LM Sequence Parallelism (SP)
A simpler form of SP used in Megatron. During the non-TP operations (LayerNorm, Dropout), the sequence is split using Reduce-Scatter and All-Gather to keep each GPU's memory proportional to s/N. This is combined with TP for a complete solution: no single GPU ever holds the full (batch × seq × hidden) activation tensor.
Expert Parallelism
Mixture-of-Experts (MoE) models replace one FFN per layer with many expert FFNs plus a router. Each token is processed by only 1–2 experts (sparse compute), but storing all experts on every GPU would multiply FFN memory by the expert count. Expert Parallelism (EP) solves this: place each expert on a different GPU, and physically ship tokens to wherever their assigned expert lives.
The router is a tiny learned linear layer: it scores each token against every expert and picks the top-k. Because tokens on GPU 0 may be routed to an expert on GPU 3, EP requires an All-to-All shuffle — every GPU simultaneously sends tokens to and receives tokens from every other GPU. This happens twice per MoE layer: once to dispatch tokens to experts, once to return results.
If the router sends most tokens to one popular expert, that GPU becomes a hotspot while others idle, and its receive buffer overflows (tokens beyond the capacity factor get dropped). Training adds an auxiliary load-balancing loss to keep expert assignment roughly uniform — the same problem and machinery covered in the MoE guide. DeepSeek-V3 instead uses auxiliary-loss-free balancing via per-expert bias terms.
Each MoE layer moves every token’s hidden vector across GPUs twice: volume ≈ 2 × b × s × d × k bytes per layer. All-to-All latency is the typical bottleneck for cross-node EP, which is why production systems (DeepSeek-V3, Mixtral inference) keep EP groups within a node or overlap All-to-All with compute. EP composes with the other axes: DeepSeek-V3 trains with EP × PP × DP × ZeRO simultaneously.
Context Parallelism
Context Parallelism (CP) is a newer variant introduced by Megatron-LM for extreme long-context training (128k–1M tokens). Unlike Sequence Parallelism (which integrates with TP and handles the non-attention ops), CP focuses specifically on distributing the K/V cache across GPUs during attention computation.
Each GPU holds a T/N segment of the full sequence. For self-attention, Q is local, but K and V must span the full sequence. CP uses an All-Gather of K and V at the start of each attention layer, distributes the computation, then uses Reduce-Scatter to aggregate attention outputs. Or — more efficiently — uses the Ring Attention pattern to avoid the large All-Gather entirely.
Without CP: KV cache for 128k tokens at d=8192 = 128k × 2 × 8192 × 2 bytes ≈ 4 GB per layer. With 80 layers: 320 GB just for KV cache. With CP N=8: 40 GB total, 5 GB per GPU per layer. Makes long-context training feasible on standard hardware.
SP integrates with TP and handles everything (LayerNorm, Dropout, Attention) along the sequence dimension. CP is specifically for the KV attention bottleneck. They can be combined: use SP for non-attention layers and CP/Ring Attention for the attention layers themselves.
3D Parallelism — Combining Everything
Production LLM training (GPT-4, Llama-3 70B, DeepSeek-V3) combines Data, Tensor, and Pipeline parallelism simultaneously. This is called 3D Parallelism [Narayanan et al., 2021]. The N_total GPUs are organized into a 3D grid: N_dp × N_tp × N_pp = N_total.
How to Choose N_dp, N_tp, N_pp
| Factor | Prefer higher TP when... | Prefer higher PP when... | Prefer higher DP when... |
|---|---|---|---|
| Communication bandwidth | High NVLink bandwidth within node | Low cross-node bandwidth okay | Moderate InfiniBand acceptable |
| Model type | Large FFN hidden dim (d_ff >> d) | Many layers (deep model) | Good hardware efficiency achieved |
| Batch size | Any | Large batch (more micro-batches → less bubble) | Large global batch target |
| Memory pressure | Single layer too large for 1 GPU | Too many layers for 1 GPU | Layers fit, need throughput |
Communication Primitives
Every parallelism strategy reduces to a small set of collective communication operations. Understanding them reveals what's happening at the hardware level and why certain parallelism strategies require high-bandwidth links.
Communication Cost Analysis
Selection Guide
Given a model and a GPU cluster, how do you choose your parallelism strategy? This is an engineering optimization with many trade-offs.
| Model Size | GPU Count | Recommended Strategy | Notes |
|---|---|---|---|
| 1B–7B | 1–8 (1 node) | ZeRO-2/3 or FSDP | Usually fits with ZeRO-2. No PP needed. TP if very memory tight. |
| 7B–70B | 8–64 (1–8 nodes) | ZeRO-3 + TP(2–4) | TP within nodes. ZeRO-3 across nodes. PP if cross-node bw limited. |
| 70B–500B | 64–512 | TP(8) + PP(4–8) + DP | Megatron-style 3D parallelism. PP across nodes. TP within node. |
| 500B+ (MoE) | 512+ | TP + PP + DP + EP | Add Expert Parallelism. 4D or 5D. DeepSeek-V3: TP16 + PP16 + DP4 + EP8. |
| Long context (128k+) | Any | Add SP or CP | Sequence/Context parallelism orthogonal to TP/PP/DP. Combine as needed. |
Start with ZeRO-3/FSDP — it requires no architectural changes and handles most model sizes up to ~70B on 64 GPUs. Add TP when a single layer is too large for one GPU (before or after ZeRO). Add PP when TP is maxed out within a node and you need to span nodes. Add SP/CP when sequence length exceeds GPU memory. Add EP only for MoE architectures. The general principle: use the parallelism with the cheapest communication cost first, and layer in expensive communication only when necessary.
References
- Rajbhandari, S. et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC 2020. arXiv:1910.02054
- Shoeybi, M. et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053
- Huang, Y. et al. (2019). GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. NeurIPS 2019. arXiv:1811.06965
- Narayanan, D. et al. (2021). Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. SC 2021. arXiv:2104.04473
- Narayanan, D. et al. (2019). PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP 2019. arXiv:1806.03377
- Korthikanti, V. et al. (2022). Reducing Activation Recomputation in Large Transformer Models (Sequence Parallelism). MLSys 2023. arXiv:2205.05198
- Liu, H. et al. (2023). Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv:2310.01889
- Brandon, W. et al. (2023). Striped Attention: Faster Ring Attention for Causal Transformers. arXiv:2311.09431
- Jacobs, S.A. et al. (2023). DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. arXiv:2309.14509
- DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437 (5D parallelism)