Infrastructure Deep Dive

Parallelism in LLM Training

Every form of parallelism used to train large language models — data, tensor, pipeline, sequence, expert, context, and 3D — explained with animated diagrams, memory formulas, and the trade-offs that determine which to use.

💻 10 parallelism types 🎬 10 animated SVGs 🧮 Memory formulas 📡 Communication analysis
00

Why Parallelism — The Scale Problem

A single A100 GPU has 80 GB of HBM memory. A 70B parameter model in BF16 needs ~140 GB just for weights — before gradients, optimizer states, or activations. Training GPT-4 scale models requires petaflops sustained across thousands of GPUs. Parallelism is not an optimization; it is a prerequisite.

The three bottlenecks that parallelism addresses:

💾 Memory wall

Model weights + gradients + optimizer states + activations exceed what a single device holds. Tensor and pipeline parallelism split the model itself across devices; ZeRO/FSDP shards the training state across data-parallel ranks.

⚡ Compute ceiling

Even if the model fits in memory, training on a small batch is slow. Data parallelism multiplies throughput by processing more data simultaneously across many devices.

📡 Communication cost

Every parallelism strategy requires synchronizing information between devices. The bandwidth and latency of inter-device links (NVLink, InfiniBand, PCIe) determines which parallelism strategies are practical at what scale.

The Complete Map

| Parallelism type | What it splits | Memory saving | Comm. pattern | Primary use case |
|---|---|---|---|---|
| Data Parallel (DP) | Training data (mini-batches) | None (replicates model) | All-Reduce gradients | Batch size scaling; baseline throughput |
| ZeRO-1 | Optimizer states | Up to 4× total (optimizer states ÷ N) | All-Gather + Reduce-Scatter | Reduce optimizer memory redundancy |
| ZeRO-2 | + Gradients | Up to 8× total (grads + optimizer states ÷ N) | All-Gather + Reduce-Scatter | Large models on DP nodes |
| ZeRO-3 / FSDP | + Parameters | N× total (e.g. 64× at N=64) | All-Gather params (forward and backward) + Reduce-Scatter grads | Largest models; 70B+ on commodity |
| Tensor Parallel (TP) | Individual layer weight matrices | 1/N params per device | All-Reduce per layer | FFN and Attention within a node |
| Pipeline Parallel (PP) | Layers across stages | 1/N layers per device | Point-to-point activations | Very deep models; across nodes |
| Sequence Parallel (SP) | Sequence dimension | 1/N activation memory | All-Gather + Reduce-Scatter | Long context; activation memory |
| Context Parallel (CP) | Sequence (K,V distributed) | 1/N KV cache | All-Gather K,V, or ring P2P (Ring Attention) | Extremely long sequences (100k+) |
| Expert Parallel (EP) | MoE experts across devices | 1/N expert params | All-to-All token dispatch | MoE models; expert capacity |
| 3D Parallel | DP × TP × PP combined | Multiplicative | All three patterns | Production frontier model training |
01

Data Parallelism

The simplest parallelism strategy: replicate the full model on every GPU, partition the mini-batch across GPUs, synchronize gradients after each backward pass. No architectural changes needed. The communication pattern is a single All-Reduce of gradients at the end of backward.

\theta \leftarrow \theta - \eta \cdot \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \mathcal{L}(x_i, \theta)
Each GPU computes gradients on its shard of data. All-Reduce averages them. Equivalent to training on the full batch.
Figure 1 — Data Parallelism: Replicated Model, Partitioned Batch
[Figure 1 panels: ① the dataset of 32 samples is split into 4 equal shards, one per GPU (samples 1–8, 9–16, 17–24, 25–32); ② each GPU runs a full forward + backward pass through its identical copy of the model and computes its own gradient ∇θᵢ; ③ All-Reduce averages the gradients: ∇θ_avg = (∇θ₀ + ∇θ₁ + ∇θ₂ + ∇θ₃) / 4; ④ every GPU applies θ ← θ − η·∇θ_avg, so all 4 GPUs hold the same updated weights and are ready for the next batch.]
Each GPU holds a complete model copy and processes its own data shard. All-Reduce has no central node: the GPUs exchange gradients among themselves, each ends up with the same average, and each applies the same update locally. Memory cost: every GPU stores the full model state (for a 7B model, ~14 GB weights + 14 GB gradients + 84 GB optimizer states = 112 GB per GPU), so there are zero memory savings. Throughput: up to N× more samples per step, minus the cost of gradient synchronization.
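
The equivalence claimed above (averaged shard gradients equal the full-batch gradient) can be checked on a toy problem. The sketch below is illustrative only: it uses a one-parameter model, plain Python lists instead of GPUs, and a hypothetical `all_reduce_mean` helper standing in for a real All-Reduce. It assumes the batch divides evenly across GPUs.

```python
# Toy check: averaging per-shard gradients equals the full-batch gradient.
# Model: y_hat = w * x with mean-squared-error loss. All names are illustrative.

def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for All-Reduce: every rank ends up with the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def data_parallel_step(w, xs, ys, num_gpus, lr):
    shard = len(xs) // num_gpus  # assumes the batch divides evenly
    local_grads = [
        grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(num_gpus)
    ]
    synced = all_reduce_mean(local_grads)
    # Every replica applies the same update, so the weights stay identical.
    return [w - lr * g for g in synced]

xs = [float(i) for i in range(1, 33)]   # 32 samples, 8 per GPU as in Figure 1
ys = [3.0 * x + 1.0 for x in xs]
w0 = 0.5
replicas = data_parallel_step(w0, xs, ys, num_gpus=4, lr=1e-4)
single = w0 - 1e-4 * grad_mse(w0, xs, ys)   # one GPU, full batch
```

All four entries of `replicas` match `single` up to floating-point rounding, which is the sense in which data parallelism is equivalent to training on the full batch.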

DDP vs DP in PyTorch

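DataParallel (DP) uses a single process with one thread per GPU. It follows a parameter-server-like pattern: GPU 0 scatters the inputs, re-broadcasts the model on every forward pass, gathers the outputs, and accumulates the gradients, so GPU 0 becomes a bottleneck. PyTorch's documentation recommends DDP instead.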

DistributedDataParallel (DDP) uses one process per GPU with NCCL all-reduce. All-reduce is symmetric — no parameter server bottleneck. Standard for data parallelism today. Gradient synchronization is overlapped with backward computation using gradient bucketing.

📊 Memory Cost of Pure Data Parallelism

For a 7B model in mixed precision: weights ~14 GB, gradients ~14 GB, Adam optimizer states (fp32 weights + m + v) ~84 GB. Total per GPU: ~112 GB. Every GPU holds all of this identically. No memory reduction at all. ZeRO was invented to fix exactly this.

02

ZeRO — Zero Redundancy Optimizer

ZeRO keeps the data-parallel training pattern but eliminates the massive memory redundancy. In standard DDP every GPU stores a full identical copy of everything — parameters, gradients, and optimizer states. ZeRO asks: why do all 4 GPUs hold the same copy of the momentum buffer? They don't have to.

What Is Sharding?

Sharding means partitioning a tensor across N GPUs so each GPU holds exactly 1/N of it. No GPU can use its shard alone — but together they hold the complete tensor without redundancy.
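
As a minimal sketch of the idea, the snippet below shards a flat list of "parameters" across ranks and then reconstructs the full tensor with a simulated All-Gather. The helper names `shard` and `all_gather` are hypothetical, and the zero-padding of the last shard is an assumption made so that all shards have equal length.

```python
def shard(flat, num_gpus):
    """Split a flat list into num_gpus equal contiguous shards (zero-padded)."""
    per_gpu = -(-len(flat) // num_gpus)   # ceiling division
    padded = flat + [0.0] * (per_gpu * num_gpus - len(flat))
    return [padded[r * per_gpu:(r + 1) * per_gpu] for r in range(num_gpus)]

def all_gather(shards, original_len):
    """Every rank receives the concatenation of all shards (padding dropped)."""
    full = [v for s in shards for v in s][:original_len]
    return [list(full) for _ in shards]

params = [float(i) for i in range(10)]   # 10 toy "parameters"
shards = shard(params, 4)                # each rank stores about 1/4 of them
gathered = all_gather(shards, len(params))
```

Each rank stores only its own entry of `shards` at rest; the full list exists on a rank only transiently, after the gather.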

Concept — Sharding a 7B parameter model across 4 GPUs
[Figure panels: with DDP (no sharding), each of the 4 GPUs stores all 7B parameters, so 4 × 7B = 28B parameter copies are stored in total (4× redundancy). With ZeRO-3 (fully sharded), the 7B parameters are sliced into 4 equal chunks and GPU i owns chunk i: params[0…1.75B], params[1.75B…3.5B], params[3.5B…5.25B], params[5.25B…7B], for 4 × 1.75B = 7B stored in total (no redundancy). The forward pass still needs all parameters, which is solved by an All-Gather before each layer: every GPU contributes its shard, each GPU temporarily reconstructs the full layer weights, runs the layer, then frees the gathered copy.]

Memory Math — Where Every Byte Comes From

📖  Beginner guide: Why 2Ψ for parameters? Why 12Ψ for optimizer states? How does 7B become 112 GB?

Every number in the memory formula comes from one fact: different data types use different numbers of bytes per value. A "parameter" is just a floating-point number. How many bytes it takes depends on its precision.

fp16
16 bits = 2 bytes per value
Ψ values × 2 bytes = 2Ψ
Used for: model parameters (weights) and gradients during the forward + backward pass.
fp16 = "16-bit float". Each number is stored in 16 binary digits = 2 bytes. It has less precision than fp32 but is fast on modern GPU Tensor Cores and uses half the memory.
7B parameters × 2 bytes = 14,000,000,000 bytes ≈ 14 GB
Same for gradients: each gradient has the same shape as its parameter, so also 2Ψ bytes.
fp32
32 bits = 4 bytes per value
Ψ values × 4 bytes = 4Ψ
Used for: the Adam optimizer's three internal buffers.
fp32 = "32-bit float" (single precision; note that Python's built-in float is 64-bit). The Adam optimizer state is kept in fp32 for numerical stability: small weight updates can be lost to rounding in fp16. So even though the model runs in fp16, the optimizer bookkeeping is in fp32.
7B × 4 bytes = 28 GB per Adam buffer
Adam = 3 buffers
fp32 copy + m + v
3 × 4Ψ = 12Ψ
Adam keeps three separate fp32 buffers, each the same size as the parameters:
Ⓐ fp32 master copy — the high-precision "true" weights. fp16 params are a lossy rounded snapshot of these.
Ⓑ momentum mₜ — running average of past gradients (where am I moving?).
Ⓒ variance vₜ — running average of squared gradients (how noisy is my gradient?).
3 buffers × 4 bytes × 7B = 84 GB
This is why optimizer states dominate memory: 84 GB out of 112 GB total.
🧮 How “7B parameters” becomes “112 GB” — step by step
Count every byte:
fp16 params:   7×10⁹ × 2 = 14×10⁹ bytes
fp16 grads:    7×10⁹ × 2 = 14×10⁹ bytes
fp32 copy:     7×10⁹ × 4 = 28×10⁹ bytes
fp32 momentum: 7×10⁹ × 4 = 28×10⁹ bytes
fp32 variance: 7×10⁹ × 4 = 28×10⁹ bytes
Total = 112×10⁹ bytes
Convert to GB (using 1 GB = 10⁹ bytes): 112×10⁹ bytes ÷ 10⁹ = 112 GB

Note: memory sizes are often reported in binary units (1 GiB = 2³⁰ ≈ 1.074×10⁹ bytes), in which case the precise figure is 104.3 GiB. ML practitioners commonly use the round “10⁹ bytes = 1 GB” approximation, which gives the clean 112 GB number.
M_{\text{total}} = \underbrace{2\Psi}_{\text{fp16 params}} + \underbrace{2\Psi}_{\text{fp16 grads}} + \underbrace{4\Psi + 4\Psi + 4\Psi}_{\text{fp32: master copy + momentum + variance}} = 16\Psi \text{ bytes}
Total memory per GPU in mixed-precision training with Adam. For a 7B model: 16 × 7×10⁹ = 112×10⁹ bytes = 112 GB.
M_{\text{ZeRO-1}} = 2\Psi + 2\Psi + \frac{12\Psi}{N} \qquad M_{\text{ZeRO-2}} = 2\Psi + \frac{2\Psi}{N} + \frac{12\Psi}{N} \qquad M_{\text{ZeRO-3}} = \frac{16\Psi}{N}
N = number of GPUs. For 7B model, N=4: ZeRO-1 = 14+14+21 = 49 GB. ZeRO-2 = 14+3.5+21 = 38.5 GB. ZeRO-3 = 16×7×10⁹/4 = 28 GB (exactly 4× less than baseline).
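
The three formulas can be folded into one small calculator. This is a sketch under the same assumptions as the text (mixed-precision Adam, 2Ψ + 2Ψ + 12Ψ bytes, 1 GB = 10⁹ bytes, activations and temporary buffers ignored); the function name `zero_memory_gb` is illustrative.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Per-GPU memory in GB for params + grads + optimizer states.

    stage 0 = plain DDP, 1 = ZeRO-1, 2 = ZeRO-2, 3 = ZeRO-3 / FSDP.
    Activations and communication buffers are not counted.
    """
    params = 2 * num_params    # fp16 weights
    grads = 2 * num_params     # fp16 gradients
    optim = 12 * num_params    # fp32 master copy + momentum + variance
    if stage >= 1:
        optim /= num_gpus      # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus      # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= num_gpus     # ZeRO-3 also shards parameters
    return (params + grads + optim) / 1e9

for stage in range(4):
    print(f"ZeRO-{stage}: {zero_memory_gb(7e9, 4, stage):.1f} GB per GPU")
```

Changing `num_gpus` shows how ZeRO-1 and ZeRO-2 level off (they approach 4Ψ and 2Ψ bytes respectively) while ZeRO-3 keeps shrinking in proportion to N.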

The Memory Math — Where the Numbers Come From (N=4, 7B model)

Each GPU is "responsible" for a 1/N slice of the parameters. That means it must hold the optimizer states (fp32 master weight + momentum + variance) only for its slice. This is where the savings come from.

Memory breakdown per GPU — 7B model (Ψ = 7B params), N = 4 GPUs
| Stage | fp16 params (2Ψ) | fp16 grads (2Ψ) | Optimizer states (12Ψ: fp32 copy + momentum + variance) | Total / GPU |
|---|---|---|---|---|
| ZeRO-0 (DDP), nothing sharded | 14 GB (2Ψ = 2 × 7B bytes) | 14 GB (2Ψ = 2 × 7B bytes) | 84 GB (4Ψ + 4Ψ + 4Ψ = 28 + 28 + 28 GB) | 112 GB (= 16Ψ) |
| ZeRO-1, optimizer states ÷ N | 14 GB (full, needed for forward) | 14 GB (full, needed for backward) | 21 GB ✂ (84 GB ÷ 4 GPUs) | 49 GB = 14 + 14 + 21 (2.3× saving) |
| ZeRO-2, optimizer states + grads ÷ N | 14 GB (full, needed for forward) | 3.5 GB ✂ (14 GB ÷ 4 GPUs) | 21 GB ✂ (84 GB ÷ 4 GPUs) | 38.5 GB = 14 + 3.5 + 21 (2.9× saving) |
| ZeRO-3 / FSDP, everything ÷ N | 3.5 GB ✂ (14 GB ÷ 4) | 3.5 GB ✂ (14 GB ÷ 4) | 21 GB ✂ (84 GB ÷ 4) | 28 GB = 3.5 + 3.5 + 21 (4× saving) |

✂ = sharded: GPU i holds this state only for params[i×1.75B … (i+1)×1.75B]. For the optimizer states that is 12Ψ/N = 84/4 = 21 GB in every ZeRO stage.
⚠ Why optimizer states are always 21 GB/GPU (not 84 GB)

All three ZeRO stages shard the optimizer states. ZeRO-1 is the minimum that does this. GPU 0 stores the fp32 master weight, momentum, and variance for only its 1.75B parameter slice. That's 12 × 1.75B bytes = 21 GB — and it never grows. Stages 2 and 3 add gradient and parameter sharding on top, but the optimizer is already at 21 GB from ZeRO-1 onward.

Figure 2 — Interactive: Memory per GPU at Each ZeRO Stage
[Interactive figure, initial state: GPU 0 through GPU 3 each show the same stacked bar of fp16 params (14 GB), fp16 grads (14 GB), fp32 param copy (28 GB), fp32 momentum (28 GB), and fp32 variance (28 GB), for 112 GB per GPU, fully duplicated. Stage shown: ZeRO-0 (DDP), no sharding, every GPU holds all 112 GB identically. Communication: All-Reduce of gradients each step.]
Toggle between stages. The bars physically shrink as data is sharded away. Note that optimizer states (the 3 largest bars) shrink first in ZeRO-1, then gradients in ZeRO-2, then parameters in ZeRO-3. The optimizer state bars always represent the shard for that GPU's 1.75B parameters — that's why they're 21 GB and not 84 GB from ZeRO-1 onward.
03

Tensor Parallelism

Tensor Parallelism (TP) splits individual weight matrices across GPUs. Each GPU holds one slice of the weights, computes a partial result, then one All-Reduce combines the pieces. No pipeline waiting, no layer boundaries needed — it happens inside a single layer.

📖  Beginner’s guide: How are weights stored in a neural network?

To understand tensor parallelism you first need to see exactly what a weight matrix is in memory — and why it’s a rectangle of numbers.

One neuron = one column
Each column = all the weights feeding into that neuron
d_in rows × d_out cols
A fully-connected layer with d_in input features and d_out output neurons is exactly one matrix W of shape [d_in × d_out].
Column j of W holds all the weights connecting every input to neuron j. There are d_out such columns — one per output neuron. The computation is simply:
output = activation( input × W )  where input is [1×d_in], W is [d_in×d_out], output is [1×d_out]
Memory cost: d_in × d_out floating-point values. A layer with 4096 inputs and 16384 hidden units: 4096×16384 ≈ 67M values ≈ 134 MB in fp16 (128 MiB).
A Transformer FFN = 2 matmuls
Expand → activate → project back
W1: [d×4d]  W2: [4d×d]
A standard Transformer FFN block has two weight matrices:
h = GeLU( x · W1 )  expands from d to 4d (e.g. 4096 → 16384)
y = h · W2  projects back from 4d to d (16384 → 4096)
For a 7B model (d=4096): W1 = [4096×16384] ≈ 67M values ≈ 134 MB in fp16 (128 MiB). W2 is the same size. Both together ≈ 268 MB (256 MiB), just for one layer’s FFN. A 32-layer model: 32 × 268 MB ≈ 8.6 GB (8 GiB) just for FFN weights.
🧮 Why matrix multiplication is parallelizable
Output neuron j uses only column j of W, so it is completely independent of all other output neurons. With 2 GPUs and d_out columns, you can give GPU 0 columns [0..d_out/2) and GPU 1 columns [d_out/2..d_out), and they compute their neurons in parallel with zero dependency between them. If the next matrix is split by rows to match, each GPU keeps working on its own half of the hidden vector, and the only synchronization needed is summing the partial outputs at the end (one All-Reduce). This is exactly what Tensor Parallelism exploits.
W_1 \in \mathbb{R}^{d \times 4d},\quad h = \text{GeLU}(x\,W_1) \in \mathbb{R}^{4d},\quad y = h\,W_2 \in \mathbb{R}^{d}
One FFN forward pass. x is one token’s embedding [1×d]. W1 expands it to [1×4d]. W2 projects it back to [1×d]. Tensor parallelism splits W1 column-wise and W2 row-wise.
Figure 3 — Fixed MLP (d=4, d_ff=8): Without vs With Tensor Parallelism (N=2 GPUs)
A concrete 2-layer MLP: input d=4, hidden d_ff=8, output d=4. Left: one GPU, all 64 weights, sequential. Right: 2-GPU TP — W1 column-split, W2 row-split, both GPUs compute in parallel, one All-Reduce combines partial outputs. Each GPU holds 32 weights (half the memory).
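
The column-split / row-split scheme can be verified numerically on the same toy sizes (d=4, d_ff=8, 2 GPUs). This is a pure-Python sketch with made-up weights, not a real multi-GPU implementation: the "GPUs" are loop iterations and the All-Reduce is an ordinary sum. It assumes d_ff divides evenly by the number of GPUs.

```python
import math

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def matmul(x, w):
    """x: [n], w: [n][m] -> [m]"""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def mlp(x, w1, w2):
    """Single-GPU reference: y = GeLU(x W1) W2."""
    return matmul([gelu(h) for h in matmul(x, w1)], w2)

def tensor_parallel_mlp(x, w1, w2, num_gpus):
    cols = len(w1[0]) // num_gpus                 # assumes d_ff divides evenly
    partials = []
    for r in range(num_gpus):                     # each iteration plays one GPU
        lo, hi = r * cols, (r + 1) * cols
        w1_shard = [row[lo:hi] for row in w1]     # column slice of W1
        w2_shard = w2[lo:hi]                      # matching row slice of W2
        h_local = [gelu(h) for h in matmul(x, w1_shard)]   # no communication here
        partials.append(matmul(h_local, w2_shard))         # partial output of size d
    # All-Reduce (sum): every GPU would end up holding this same vector.
    return [sum(p[j] for p in partials) for j in range(len(partials[0]))]

d, d_ff = 4, 8
w1 = [[0.1 * (i + 1) - 0.05 * j for j in range(d_ff)] for i in range(d)]
w2 = [[0.02 * (i - j) for j in range(d)] for i in range(d_ff)]
x = [1.0, -2.0, 0.5, 3.0]

reference = mlp(x, w1, w2)
parallel = tensor_parallel_mlp(x, w1, w2, num_gpus=2)
```

`parallel` matches `reference` up to floating-point rounding. The key point is that GeLU is applied element-wise, so it can run on each column shard independently; only the final sum crosses GPU boundaries.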
📊 TP Communication Formula

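For N-way TP with L layers, the All-Reduce payload is 2 × L × 2 × b × s × d bytes per forward pass in fp16 (2 All-Reduces per layer: one after the Attention output projection, one after FFN W2; 2 bytes per element). A 70B model with L=80, d=8192, s=2048, b=1: ~5.4 GB of activations to All-Reduce per forward pass, and the backward pass adds a similar amount. These All-Reduces sit on the critical path of every layer, which is why TP is run over NVLink (600 GB/s per GPU on A100) rather than InfiniBand (200 Gb/s, about 25 GB/s, per HDR link). For cross-node TP, the communication cost dominates, so keep TP within a single node.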
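
As a quick check on the formula, the sketch below evaluates it for the example values (L=80, d=8192, s=2048, b=1, 2 bytes per element). The function names are illustrative, and the per-GPU traffic estimate assumes a ring All-Reduce, which moves 2(N−1)/N times the payload.

```python
def tp_allreduce_bytes(layers, batch, seq_len, hidden,
                       bytes_per_elem=2, allreduces_per_layer=2):
    """Total All-Reduce payload for one forward pass under tensor parallelism."""
    return allreduces_per_layer * layers * batch * seq_len * hidden * bytes_per_elem

def ring_traffic_per_gpu(payload_bytes, num_gpus):
    """Bytes each GPU sends in a ring All-Reduce of the given payload."""
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes

payload = tp_allreduce_bytes(layers=80, batch=1, seq_len=2048, hidden=8192)
per_gpu = ring_traffic_per_gpu(payload, num_gpus=8)
print(f"payload: {payload / 1e9:.2f} GB, sent per GPU: {per_gpu / 1e9:.2f} GB")
```

The payload scales linearly with batch size and sequence length, so realistic training batches move far more data per step than this b=1 example.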

Attention Tensor Parallelism

Multi-Head Attention is also split: attention heads are distributed across GPUs. With 32 heads and 8-way TP, each GPU handles 4 heads. The Q, K, V projection matrices and the output projection follow the same column/row-parallel pattern as FFN. An All-Reduce is needed only after the output projection — not between Q, K, V and the attention computation.

04

Pipeline Parallelism

Pipeline Parallelism (PP) assigns consecutive groups of layers to different GPUs (stages). The model's depth is split across devices; activations flow forward stage-by-stage, and gradients flow backward. Unlike TP, PP communication is only between adjacent stages, which makes it suitable for cross-node (InfiniBand) connections.

The Idea — Take One Deep Model, Cut It Into Consecutive Stages
[Figure panels: the original model has 32 transformer layers (input → layer 0 → layer 1 → … → layer 31 → output) and normally lives on one device (e.g. a 7B model: too big and too slow for one GPU). It is cut into four consecutive stages: GPU 0 = Stage 0 (embedding + layers 0–7), GPU 1 = Stage 1 (layers 8–15), GPU 2 = Stage 2 (layers 16–23), GPU 3 = Stage 3 (layers 24–31 + LM head), each holding about 1/4 of the model parameters. Each GPU only stores and computes its own 8 layers. Activations hop GPU→GPU at the 3 stage boundaries per forward pass, which is tiny compared with Tensor Parallelism and works fine over InfiniBand across nodes.]
Figure 4 — Pipeline Parallelism on a 4-Layer MLP: Naive vs GPipe vs 1F1B
Naive: the whole batch flows through GPU 0 → 1 → 2 → 3 (forward, blue), then gradients flow back 3 → 2 → 1 → 0 (backward, red). At any moment only ONE GPU works — the other three idle. Watch the Gantt chart at the bottom fill in: the grey area is the bubble (wasted compute).

The Bubble Problem

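In a pipeline with N stages and M micro-batches, each stage sits idle for N-1 time slots while the pipeline fills during the forward phase, and for another N-1 slots while it drains during the backward phase. Out of 2(N-1+M) slots per stage, 2(N-1) are wasted, so bubble fraction = (N-1) / (N-1+M). The naive pipeline is the case M=1, where the bubble fraction is (N-1)/N.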

\text{Bubble fraction} = \frac{N-1}{N-1+M} \xrightarrow{M\gg N} 0
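Make M >> N (many micro-batches per batch) to shrink the bubble. GPipe runs all M forward passes before any backward pass. 1F1B has the same bubble fraction for the same M, but each stage holds activations for at most N in-flight micro-batches instead of all M.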
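
The bubble formula can be reproduced by building the timeline explicitly and counting idle slots. The sketch below models a GPipe-style schedule (all forwards, then all backwards) under the simplifying assumption that forward and backward each take one time slot per micro-batch per stage; the function names are illustrative.

```python
def gpipe_schedule(num_stages, num_microbatches):
    """Return a grid[stage][slot] of 'F', 'B' or '.' (idle)."""
    total_slots = 2 * (num_stages + num_microbatches - 1)
    grid = [["."] * total_slots for _ in range(num_stages)]
    backward_start = num_stages + num_microbatches - 1
    for s in range(num_stages):
        for m in range(num_microbatches):
            # Forward of micro-batch m reaches stage s after s hops.
            grid[s][s + m] = "F"
            # Backward starts at the last stage and flows toward stage 0.
            grid[s][backward_start + (num_stages - 1 - s) + m] = "B"
    return grid

def bubble_fraction(grid):
    idle = sum(row.count(".") for row in grid)
    return idle / (len(grid) * len(grid[0]))

grid = gpipe_schedule(num_stages=4, num_microbatches=4)
for row in grid:
    print("".join(row))
print(bubble_fraction(grid), 3 / (3 + 4))   # simulated vs (N-1)/(N-1+M)
```

Raising `num_microbatches` to 32 drops the idle share to 3/35, which is the M >> N regime the formula describes.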
📊 1F1B Memory Advantage

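GPipe holds the activations of all M micro-batches on every stage until the backward phase starts, so activation memory per stage grows as O(M). In 1F1B, after a short warm-up each stage alternates one forward and one backward pass and frees a micro-batch's activations as soon as its backward pass completes, so a stage holds at most N micro-batches in flight: O(N), independent of M. The advantage appears when M > N. For N=8, M=64: GPipe holds 64 micro-batches of activations per stage, while 1F1B holds at most 8, an 8× reduction.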
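
A small helper makes the comparison concrete. It assumes the non-interleaved 1F1B schedule, in which stage s (0-indexed) runs N−s−1 warm-up forwards before alternating forward and backward, so its peak is N−s micro-batches in flight; the function name and the schedule labels are illustrative.

```python
def peak_in_flight(num_stages, num_microbatches, schedule):
    """Peak number of micro-batches whose activations each stage must hold."""
    if schedule == "gpipe":
        # All forwards finish before any backward: every stage holds all M.
        return [num_microbatches] * num_stages
    if schedule == "1f1b":
        # Stage s holds at most N - s micro-batches (never more than M).
        return [min(num_microbatches, num_stages - s) for s in range(num_stages)]
    raise ValueError(f"unknown schedule: {schedule}")

print(peak_in_flight(8, 64, "gpipe"))
print(peak_in_flight(8, 64, "1f1b"))
```

The first stage is the worst case for 1F1B; later stages hold fewer micro-batches, which is why memory is often unevenly distributed across pipeline stages.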

05

Sequence Parallelism & Ring Attention

Standard TP leaves the LayerNorm and Dropout activations un-parallelized: every GPU still holds them for the full sequence. Sequence Parallelism (SP) extends TP along the sequence dimension: these non-tensor-parallel operations are now split across GPUs along seq_len, so their per-GPU activation memory scales as s/N instead of s.

For very long contexts (128k+ tokens), even the attention computation itself is too large for one GPU. Ring Attention distributes the sequence across a ring of GPUs: each GPU holds a segment of Q, K, V, then passes K and V around the ring, computing its Q against every K, V shard it receives.

📖  Beginner’s guide: What does attention actually need? Why do long sequences break one GPU?

Ring Attention only makes sense once you see exactly what the attention operation requires. Three facts:

Fact 1: every token looks at every token
Q·Kᵀ is all-pairs
T queries × T keys
Each token produces a query Q, a key K, and a value V. Token 9’s output is a weighted mix of all tokens’ values, with weights = Q₉·Kⱼ for every j. So to compute attention for token 9, you need K and V of all T tokens available. There is no way around this — it’s what attention is.
Fact 2: memory grows with sequence length
K,V storage ∝ T
128k tokens ≈ 64 GB of K,V
For a 7B model (d=4096, 32 layers, fp16): storing K and V for the full sequence costs 2 × T × d × layers × 2 bytes = 2×128k×4096×32×2 ≈ 64 GB
Add the model weights and the other activations, and the total exceeds an 80 GB GPU. One GPU cannot hold everything needed for a 128k-token sequence.
Fact 3: naive splitting breaks attention
each GPU sees only its slice
12 of 16 pairs missing
Natural idea: give each of 4 GPUs a quarter of the tokens. GPU 0 gets tokens 1–4, GPU 1 gets 5–8, etc. Each computes Q,K,V for its own tokens. But now GPU 0’s queries can only see GPU 0’s keys — tokens 1–4 attend to tokens 1–4 and are blind to tokens 5–16. In the 4×4 coverage grid (which Q-chunk has seen which K,V-chunk), only the diagonal is filled. The other 12 cells are missing. That is NOT attention.
💡 The Ring Attention trick
Don’t gather all K,V to every GPU (that would need full memory again). Instead: keep Q parked, and pass the K,V chunks around a ring, one hop at a time. Each GPU computes attention between its (fixed) Q and whatever K,V chunk is currently visiting, accumulates the partial result, then forwards the chunk to its neighbor. After N−1 = 3 rotations, every Q has met every K,V — the full 4×4 grid is covered — yet each GPU only ever held 2 chunks at a time (its own + one visitor). Watch it happen below.
Figure 5 — Without vs With Ring Attention: K,V Chunks Rotate, Coverage Grid Fills
Top: without sequence parallelism, one GPU must hold K,V for all 16 tokens — at 128k tokens this exceeds GPU capacity (red bar past the yellow capacity line). Bottom: with Ring Attention, each GPU keeps only its 4 Q tokens (parked, never move) plus ONE visiting K,V chunk. During rotation, slots show "…in transit…" and the colored chunks visibly travel to the next GPU (GPU 3's chunk wraps around the bottom path back to GPU 0). The coverage grid fills one diagonal per step — after 4 steps, every Q has seen every K,V: full attention at O(T/N) memory.
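
The rotation scheme can be simulated in plain Python to confirm that it reproduces full attention. This is a sketch, not the algorithm from any particular library: it is non-causal, single-head, and uses the standard online-softmax trick (running max, running denominator, running numerator) so that partial results from different K,V chunks can be merged. The "GPUs" are list indices and all names are illustrative.

```python
import math

def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v for lists of vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out

def ring_attention(q_chunks, k_chunks, v_chunks):
    n = len(q_chunks)
    d = len(q_chunks[0][0])
    dv = len(v_chunks[0][0])
    # Per local query: (running max, running denominator, running numerator).
    state = [[(-math.inf, 0.0, [0.0] * dv) for _ in qc] for qc in q_chunks]
    visiting = list(zip(k_chunks, v_chunks))   # K,V chunk currently on each GPU
    for _ in range(n):                         # n compute steps
        for g in range(n):
            k_blk, v_blk = visiting[g]
            for i, qi in enumerate(q_chunks[g]):
                m_old, z_old, acc_old = state[g][i]
                scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                          for kj in k_blk]
                m_new = max(m_old, max(scores))
                scale = math.exp(m_old - m_new)   # rescale earlier partial sums
                w = [math.exp(s - m_new) for s in scores]
                z_new = z_old * scale + sum(w)
                acc_new = [acc_old[c] * scale
                           + sum(wj * vj[c] for wj, vj in zip(w, v_blk))
                           for c in range(dv)]
                state[g][i] = (m_new, z_new, acc_new)
        # Each GPU hands its visiting chunk to the next GPU in the ring
        # (the hop after the last compute step is redundant but harmless).
        visiting = [visiting[(g - 1) % n] for g in range(n)]
    return [[[a / z for a in acc] for (_, z, acc) in gpu] for gpu in state]

T, n = 16, 4                                   # 16 tokens on 4 GPUs, as in Figure 5
q = [[math.sin(t), math.cos(2 * t)] for t in range(T)]
k = [[math.cos(t), math.sin(3 * t)] for t in range(T)]
v = [[float(t), math.sin(t) ** 2] for t in range(T)]

def chunk(x):
    return [x[g * T // n:(g + 1) * T // n] for g in range(n)]

ring_out = [row for gpu in ring_attention(chunk(q), chunk(k), chunk(v)) for row in gpu]
ref_out = full_attention(q, k, v)
```

`ring_out` agrees with `ref_out` up to floating-point rounding, even though no simulated GPU ever sees more than its own Q chunk and one K,V chunk at a time.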

Megatron-LM Sequence Parallelism (SP)

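Megatron-LM uses a simpler form of SP. Around the non-TP operations (LayerNorm, Dropout), activations are split along the sequence dimension: the All-Reduce that TP would otherwise issue is replaced by a Reduce-Scatter going into the sequence-parallel region and an All-Gather coming out of it, which keeps each GPU's memory for those operations proportional to s/N. This is combined with TP for a complete solution: no single GPU ever holds the full (batch × seq × hidden) activation tensor.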

06

Expert Parallelism

Mixture-of-Experts (MoE) models replace one FFN per layer with many expert FFNs plus a router. Each token is processed by only 1–2 experts (sparse compute), but storing all experts on every GPU would multiply FFN memory by the expert count. Expert Parallelism (EP) solves this: place each expert on a different GPU, and physically ship tokens to wherever their assigned expert lives.

The router is a tiny learned linear layer: it scores each token against every expert and picks the top-k. Because tokens on GPU 0 may be routed to an expert on GPU 3, EP requires an All-to-All shuffle — every GPU simultaneously sends tokens to and receives tokens from every other GPU. This happens twice per MoE layer: once to dispatch tokens to experts, once to return results.

g = \text{softmax}(x \cdot W_r) \in \mathbb{R}^{E},\qquad \text{expert}(x) = \text{top-}k(g),\qquad y = \sum_{e \in \text{top-}k} g_e \cdot \text{FFN}_e(x)
The router: W_r is a small [d×E] matrix (E = number of experts). Each token x gets routed to its top-k experts (k=1 or 2) and the outputs are gate-weighted. Only the selected experts run — that’s the sparsity.
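
The routing formula translates almost line for line into code. The sketch below uses toy experts (simple scaling functions) and a hand-picked router matrix; both are assumptions made purely for illustration. Gates are the raw softmax values of the selected experts, as in the formula above, without renormalization.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def route(x, w_r, k):
    """Return [(expert_index, gate)] for the top-k experts of token x.

    w_r has shape [d][E]; gates are the softmax scores of the chosen experts.
    """
    num_experts = len(w_r[0])
    logits = [sum(x[i] * w_r[i][e] for i in range(len(x)))
              for e in range(num_experts)]
    g = softmax(logits)
    top = sorted(range(num_experts), key=lambda e: g[e], reverse=True)[:k]
    return [(e, g[e]) for e in top]

def moe_forward(x, w_r, experts, k):
    """y = sum over selected experts of g_e * FFN_e(x); other experts never run."""
    y = [0.0] * len(x)
    for e, gate in route(x, w_r, k):
        out = experts[e](x)
        y = [yi + gate * oi for yi, oi in zip(y, out)]
    return y

# Toy setup: d=2, E=4; expert e simply multiplies its input by (e + 1).
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
w_r = [[0.0, 2.0, 1.0, -1.0],
       [0.0, 0.0, 0.0, 0.0]]
x = [1.0, 0.0]

selected = route(x, w_r, k=2)
y = moe_forward(x, w_r, experts, k=2)
```

Under Expert Parallelism, the expert indices in `selected` decide which GPUs the token is shipped to in the first All-to-All.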
Figure 6.5 — Expert Parallelism, Step by Step: Experts Split, Tokens Travel
Six steps, one at a time: ① all 4 expert MLPs packed on one GPU — 4× FFN memory while each token only ever uses one. ② Each expert moves to its own GPU (whole experts — no matrix slicing). ③ The router assigns each token to one expert; the colored ring on each token shows its destination. ④ All-to-All #1: tokens travel to their expert’s GPU. ⑤ Each expert processes only its received tokens. ⑥ All-to-All #2: processed tokens return home. Cost: 2 All-to-Alls per MoE layer; gain: 1 expert of memory and 1 expert of compute per GPU.
⚠️ Load balancing — the catch

If the router sends most tokens to one popular expert, that GPU becomes a hotspot while others idle, and its receive buffer overflows (tokens beyond the capacity factor get dropped). Training adds an auxiliary load-balancing loss to keep expert assignment roughly uniform — the same problem and machinery covered in the MoE guide. DeepSeek-V3 instead uses auxiliary-loss-free balancing via per-expert bias terms.

📡 Communication cost

Each MoE layer moves every token’s hidden vector across GPUs twice (dispatch and return), once per selected expert: volume ≈ 2 × k × b × s × d elements per layer, i.e. 4 × k × b × s × d bytes in fp16. All-to-All latency is the typical bottleneck for cross-node EP, which is why production systems keep EP groups within a node where possible, limit how many nodes a token can be sent to, or overlap All-to-All with compute. EP composes with the other axes: DeepSeek-V3 trains with EP × PP × DP (ZeRO-1) simultaneously.

07

Context Parallelism

Context Parallelism (CP) is a newer variant introduced by Megatron-LM for extreme long-context training (128k–1M tokens). Unlike Sequence Parallelism (which integrates with TP and handles the non-attention ops), CP focuses specifically on distributing the K/V cache across GPUs during attention computation.

Each GPU holds a T/N segment of the full sequence. For self-attention, Q is local, but K and V must span the full sequence. CP uses an All-Gather of K and V at the start of each attention layer so that each GPU can compute attention for its local queries; in the backward pass, a Reduce-Scatter returns the K and V gradients to the GPUs that own them. Or, more efficiently, it uses the Ring Attention pattern to avoid the large All-Gather entirely.

Figure 6 — Context Parallelism, Step by Step: Shard the KV-Cache Along the Sequence
One cut at a time: the 320 GB KV cache (sequence × layers) doesn’t fit one GPU → slice it VERTICALLY along the sequence (contrast: pipeline parallelism slices horizontally by layers) → each GPU receives one token-range slice across all layers → during attention the slices exchange K,V via the Ring Attention pattern (Figure 5) → memory per GPU = 320/N, so longer context just means more GPUs.
💻 Memory scaling

Without CP: KV cache for 128k tokens at d=8192 = 128k × 2 × 8192 × 2 bytes ≈ 4 GB per layer. With 80 layers: 320 GB just for KV cache. With CP N=8: 320 / 8 = 40 GB per GPU in total, or 0.5 GB per GPU per layer. This makes long-context training feasible on standard hardware.
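
The arithmetic can be reproduced with a short helper. This sketch assumes 128k means 128 × 1024 tokens, 2 bytes per element, and that "GB" here is read in binary units (2³⁰ bytes), which is what makes the figures come out as round numbers; the function names are illustrative.

```python
def kv_bytes_per_layer(seq_len, hidden, bytes_per_elem=2):
    """K and V for one layer: 2 tensors x seq_len x hidden x bytes per element."""
    return 2 * seq_len * hidden * bytes_per_elem

def kv_gib_per_gpu(seq_len, hidden, layers, cp_size=1):
    """Total K,V memory per GPU in GiB when the sequence is split cp_size ways."""
    total = kv_bytes_per_layer(seq_len, hidden) * layers
    return total / cp_size / 2**30

seq_len, hidden, layers = 128 * 1024, 8192, 80
print(kv_bytes_per_layer(seq_len, hidden) / 2**30)       # per layer, no CP
print(kv_gib_per_gpu(seq_len, hidden, layers))           # all layers, no CP
print(kv_gib_per_gpu(seq_len, hidden, layers, cp_size=8))  # all layers, CP = 8
```

Because the per-GPU figure is simply the total divided by `cp_size`, doubling the context length can be absorbed by doubling the number of CP ranks.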

📡 CP vs SP

SP integrates with TP and handles the non-attention operations (LayerNorm, Dropout) along the sequence dimension. CP is specifically for the K/V attention bottleneck. They can be combined: use SP for non-attention layers and CP/Ring Attention for the attention layers themselves.

08

3D Parallelism — Combining Everything

Production LLM training at the frontier typically combines Data, Tensor, and Pipeline parallelism simultaneously. This is called 3D Parallelism [Narayanan et al., 2021]. The N_total GPUs are organized into a 3D grid: N_dp × N_tp × N_pp = N_total.

First, Baby Steps — Watch One Model + One Dataset Get Carved Into a 2×2×2 Grid
The three cuts are applied ONE AT A TIME, in order. Cut 1 (Pipeline): the MLP is split by layers — front half / back half. Cut 2 (Tensor): each stage’s layer weights are split in half — top neurons / bottom neurons. Cut 3 (Data): the entire 4-GPU assembly is cloned, and the dataset is split so each replica trains on different samples. Result: 2 (DP) × 2 (TP) × 2 (PP) = 8 GPUs, each with a unique coordinate.
Figure 7 — 3D Parallelism: GPU Organization Grid (N_dp=2, N_tp=4, N_pp=2)
[Figure 7 layout: PP Stage 0 (layers 0–L/2) holds GPUs 0–3 = DP0·PP0·TP0 to TP3, each with one column slice of W (W[:,0:d/4], W[:,d/4:d/2], W[:,d/2:3d/4], W[:,3d/4:d]), linked by TP All-Reduce. PP Stage 1 (layers L/2–L) holds GPUs 4–7 = DP0·PP1·TP0 to TP3 and exchanges activations with Stage 0 over PP point-to-point links. DP replica 1 (GPUs 8–15: DP1·TP0·PP0 … DP1·TP3·PP1) has the identical structure on a different data shard and All-Reduces gradients with DP replica 0 at the end of backward. Legend (Megatron-DeepSpeed configuration): Tensor Parallel = All-Reduce over NVLink; Pipeline Parallel = P2P over InfiniBand; Data Parallel = All-Reduce over InfiniBand. Example sizing shown in the figure: Llama-3 70B on 512 GPUs with N_tp=8 (within node), N_pp=4, N_dp=16, since 8 × 4 × 16 = 512.]
3D Parallelism organizes GPUs into a 3D grid. Within a node: Tensor Parallelism over NVLink (fast, which matters because TP is latency-sensitive). Across nodes: Pipeline Parallelism sends activations point-to-point between adjacent stages. Across replicas that hold the same layers and the same tensor shard: Data Parallelism all-reduces gradients at the end of backward. Each GPU has a unique (DP, TP, PP) coordinate that determines exactly which weights and which data it holds.
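
The coordinate scheme in Figure 7 can be written down as a rank mapping. The sketch below assumes the layout shown there (TP index varies fastest, then PP, then DP); real frameworks let you choose the ordering, so treat the function names and the ordering as illustrative assumptions.

```python
def rank_to_coords(rank, n_tp, n_pp):
    """Map a global rank to (dp, pp, tp), with TP varying fastest."""
    tp = rank % n_tp
    pp = (rank // n_tp) % n_pp
    dp = rank // (n_tp * n_pp)
    return dp, pp, tp

def group_members(rank, n_dp, n_tp, n_pp, axis):
    """Ranks that communicate with `rank` along one axis: 'tp', 'pp' or 'dp'."""
    dp, pp, tp = rank_to_coords(rank, n_tp, n_pp)

    def to_rank(d, p, t):
        return d * n_pp * n_tp + p * n_tp + t

    if axis == "tp":   # same replica, same stage: All-Reduce partners
        return [to_rank(dp, pp, t) for t in range(n_tp)]
    if axis == "pp":   # same replica, same tensor shard: pipeline neighbours
        return [to_rank(dp, p, tp) for p in range(n_pp)]
    if axis == "dp":   # same stage, same tensor shard: gradient All-Reduce
        return [to_rank(d, pp, tp) for d in range(n_dp)]
    raise ValueError(f"unknown axis: {axis}")

# Figure 7 grid: N_dp=2, N_tp=4, N_pp=2 (16 GPUs).
print(rank_to_coords(5, n_tp=4, n_pp=2))
print(group_members(5, 2, 4, 2, "tp"))
print(group_members(5, 2, 4, 2, "dp"))
```

Putting TP fastest keeps each TP group on consecutive ranks, which on typical clusters means the same node and therefore NVLink.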

How to Choose N_dp, N_tp, N_pp

| Factor | Prefer higher TP when... | Prefer higher PP when... | Prefer higher DP when... |
|---|---|---|---|
| Communication bandwidth | High NVLink bandwidth within node | Low cross-node bandwidth okay | Moderate InfiniBand acceptable |
| Model type | Large FFN hidden dim (d_ff >> d) | Many layers (deep model) | Good hardware efficiency achieved |
| Batch size | Any | Large batch (more micro-batches → less bubble) | Large global batch target |
| Memory pressure | Single layer too large for 1 GPU | Too many layers for 1 GPU | Layers fit, need throughput |
09

Communication Primitives

Every parallelism strategy reduces to a small set of collective communication operations. Understanding them reveals what's happening at the hardware level and why certain parallelism strategies require high-bandwidth links.

Figure 8 — The Four Core Collective Operations
[Figure 8 panels. All-Reduce: each GPU starts with aᵢ and ends with Σaᵢ; used by DDP gradient sync; volume 2(N-1)/N × data. All-Gather: each GPU starts with shard sᵢ and ends with [s₀..s₃]; used by ZeRO-3 parameter gather; volume (N-1)/N × data. Reduce-Scatter: inputs are summed and GPU i keeps only shard i of the sum; used by ZeRO-2/3 gradient sharding; volume (N-1)/N × data. All-Reduce = Reduce-Scatter followed by All-Gather (fused for efficiency). All-to-All: each GPU sends a different piece of data to every other GPU; used by MoE Expert Parallelism; O(N²) messages, the most expensive collective; volume (N-1)/N × data per GPU, the same bandwidth as Reduce-Scatter or All-Gather.]
All four collective operations used in distributed training. All-Reduce = gradient sync in DDP. All-Gather = re-materialise sharded params (ZeRO-3). Reduce-Scatter = first half of ZeRO gradient sharding; All-Reduce = Reduce-Scatter then All-Gather fused. All-to-All = Expert Parallelism token dispatch — the only O(N²) collective and the reason EP is communication-heavy.

Communication Cost Analysis

\text{All-Reduce: } \frac{2(N-1)}{N}\cdot M \quad\text{All-Gather: } \frac{N-1}{N}\cdot M \quad\text{Reduce-Scatter: } \frac{N-1}{N}\cdot M \quad\text{All-to-All: } \frac{N-1}{N}\cdot M
Per-GPU communication volumes (M = data size), assuming bandwidth-optimal algorithms such as ring. All-Reduce = Reduce-Scatter + All-Gather, so about 2× either one. All-to-All sends N-1 different messages, one to each peer, so its latency scales worse with N.
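
To see where the 2(N−1)/N factor comes from, the sketch below simulates a ring All-Reduce on Python lists as a Reduce-Scatter phase followed by an All-Gather phase, counting how many elements each GPU sends. It is a toy model (no real networking) that assumes the buffer length divides evenly by N; the function name is illustrative.

```python
def ring_all_reduce(buffers):
    """Simulate ring All-Reduce (sum) on equal-length lists, one per GPU.

    Returns (result_buffers, elements_sent_per_gpu).
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the buffer splits into n equal chunks"
    c = size // n
    data = [list(b) for b in buffers]
    sent = 0

    def chunk(g, idx):
        return data[g][idx * c:(idx + 1) * c]

    # Phase 1: Reduce-Scatter. At step t, GPU g sends chunk (g - t) % n to GPU g+1,
    # which adds it in. After n-1 steps GPU g owns the full sum of chunk (g + 1) % n.
    for step in range(n - 1):
        msgs = [(g, (g - step) % n, chunk(g, (g - step) % n)) for g in range(n)]
        for g, idx, payload in msgs:
            dst = (g + 1) % n
            for j, val in enumerate(payload):
                data[dst][idx * c + j] += val
        sent += c

    # Phase 2: All-Gather. The finished chunks circulate for another n-1 steps.
    for step in range(n - 1):
        msgs = [(g, (g + 1 - step) % n, chunk(g, (g + 1 - step) % n)) for g in range(n)]
        for g, idx, payload in msgs:
            dst = (g + 1) % n
            data[dst][idx * c:(idx + 1) * c] = payload
        sent += c

    return data, sent

buffers = [[float(g * 10 + i) for i in range(8)] for g in range(4)]
expected = [sum(b[i] for b in buffers) for i in range(8)]
result, sent = ring_all_reduce(buffers)
```

With 4 GPUs and 8 elements, each GPU sends 2 × 3 chunks of 2 elements, i.e. 12 elements = 2(N−1)/N × 8, and each phase on its own accounts for half of that.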
10

Selection Guide

Given a model and a GPU cluster, how do you choose your parallelism strategy? This is an engineering optimization with many trade-offs.

| Model size | GPU count | Recommended strategy | Notes |
|---|---|---|---|
| 1B–7B | 1–8 (1 node) | ZeRO-2/3 or FSDP | Usually fits with ZeRO-2. No PP needed. TP if very memory tight. |
| 7B–70B | 8–64 (1–8 nodes) | ZeRO-3 + TP(2–4) | TP within nodes. ZeRO-3 across nodes. PP if cross-node bandwidth is limited. |
| 70B–500B | 64–512 | TP(8) + PP(4–8) + DP | Megatron-style 3D parallelism. PP across nodes. TP within node. |
| 500B+ (MoE) | 512+ | TP + PP + DP + EP | Add Expert Parallelism. 4D or 5D. DeepSeek-V3 reports PP16 + EP64 + ZeRO-1 DP, with no TP during training. |
| Long context (128k+) | Any | Add SP or CP | Sequence/Context parallelism is orthogonal to TP/PP/DP. Combine as needed. |
💡 The key decision framework

Start with ZeRO-3/FSDP — it requires no architectural changes and handles most model sizes up to ~70B on 64 GPUs. Add TP when a single layer is too large for one GPU (before or after ZeRO). Add PP when TP is maxed out within a node and you need to span nodes. Add SP/CP when sequence length exceeds GPU memory. Add EP only for MoE architectures. The general principle: use the parallelism with the cheapest communication cost first, and layer in expensive communication only when necessary.

References

  1. Rajbhandari, S. et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC 2020. arXiv:1910.02054
  2. Shoeybi, M. et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053
  3. Huang, Y. et al. (2019). GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. NeurIPS 2019. arXiv:1811.06965
  4. Narayanan, D. et al. (2021). Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. SC 2021. arXiv:2104.04473
  5. Narayanan, D. et al. (2019). PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP 2019. arXiv:1806.03377
  6. Korthikanti, V. et al. (2022). Reducing Activation Recomputation in Large Transformer Models (Sequence Parallelism). MLSys 2023. arXiv:2205.05198
  7. Liu, H. et al. (2023). Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv:2310.01889
  8. Brandon, W. et al. (2023). Striped Attention: Faster Ring Attention for Causal Transformers. arXiv:2311.09431
  9. Jacobs, S.A. et al. (2023). DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. arXiv:2309.14509
  10. DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437 (5D parallelism)