Parallelism in LLM Training

00

Why Parallelism — The Scale Problem

A single A100 GPU has 80 GB of HBM memory. A 70B parameter model in BF16 needs ~140 GB just for weights — before gradients, optimizer states, or activations. Training GPT-4 scale models requires petaflops sustained across thousands of GPUs. Parallelism is not an optimization; it is a prerequisite.

The three bottlenecks that parallelism addresses:

💾 Memory wall

Model weights + gradients + optimizer states + activations exceed what a single device holds. Model parallelism (tensor, pipeline, ZeRO) distributes memory across devices.

⚡ Compute ceiling

Even if the model fits in memory, training on a small batch is slow. Data parallelism multiplies throughput by processing more data simultaneously across many devices.

📡 Communication cost

Every parallelism strategy requires synchronizing information between devices. The bandwidth and latency of inter-device links (NVLink, InfiniBand, PCIe) determines which parallelism strategies are practical at what scale.

The Complete Map

Parallelism Type	What it splits	Memory saving	Comm. pattern	Primary use case
Data Parallel (DP)	Training data (mini-batches)	None (replicates model)	All-Reduce gradients	Batch size scaling; baseline throughput
ZeRO-1	Optimizer states	4× opt states	All-Gather + Reduce-Scatter	Reduce optimizer memory redundancy
ZeRO-2	+ Gradients	8× grads+opt	All-Gather + Reduce-Scatter	Large models on DP nodes
ZeRO-3 / FSDP	+ Parameters	64× all states	All-Gather params (3× per step)	Largest models; 70B+ on commodity
Tensor Parallel (TP)	Individual layer weight matrices	1/N params per device	All-Reduce per layer	FFN and Attention within a node
Pipeline Parallel (PP)	Layers across stages	1/N layers per device	Point-to-point activations	Very deep models; across nodes
Sequence Parallel (SP)	Sequence dimension	1/N activation mem	All-Gather + Reduce-Scatter	Long context; activation memory
Context Parallel (CP)	Sequence (K,V distributed)	1/N KV cache	All-Reduce attn outputs	Extremely long sequences (100k+)
Expert Parallel (EP)	MoE experts across devices	1/N expert params	All-to-All token dispatch	MoE models; expert capacity
3D Parallel	DP × TP × PP combined	Multiplicative	All three patterns	Production frontier model training

01

Data Parallelism

The simplest parallelism strategy: replicate the full model on every GPU, partition the mini-batch across GPUs, synchronize gradients after each backward pass. No architectural changes needed. The communication pattern is a single All-Reduce of gradients at the end of backward.

\theta \leftarrow \theta - \eta \cdot \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \mathcal{L}(x_i, \theta)

Each GPU computes gradients on its shard of data. All-Reduce averages them. Equivalent to training on the full batch.

Figure 1 — Data Parallelism: Replicated Model, Partitioned Batch

Each GPU holds a complete model copy and processes its own data shard. Gradients flow toward the All-Reduce node, are averaged, and the identical updated weight is broadcast back to all GPUs. Memory cost: 4× full model (all GPUs hold identical weights — no savings). Throughput: N× faster training.

DDP vs DP in PyTorch

DataParallel (DP) uses a single process, multiple threads. The parameter server pattern: GPU 0 scatters data, collects gradients, broadcasts updates. GPU 0 becomes a bottleneck. Deprecated for serious use.

DistributedDataParallel (DDP) uses one process per GPU with NCCL all-reduce. All-reduce is symmetric — no parameter server bottleneck. Standard for data parallelism today. Gradient synchronization is overlapped with backward computation using gradient bucketing.

📊 Memory Cost of Pure Data Parallelism

For a 7B model in mixed precision: weights ~14 GB, gradients ~14 GB, Adam optimizer states (fp32 weights + m + v) ~84 GB. Total per GPU: ~112 GB. Every GPU holds all of this identically. No memory reduction at all. ZeRO was invented to fix exactly this.

02

ZeRO — Zero Redundancy Optimizer

ZeRO keeps the data-parallel training pattern but eliminates the massive memory redundancy. In standard DDP every GPU stores a full identical copy of everything — parameters, gradients, and optimizer states. ZeRO asks: why do all 4 GPUs hold the same copy of the momentum buffer? They don't have to.

What Is Sharding?

Sharding means partitioning a tensor across N GPUs so each GPU holds exactly 1/N of it. No GPU can use its shard alone — but together they hold the complete tensor without redundancy.

Concept — Sharding a 7B parameter model across 4 GPUs

Memory Math — Where Every Byte Comes From

📖 Beginner guide: Why 2Ψ for parameters? Why 12Ψ for optimizer states? How does 7B become 112 GB?

Every number in the memory formula comes from one fact: different data types use different numbers of bytes per value. A "parameter" is just a floating-point number. How many bytes it takes depends on its precision.

fp16

16 bits = 2 bytes per value

Ψ values × 2 bytes = 2Ψ

Used for: model parameters (weights) and gradients during the forward + backward pass.
fp16 = "16-bit float". Each number is stored in 16 binary digits = 2 bytes. It has less precision than fp32 but is fast on modern GPU Tensor Cores and uses half the memory.
7B parameters × 2 bytes = 14,000,000,000 bytes ≈ 14 GB
Same for gradients: each gradient has the same shape as its parameter, so also 2Ψ bytes.

fp32

32 bits = 4 bytes per value

Ψ values × 4 bytes = 4Ψ

Used for: the Adam optimizer's three internal buffers.
fp32 = "32-bit float" (standard Python float). The Adam optimizer must use fp32 for numerical stability — the tiny gradient corrections would round to zero in fp16. So even though the model runs in fp16, the optimizer bookkeeping is in fp32.
7B × 4 bytes = 28 GB per Adam buffer

Adam = 3 buffers

fp32 copy + m + v

3 × 4Ψ = 12Ψ

Adam keeps three separate fp32 copies of the parameters:
Ⓐ fp32 master copy — the high-precision "true" weights. fp16 params are a lossy rounded snapshot of these.
Ⓑ momentum mₜ — running average of past gradients (where am I moving?).
Ⓒ variance vₜ — running average of squared gradients (how noisy is my gradient?).
3 buffers × 4 bytes × 7B = 84 GB
This is why optimizer states dominate memory: 84 GB out of 112 GB total.

🧮 How “7B parameters” becomes “112 GB” — step by step

Count every byte:

fp16 params: 7×10⁹ × 2 = 14×10⁹ bytes

fp16 grads: 7×10⁹ × 2 = 14×10⁹ bytes

fp32 copy: 7×10⁹ × 4 = 28×10⁹ bytes

fp32 moment: 7×10⁹ × 4 = 28×10⁹ bytes

fp32 variance:7×10⁹ × 4 = 28×10⁹ bytes

Total = 112×10⁹ bytes

Convert to GB (using 1 GB = 10⁹ bytes):

112×10⁹ bytes ÷ 10⁹ = 112 GB

Note: computer RAM uses binary GB (1 GB = 2³⁰ = 1.074×10⁹ bytes), so the precise answer is 104.3 binary-GB. But ML practitioners always use the round “10⁹ = 1 GB” approximation, which gives the clean 112 GB number.

M_{\text{total}} = \underbrace{2\Psi}_{\text{fp16 params}} + \underbrace{2\Psi}_{\text{fp16 grads}} + \underbrace{4\Psi + 4\Psi + 4\Psi}_{\text{fp32: master copy + momentum + variance}} = 16\Psi \text{ bytes}

Total memory per GPU in mixed-precision training with Adam. For a 7B model: 16 × 7×10⁹ = 112×10⁹ bytes = 112 GB.

M_{\text{ZeRO-1}} = 2\Psi + 2\Psi + \frac{12\Psi}{N} \qquad M_{\text{ZeRO-2}} = 2\Psi + \frac{2\Psi}{N} + \frac{12\Psi}{N} \qquad M_{\text{ZeRO-3}} = \frac{16\Psi}{N}

N = number of GPUs. For 7B model, N=4: ZeRO-1 = 14+14+21 = 49 GB. ZeRO-2 = 14+3.5+21 = 38.5 GB. ZeRO-3 = 16×7×10⁹/4 = 28 GB (exactly 4× less than baseline).

The Memory Math — Where the Numbers Come From (N=4, 7B model)

Each GPU is "responsible" for a 1/N slice of the parameters. That means it must hold the optimizer states (fp32 master weight + momentum + variance) only for its slice. This is where the savings come from.

Memory breakdown per GPU — 7B model (Ψ = 7B params), N = 4 GPUs

⚠ Why optimizer states are always 21 GB/GPU (not 84 GB)

All three ZeRO stages shard the optimizer states. ZeRO-1 is the minimum that does this. GPU 0 stores the fp32 master weight, momentum, and variance for only its 1.75B parameter slice. That's 12 × 1.75B bytes = 21 GB — and it never grows. Stages 2 and 3 add gradient and parameter sharding on top, but the optimizer is already at 21 GB from ZeRO-1 onward.

Figure 2 — Interactive: Memory per GPU at Each ZeRO Stage

GPU 0

variance 28 GB

momentum 28 GB

fp32 copy 28 GB

grads 14 GB

params 14 GB

112 GB

fully duplicated

GPU 1

variance 28 GB

momentum 28 GB

fp32 copy 28 GB

grads 14 GB

params 14 GB

112 GB

fully duplicated

GPU 2

variance 28 GB

momentum 28 GB

fp32 copy 28 GB

grads 14 GB

params 14 GB

112 GB

fully duplicated

GPU 3

variance 28 GB

momentum 28 GB

fp32 copy 28 GB

grads 14 GB

params 14 GB

112 GB

fully duplicated

■ fp16 params ■ fp16 grads ■ fp32 param copy ■ fp32 momentum ■ fp32 variance

ZeRO-0 (DDP): No sharding — every GPU holds all 112 GB identically

Communication: All-Reduce gradients each step

Toggle between stages. The bars physically shrink as data is sharded away. Note that optimizer states (the 3 largest bars) shrink first in ZeRO-1, then gradients in ZeRO-2, then parameters in ZeRO-3. The optimizer state bars always represent the shard for that GPU's 1.75B parameters — that's why they're 21 GB and not 84 GB from ZeRO-1 onward.

03

Tensor Parallelism

Tensor Parallelism (TP) splits individual weight matrices across GPUs. Each GPU holds one slice of the weights, computes a partial result, then one All-Reduce combines the pieces. No pipeline waiting, no layer boundaries needed — it happens inside a single layer.

📖 Beginner’s guide: How are weights stored in a neural network?

To understand tensor parallelism you first need to see exactly what a weight matrix is in memory — and why it’s a rectangle of numbers.

One neuron = one column

Each column = all the weights feeding into that neuron

d_in rows × d_out cols

A fully-connected layer with d_in input features and d_out output neurons is exactly one matrix W of shape [d_in × d_out].
Column j of W holds all the weights connecting every input to neuron j. There are d_out such columns — one per output neuron. The computation is simply:
output = activation( input × W ) where input is [1×d_in], W is [d_in×d_out], output is [1×d_out]
Memory cost: d_in × d_out floating-point values. A layer with 4096 inputs and 16384 hidden units: 4096×16384 = 67M values = 128 MB in fp16.

A Transformer FFN = 2 matmuls

Expand → activate → project back

W1: [d×4d] W2: [4d×d]

A standard Transformer FFN block has two weight matrices:
h = GeLU( x · W1 ) expands from d to 4d (e.g. 4096 → 16384)
y = h · W2 projects back from 4d to d (16384 → 4096)
For a 7B model (d=4096): W1 = [4096×16384] = 64M values = 128 MB. W2 = same. Both together = 256 MB — just for one layer’s FFN. A 32-layer model: 32 × 256 MB = 8 GB just for FFN weights.

🧮 Why matrix multiplication is parallelizable

The output neuron j only uses column j of W. It is completely independent of all other output neurons. This means you can give GPU 0 columns [0..N/2) and GPU 1 columns [N/2..N), and they compute their neurons in parallel — with zero dependency between them. The only synchronization needed is combining the final results (one All-Reduce). This is exactly what Tensor Parallelism exploits.

W_1 \in \mathbb{R}^{d \times 4d},\quad h = \text{GeLU}(x\,W_1) \in \mathbb{R}^{4d},\quad y = h\,W_2 \in \mathbb{R}^{d}

One FFN forward pass. x is one token’s embedding [1×d]. W1 expands it to [1×4d]. W2 projects it back to [1×d]. Tensor parallelism splits W1 column-wise and W2 row-wise.

Figure 3 — Fixed MLP (d=4, d_ff=8): Without vs With Tensor Parallelism (N=2 GPUs)

1 / 5

A concrete 2-layer MLP: input d=4, hidden d_ff=8, output d=4. Left: one GPU, all 64 weights, sequential. Right: 2-GPU TP — W1 column-split, W2 row-split, both GPUs compute in parallel, one All-Reduce combines partial outputs. Each GPU holds 32 weights (half the memory).

📊 TP Communication Formula

For N-way TP with L layers: communication = 2 × L × 2 × b × s × d bytes per forward pass (2 messages per layer: one after Attention output projection, one after FFN W2). A 70B model with L=80, d=8192, s=2048, b=1: ~5.4 TB of data moved per forward pass with N=8. This is why TP requires NVLink (600 GB/s), not InfiniBand (200 GB/s). For cross-node TP, the communication cost dominates — keep TP within a single node.

Attention Tensor Parallelism

Multi-Head Attention is also split: attention heads are distributed across GPUs. With 32 heads and 8-way TP, each GPU handles 4 heads. The Q, K, V projection matrices and the output projection follow the same column/row-parallel pattern as FFN. An All-Reduce is needed only after the output projection — not between Q, K, V and the attention computation.

04

Pipeline Parallelism

Pipeline Parallelism (PP) assigns consecutive groups of layers to different GPUs (stages). The model's depth is split across devices; activations flow forward stage-by-stage, and gradients flow backward. Unlike TP, PP communication is only between adjacent stages, which makes it suitable for cross-node (InfiniBand) connections.

The Idea — Take One Deep Model, Cut It Into Consecutive Stages

Figure 4 — Pipeline Parallelism on a 4-Layer MLP: Naive vs GPipe vs 1F1B

1 / 5

Naive: the whole batch flows through GPU 0 → 1 → 2 → 3 (forward, blue), then gradients flow back 3 → 2 → 1 → 0 (backward, red). At any moment only ONE GPU works — the other three idle. Watch the Gantt chart at the bottom fill in: the grey area is the bubble (wasted compute).

The Bubble Problem

In a naive pipeline with N stages, the pipeline fill phase wastes N-1 time slots. The drain phase (after the last stage finishes forward) wastes another N-1 slots. Bubble fraction = (N-1) / (N-1+M) where M is the number of micro-batches.

\text{Bubble fraction} = \frac{N-1}{N-1+M} \xrightarrow{M\gg N} 0

Make M >> N (many micro-batches per batch) to reduce bubble. GPipe uses M = N×k. 1F1B uses M = N, achieving the same bubble with 1/(N-1+1) memory cost.

📊 1F1B Memory Advantage

GPipe requires O(M×N) activation memory (holds all M micro-batch activations for all N stages until backward). 1F1B requires only O(N) activations because each stage immediately does its backward pass once the subsequent stage no longer needs its output. For N=8, M=8: GPipe needs 8× more activation memory than 1F1B.

05

Sequence Parallelism & Ring Attention

Standard TP leaves the LayerNorm and Dropout activations un-parallelized — they still require the full sequence on one GPU. Sequence Parallelism (SP) extends TP along the sequence dimension: these non-tensor-parallel operations are now split across GPUs along seq_len, completely removing per-GPU activation memory dependence on sequence length.

For very long contexts (128k+ tokens), even the attention computation itself is too large for one GPU. Ring Attention distributes the sequence across a ring of GPUs: each GPU holds a segment of Q, K, V, then passes K and V around the ring, computing its Q against every K, V shard it receives.

📖 Beginner’s guide: What does attention actually need? Why do long sequences break one GPU?

Ring Attention only makes sense once you see exactly what the attention operation requires. Three facts:

Fact 1: every token looks at every token

Q·Kᵀ is all-pairs

T queries × T keys

Each token produces a query Q, a key K, and a value V. Token 9’s output is a weighted mix of all tokens’ values, with weights = Q₉·Kⱼ for every j. So to compute attention for token 9, you need K and V of all T tokens available. There is no way around this — it’s what attention is.

Fact 2: memory grows with sequence length

K,V storage ∝ T

128k tokens ≈ 64 GB of K,V

For a 7B model (d=4096, 32 layers, fp16): storing K and V for the full sequence costs 2 × T × d × layers × 2 bytes = 2×128k×4096×32×2 ≈ 64 GB
That alone exceeds an 80 GB GPU once you add weights and activations. One GPU physically cannot hold the K,V for a 128k-token sequence.

Fact 3: naive splitting breaks attention

each GPU sees only its slice

12 of 16 pairs missing

Natural idea: give each of 4 GPUs a quarter of the tokens. GPU 0 gets tokens 1–4, GPU 1 gets 5–8, etc. Each computes Q,K,V for its own tokens. But now GPU 0’s queries can only see GPU 0’s keys — tokens 1–4 attend to tokens 1–4 and are blind to tokens 5–16. In the 4×4 coverage grid (which Q-chunk has seen which K,V-chunk), only the diagonal is filled. The other 12 cells are missing. That is NOT attention.

💡 The Ring Attention trick

Don’t gather all K,V to every GPU (that would need full memory again). Instead: keep Q parked, and pass the K,V chunks around a ring, one hop at a time. Each GPU computes attention between its (fixed) Q and whatever K,V chunk is currently visiting, accumulates the partial result, then forwards the chunk to its neighbor. After N−1 = 3 rotations, every Q has met every K,V — the full 4×4 grid is covered — yet each GPU only ever held 2 chunks at a time (its own + one visitor). Watch it happen below.

Figure 5 — Without vs With Ring Attention: K,V Chunks Rotate, Coverage Grid Fills

1 / 5

Top: without sequence parallelism, one GPU must hold K,V for all 16 tokens — at 128k tokens this exceeds GPU capacity (red bar past the yellow capacity line). Bottom: with Ring Attention, each GPU keeps only its 4 Q tokens (parked, never move) plus ONE visiting K,V chunk. During rotation, slots show "…in transit…" and the colored chunks visibly travel to the next GPU (GPU 3's chunk wraps around the bottom path back to GPU 0). The coverage grid fills one diagonal per step — after 4 steps, every Q has seen every K,V: full attention at O(T/N) memory.

Megatron-LM Sequence Parallelism (SP)

A simpler form of SP used in Megatron. During the non-TP operations (LayerNorm, Dropout), the sequence is split using Reduce-Scatter and All-Gather to keep each GPU's memory proportional to s/N. This is combined with TP for a complete solution: no single GPU ever holds the full (batch × seq × hidden) activation tensor.

06

Expert Parallelism

Mixture-of-Experts (MoE) models replace one FFN per layer with many expert FFNs plus a router. Each token is processed by only 1–2 experts (sparse compute), but storing all experts on every GPU would multiply FFN memory by the expert count. Expert Parallelism (EP) solves this: place each expert on a different GPU, and physically ship tokens to wherever their assigned expert lives.

The router is a tiny learned linear layer: it scores each token against every expert and picks the top-k. Because tokens on GPU 0 may be routed to an expert on GPU 3, EP requires an All-to-All shuffle — every GPU simultaneously sends tokens to and receives tokens from every other GPU. This happens twice per MoE layer: once to dispatch tokens to experts, once to return results.

g = \text{softmax}(x \cdot W_r) \in \mathbb{R}^{E},\qquad \text{expert}(x) = \text{top-}k(g),\qquad y = \sum_{e \in \text{top-}k} g_e \cdot \text{FFN}_e(x)

The router: W_r is a small [d×E] matrix (E = number of experts). Each token x gets routed to its top-k experts (k=1 or 2) and the outputs are gate-weighted. Only the selected experts run — that’s the sparsity.

Figure 6.5 — Expert Parallelism, Step by Step: Experts Split, Tokens Travel

1 / 6

Six steps, one at a time: ① all 4 expert MLPs packed on one GPU — 4× FFN memory while each token only ever uses one. ② Each expert moves to its own GPU (whole experts — no matrix slicing). ③ The router assigns each token to one expert; the colored ring on each token shows its destination. ④ All-to-All #1: tokens travel to their expert’s GPU. ⑤ Each expert processes only its received tokens. ⑥ All-to-All #2: processed tokens return home. Cost: 2 All-to-Alls per MoE layer; gain: 1 expert of memory and 1 expert of compute per GPU.

⚠️ Load balancing — the catch

If the router sends most tokens to one popular expert, that GPU becomes a hotspot while others idle, and its receive buffer overflows (tokens beyond the capacity factor get dropped). Training adds an auxiliary load-balancing loss to keep expert assignment roughly uniform — the same problem and machinery covered in the MoE guide. DeepSeek-V3 instead uses auxiliary-loss-free balancing via per-expert bias terms.

📡 Communication cost

Each MoE layer moves every token’s hidden vector across GPUs twice: volume ≈ 2 × b × s × d × k bytes per layer. All-to-All latency is the typical bottleneck for cross-node EP, which is why production systems (DeepSeek-V3, Mixtral inference) keep EP groups within a node or overlap All-to-All with compute. EP composes with the other axes: DeepSeek-V3 trains with EP × PP × DP × ZeRO simultaneously.

07

Context Parallelism

Context Parallelism (CP) is a newer variant introduced by Megatron-LM for extreme long-context training (128k–1M tokens). Unlike Sequence Parallelism (which integrates with TP and handles the non-attention ops), CP focuses specifically on distributing the K/V cache across GPUs during attention computation.

Each GPU holds a T/N segment of the full sequence. For self-attention, Q is local, but K and V must span the full sequence. CP uses an All-Gather of K and V at the start of each attention layer, distributes the computation, then uses Reduce-Scatter to aggregate attention outputs. Or — more efficiently — uses the Ring Attention pattern to avoid the large All-Gather entirely.

Figure 6 — Context Parallelism, Step by Step: Shard the KV-Cache Along the Sequence

1 / 5

One cut at a time: the 320 GB KV cache (sequence × layers) doesn’t fit one GPU → slice it VERTICALLY along the sequence (contrast: pipeline parallelism slices horizontally by layers) → each GPU receives one token-range slice across all layers → during attention the slices exchange K,V via the Ring Attention pattern (Figure 5) → memory per GPU = 320/N, so longer context just means more GPUs.

💻 Memory scaling

Without CP: KV cache for 128k tokens at d=8192 = 128k × 2 × 8192 × 2 bytes ≈ 4 GB per layer. With 80 layers: 320 GB just for KV cache. With CP N=8: 40 GB total, 5 GB per GPU per layer. Makes long-context training feasible on standard hardware.

📡 CP vs SP

SP integrates with TP and handles everything (LayerNorm, Dropout, Attention) along the sequence dimension. CP is specifically for the KV attention bottleneck. They can be combined: use SP for non-attention layers and CP/Ring Attention for the attention layers themselves.

08

3D Parallelism — Combining Everything

Production LLM training (GPT-4, Llama-3 70B, DeepSeek-V3) combines Data, Tensor, and Pipeline parallelism simultaneously. This is called 3D Parallelism [Narayanan et al., 2021]. The N_total GPUs are organized into a 3D grid: N_dp × N_tp × N_pp = N_total.

First, Baby Steps — Watch One Model + One Dataset Get Carved Into a 2×2×2 Grid

1 / 5

The three cuts are applied ONE AT A TIME, in order. Cut 1 (Pipeline): the MLP is split by layers — front half / back half. Cut 2 (Tensor): each stage’s layer weights are split in half — top neurons / bottom neurons. Cut 3 (Data): the entire 4-GPU assembly is cloned, and the dataset is split so each replica trains on different samples. Result: 2 (DP) × 2 (TP) × 2 (PP) = 8 GPUs, each with a unique coordinate.

Figure 7 — 3D Parallelism: GPU Organization Grid (N_dp=2, N_tp=4, N_pp=2)

3D Parallelism organizes GPUs into a 3D grid. Within a node: Tensor Parallelism using NVLink (fast, latency-sensitive). Across nodes within a PP stage: Pipeline Parallelism sends activations between adjacent stages. Across all nodes with same layers: Data Parallelism all-reduces gradients at the end of backward. Each GPU knows exactly what it holds (which DP/TP/PP coordinates it occupies).

How to Choose N_dp, N_tp, N_pp

Factor	Prefer higher TP when...	Prefer higher PP when...	Prefer higher DP when...
Communication bandwidth	High NVLink bandwidth within node	Low cross-node bandwidth okay	Moderate InfiniBand acceptable
Model type	Large FFN hidden dim (d_ff >> d)	Many layers (deep model)	Good hardware efficiency achieved
Batch size	Any	Large batch (more micro-batches → less bubble)	Large global batch target
Memory pressure	Single layer too large for 1 GPU	Too many layers for 1 GPU	Layers fit, need throughput

09

Communication Primitives

Every parallelism strategy reduces to a small set of collective communication operations. Understanding them reveals what's happening at the hardware level and why certain parallelism strategies require high-bandwidth links.

Figure 8 — The Four Core Collective Operations

All four collective operations used in distributed training. All-Reduce = gradient sync in DDP. All-Gather = re-materialise sharded params (ZeRO-3). Reduce-Scatter = first half of ZeRO gradient sharding; All-Reduce = Reduce-Scatter then All-Gather fused. All-to-All = Expert Parallelism token dispatch — the only O(N²) collective and the reason EP is communication-heavy.

Communication Cost Analysis

\text{All-Reduce: } \frac{2(N-1)}{N}\cdot M \quad\text{All-Gather: } \frac{N-1}{N}\cdot M \quad\text{Reduce-Scatter: } \frac{N-1}{N}\cdot M \quad\text{All-to-All: } \frac{N-1}{N}\cdot M

Communication volumes (M = data size). All-Reduce = All-Gather + Reduce-Scatter ≈ 2×. All-to-All sends N different messages, one to each peer — latency scales worse with N.

10

Selection Guide

Given a model and a GPU cluster, how do you choose your parallelism strategy? This is an engineering optimization with many trade-offs.

Model Size	GPU Count	Recommended Strategy	Notes
1B–7B	1–8 (1 node)	ZeRO-2/3 or FSDP	Usually fits with ZeRO-2. No PP needed. TP if very memory tight.
7B–70B	8–64 (1–8 nodes)	ZeRO-3 + TP(2–4)	TP within nodes. ZeRO-3 across nodes. PP if cross-node bw limited.
70B–500B	64–512	TP(8) + PP(4–8) + DP	Megatron-style 3D parallelism. PP across nodes. TP within node.
500B+ (MoE)	512+	TP + PP + DP + EP	Add Expert Parallelism. 4D or 5D. DeepSeek-V3: TP16 + PP16 + DP4 + EP8.
Long context (128k+)	Any	Add SP or CP	Sequence/Context parallelism orthogonal to TP/PP/DP. Combine as needed.

💡 The key decision framework

Start with ZeRO-3/FSDP — it requires no architectural changes and handles most model sizes up to ~70B on 64 GPUs. Add TP when a single layer is too large for one GPU (before or after ZeRO). Add PP when TP is maxed out within a node and you need to span nodes. Add SP/CP when sequence length exceeds GPU memory. Add EP only for MoE architectures. The general principle: use the parallelism with the cheapest communication cost first, and layer in expensive communication only when necessary.

References

Rajbhandari, S. et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC 2020. arXiv:1910.02054
Shoeybi, M. et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053
Huang, Y. et al. (2019). GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. NeurIPS 2019. arXiv:1811.06965
Narayanan, D. et al. (2021). Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. SC 2021. arXiv:2104.04473
Narayanan, D. et al. (2019). PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP 2019. arXiv:1806.03377
Korthikanti, V. et al. (2022). Reducing Activation Recomputation in Large Transformer Models (Sequence Parallelism). MLSys 2023. arXiv:2205.05198
Liu, H. et al. (2023). Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv:2310.01889
Brandon, W. et al. (2023). Striped Attention: Faster Ring Attention for Causal Transformers. arXiv:2311.09431
Jacobs, S.A. et al. (2023). DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. arXiv:2309.14509
DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437 (5D parallelism)