BN
Batch Normalization
CNNs · fully-connected layers
\hat{x}_{b,s,d} = \frac{x_{b,s,d} - \mu_{s,d}}{\sigma_{s,d}}, \quad \mu_{s,d} = \frac{1}{B}\sum_b x_{b,s,d}
normalise: over the batch dimension B; one μ, σ per (S,D) position
squash
B ← averaged
S kept
D kept
problem: batch statistics are noisy at small B and degenerate at B=1 (zero variance), so evaluation needs running mean/var estimates collected during training.
learnable: γ, β per (S,D) position; scale and shift applied after normalisation
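
As a rough illustration of this card, the NumPy sketch below normalises a (B, S, D) tensor over the batch axis and keeps one mean and variance per (S, D) position, as described here. The function name `batch_norm`, the `eps` and `momentum` values, and the moving-average update for the running statistics are assumptions made for the example, not taken from the source; library layers may pool statistics over additional axes.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.1, eps=1e-5):
    """Normalise x of shape (B, S, D) over the batch axis (axis 0)."""
    if training:
        mu = x.mean(axis=0)    # shape (S, D): one mean per position
        var = x.var(axis=0)    # shape (S, D): one variance per position
        # Assumed update rule: exponential moving average, used later at eval time.
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # Eval mode: no batch statistics, so B=1 is fine here.
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

rng = np.random.default_rng(0)
B, S, D = 8, 4, 16
x = rng.normal(loc=3.0, scale=2.0, size=(B, S, D))
gamma, beta = np.ones((S, D)), np.zeros((S, D))

# Training step: batch statistics are used and the running stats are updated.
y, rm, rv = batch_norm(x, gamma, beta, np.zeros((S, D)), np.ones((S, D)))

# Eval step on a single sample: relies entirely on the running stats.
y_eval, _, _ = batch_norm(x[:1], gamma, beta, rm, rv, training=False)
```

The eval call only behaves sensibly once the running statistics have seen enough training batches, which is the train/eval mismatch the card points at.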
LN
Layer Normalization
Transformers · BERT · GPT
\hat{x}_{b,s,d} = \frac{x_{b,s,d} - \mu_{b,s}}{\sigma_{b,s}}, \quad \mu_{b,s} = \frac{1}{D}\sum_d x_{b,s,d}
normalise: over the embedding dimension D; one μ, σ per token (b,s)
squash
B kept
S kept
D ← averaged
why LLMs: independent of B; works at B=1, needs no running stats, and behaves identically at train and eval time
learnable: γ, β of size D; one scale and shift per embedding dimension
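
A minimal NumPy sketch of the per-token normalisation described on this card may make the batch independence concrete. The function name `layer_norm` and the `eps` value are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise x of shape (B, S, D) over the last axis, one mu/sigma per token."""
    mu = x.mean(axis=-1, keepdims=True)    # shape (B, S, 1)
    var = x.var(axis=-1, keepdims=True)    # shape (B, S, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # gamma, beta broadcast over (B, S)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 5, 16))
gamma, beta = np.ones(16), np.zeros(16)

y = layer_norm(x, gamma, beta)
# Feeding the first sample alone gives the same result for that sample,
# because no statistic is shared across the batch.
y_single = layer_norm(x[:1], gamma, beta)
```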
RMS
RMSNorm
LLaMA · Mistral · Gemma · Qwen
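\hat{x}_{b,s,d} = \frac{x_{b,s,d}}{\text{RMS}(x_{b,s})}, \quad \text{RMS}(x_{b,s}) = \sqrt{\frac{1}{D}\sum_d x_{b,s,d}^2}
vs LayerNorm: no mean subtraction, only RMS scaling. The two coincide (up to β) when a token's features already have zero mean; RMSNorm does not require this, it simply drops the re-centering step on the hypothesis that re-scaling accounts for most of the benefit.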
squash
B kept
S kept
D → RMS only
why faster: skips the mean computation and subtraction for each token. The FLOP saving is small, but there is one fewer reduction over D on the critical path.
learnable: γ of size D only; no β (no re-centering)
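
The following NumPy sketch shows the RMS-only scaling from this card, and checks the relationship to LayerNorm on inputs that have been centered by hand. The function name `rms_norm` and the `eps` value are assumptions for illustration.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Scale x of shape (B, S, D) by the root mean square over the last axis."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)  # (B, S, 1)
    return gamma * x / rms                                        # no mean subtraction, no beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 5, 16))
gamma = np.ones(16)

y = rms_norm(x, gamma)   # unit RMS per token, but the mean is NOT removed

# On zero-mean input, RMS scaling matches LayerNorm without beta (same eps).
xc = x - x.mean(axis=-1, keepdims=True)
y_centered = rms_norm(xc, gamma)
ln_centered = xc / np.sqrt(xc.var(axis=-1, keepdims=True) + 1e-6)
```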
GN
Group Normalization
Vision models · small batch size
\hat{x}_{b,s,d} = \frac{x_{b,s,d} - \mu_{b,s,g}}{\sigma_{b,s,g}}, \quad g = \lfloor d \cdot G / D \rfloor
normalise: D is split into G groups; normalise within each group, per token
squash
B kept
S kept
D/G ← averaged
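B-free: like LayerNorm, stats are computed within a single sample. G=1 → LayerNorm. In the usual vision formulation, where each group's stats are also pooled over spatial positions, G=D (one channel per group) → InstanceNorm; in the per-token form shown here, G=D would leave a single value per group, which is degenerate.
learnable: γ, β of size D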
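
A small NumPy sketch of the per-token grouping described on this card is given below. It follows the card's (B, S, D) formulation, with contiguous groups so that feature d (0-indexed) falls in group ⌊d·G/D⌋. The function name `group_norm`, the `num_groups` parameter name, and `eps` are assumptions for illustration; vision implementations typically also pool each group's statistics over spatial positions.

```python
import numpy as np

def group_norm(x, gamma, beta, num_groups, eps=1e-5):
    """Normalise x of shape (B, S, D) within each of num_groups feature groups, per token."""
    B, S, D = x.shape
    assert D % num_groups == 0, "D must be divisible by the number of groups"
    xg = x.reshape(B, S, num_groups, D // num_groups)  # contiguous feature groups
    mu = xg.mean(axis=-1, keepdims=True)               # (B, S, G, 1)
    var = xg.var(axis=-1, keepdims=True)               # (B, S, G, 1)
    x_hat = ((xg - mu) / np.sqrt(var + eps)).reshape(B, S, D)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
B, S, D = 2, 5, 16
x = rng.normal(loc=3.0, scale=2.0, size=(B, S, D))
gamma, beta = np.ones(D), np.zeros(D)

y = group_norm(x, gamma, beta, num_groups=4)

# With a single group the statistics span all of D, i.e. per-token LayerNorm.
y_g1 = group_norm(x, gamma, beta, num_groups=1)
ln = (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + 1e-5)
```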
IN
Instance Normalization
Style transfer · generative models
\hat{x}_{b,s,d} = \frac{x_{b,s,d} - \mu_{b,d}}{\sigma_{b,d}}, \quad \mu_{b,d} = \frac{1}{S}\sum_s x_{b,s,d}
normalise: over the sequence dimension S; one μ, σ per (batch, feature) pair
squash
B kept
S ← averaged
D kept
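effect: removes per-instance style information (each sample's per-feature mean and scale), which is useful for style transfer, where a new style is injected afterwards
in NLP: rarely used; averaging over sequence positions mixes information across tokens, so the output for one token depends on the rest of the sequence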
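
To make the "style removal" effect concrete, the NumPy sketch below normalises over the sequence axis and applies it to two samples whose features differ in offset and scale. The function name `instance_norm`, the `eps` value, and the synthetic "style" (a shift of 10 and a scale of 5) are assumptions for illustration; no learnable parameters are included because the card lists none.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalise x of shape (B, S, D) over the sequence axis, one mu/sigma per (b, d)."""
    mu = x.mean(axis=1, keepdims=True)    # shape (B, 1, D)
    var = x.var(axis=1, keepdims=True)    # shape (B, 1, D)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 50, 8))
x[1] = 5.0 * x[1] + 10.0                  # second sample gets a different "style"

y = instance_norm(x)
# After normalisation both samples have roughly zero mean and unit variance
# per feature, so the offset and scale that distinguished them are gone.
```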
CMP
Side-by-Side Comparison
| Method | Avg over | Stats per | B=1 safe? | Train=Eval? | Use in LLMs? |
|---|---|---|---|---|---|
| BatchNorm | B | (S,D) position | ✗ No | ✗ running stats | Rarely |
| LayerNorm | D | token (b,s) | ✓ Yes | ✓ same | Standard (GPT, BERT) |
| RMSNorm | D (no mean) | token (b,s) | ✓ Yes | ✓ same | ⭐ LLaMA, Mistral… |
| GroupNorm | D/G group | token × group | ✓ Yes | ✓ same | Rarely (mainly vision) |
| InstanceNorm | S | (b,d) pair | ✓ Yes | ✓ same | Style transfer only |
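
The "Avg over" and "Stats per" columns can be checked by looking at the shape of the statistics each method produces. The sketch below is only a shape check under assumed sizes (B=4, S=6, D=8, G=2); the dictionary name `stat_shapes` is hypothetical.

```python
import numpy as np

B, S, D, G = 4, 6, 8, 2
x = np.zeros((B, S, D))

# Shape of the statistic each method computes on a (B, S, D) tensor.
stat_shapes = {
    "BatchNorm": x.mean(axis=0).shape,                             # (S, D): per position
    "LayerNorm": x.mean(axis=2).shape,                             # (B, S): per token
    "RMSNorm": np.sqrt((x ** 2).mean(axis=2)).shape,               # (B, S): per token
    "GroupNorm": x.reshape(B, S, G, D // G).mean(axis=3).shape,    # (B, S, G): token x group
    "InstanceNorm": x.mean(axis=1).shape,                          # (B, D): per (b, d) pair
}

for name, shape in stat_shapes.items():
    print(f"{name:12s} {shape}")
```

Only the BatchNorm entry lacks a B axis, which is why it is the one method whose statistics depend on the rest of the batch.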