Normalization in Deep Learning

Every method normalizes the same tensor. The only question is which dimensions you average over — and what that means for your batch, tokens, and features.

tensor shape: [B, S, D] = [batch, sequence, embedding_dim]
Normalization Methods overview
BN: Batch Normalization (CNNs · fully-connected layers)
\hat{x}_{b,s,d} = \frac{x_{b,s,d} - \mu_{s,d}}{\sigma_{s,d}}, \quad \mu_{s,d} = \frac{1}{B}\sum_b x_{b,s,d}
normalise: over the batch dimension B, giving one μ, σ per (S,D) position
squash: B ← averaged · S kept · D kept
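problem: batch statistics are noisy at small B and degenerate at B=1 (the batch variance is zero), so evaluation uses a running mean/variance accumulated during training instead of batch statistics.
learnable: γ, β per (S,D) position, applied as scale and shift after normalisation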
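The batch-dimension averaging above can be sketched in NumPy. This is a minimal illustration that follows this page's convention of one μ, σ per (S,D) position; the function name `batch_norm`, the `eps` term, and the `momentum` update rule for the running statistics are assumptions made for the example, and framework implementations differ in detail.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.1, eps=1e-5):
    """Normalise x of shape [B, S, D] over the batch axis (axis 0)."""
    if training:
        mu = x.mean(axis=0)    # shape [S, D]: one mean per (S, D) position
        var = x.var(axis=0)    # shape [S, D]
        # Assumed update rule: exponential moving average for eval-time use.
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # Eval: ignore the current batch and reuse the stored statistics.
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

rng = np.random.default_rng(0)
B, S, D = 8, 4, 16
x = rng.normal(loc=3.0, scale=2.0, size=(B, S, D))
gamma, beta = np.ones((S, D)), np.zeros((S, D))

# Training step: statistics come from the batch itself.
y, rm, rv = batch_norm(x, gamma, beta, np.zeros((S, D)), np.ones((S, D)))

# Eval step with a single sample: only the running statistics are used.
y_eval, _, _ = batch_norm(x[:1], gamma, beta, rm, rv, training=False)
```

In training mode each (S,D) position of `y` has roughly zero mean and unit variance across the batch; in eval mode the output depends only on the stored statistics, which is why a single sample can still be processed.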
LN: Layer Normalization (Transformers · BERT · GPT)
\hat{x}_{b,s,d} = \frac{x_{b,s,d} - \mu_{b,s}}{\sigma_{b,s}}, \quad \mu_{b,s} = \frac{1}{D}\sum_d x_{b,s,d}
normalise: over the embedding dimension D, giving one μ, σ per token (b,s)
squash: B kept · S kept · D ← averaged
why LLMs: independent of B, so it works at B=1, needs no running statistics, and behaves identically at train and eval time
learnable: γ, β of size D, one scale and shift per embedding dimension
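The per-token averaging above can be sketched in NumPy as follows. The function name `layer_norm` and the `eps` value are assumptions for illustration. The example also normalises a single-sample batch to show that the result does not depend on B.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each token of x ([B, S, D]) over its D features."""
    mu = x.mean(axis=-1, keepdims=True)   # shape [B, S, 1]: one mean per token
    var = x.var(axis=-1, keepdims=True)   # shape [B, S, 1]
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 16))
gamma, beta = np.ones(16), np.zeros(16)

y_batch = layer_norm(x, gamma, beta)        # full batch, B = 8
y_single = layer_norm(x[:1], gamma, beta)   # first sample alone, B = 1
```

Here `y_batch[:1]` and `y_single` agree, because each token's statistics are computed from that token alone.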
RMS: RMSNorm (LLaMA · Mistral · Gemma · Qwen)
\hat{x}_{b,s,d} = \frac{x_{b,s,d}}{\text{RMS}(x_{b,s})}, \quad \text{RMS}(x_{b,s}) = \sqrt{\frac{1}{D}\sum_d x_{b,s,d}^2}
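vs LayerNorm: no mean subtraction, only RMS scaling. It matches LayerNorm without β exactly when the features have zero mean; activations are not guaranteed to be zero-mean, but skipping the re-centering has worked well in practice.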
squash: B kept · S kept · D → RMS only
why faster: skips the mean computation and subtraction for each token. The FLOP saving is negligible, but the normalisation step has fewer operations on the critical path.
learnable: γ of size D only, no β (no re-centering)
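A minimal NumPy sketch of the RMS scaling above, compared against a LayerNorm without β. The function names `rms_norm` and `layer_norm_no_bias`, the `eps` placement inside the square root, and the shift of 5.0 are assumptions chosen for illustration.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Scale each token of x ([B, S, D]) by its root mean square over D."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def layer_norm_no_bias(x, gamma, eps=1e-6):
    """LayerNorm without the β shift, for comparison."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8))
gamma = np.ones(8)

# Zero-mean features: RMS equals the standard deviation, so both agree.
x_centered = x - x.mean(axis=-1, keepdims=True)
same = np.allclose(rms_norm(x_centered, gamma),
                   layer_norm_no_bias(x_centered, gamma))

# Shifted features: RMSNorm keeps the offset, LayerNorm removes it.
x_shifted = x + 5.0
differ = not np.allclose(rms_norm(x_shifted, gamma),
                         layer_norm_no_bias(x_shifted, gamma))
```

The comparison makes the trade-off concrete: the two methods only diverge to the extent that a token's features have a non-zero mean.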
GN: Group Normalization (Vision models · small batch size)
\hat{x}_{b,s,d} = \frac{x_{b,s,d} - \mu_{b,s,g}}{\sigma_{b,s,g}}, \quad g = \lfloor d \cdot G / D \rfloor
normalise: D is split into G groups; normalise within each group, per token
squash: B kept · S kept · D/G ← averaged
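B-free: like LayerNorm, statistics are computed within a single sample. G=1 → LayerNorm. G=D → InstanceNorm holds in the image setting, where each group's statistics also pool over spatial positions; per token as defined here, G=D would leave a single element per group.
learnable: γ, β of size D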
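The grouping above can be sketched in NumPy by reshaping D into G contiguous groups, which matches g = ⌊d·G/D⌋ for zero-indexed d when G divides D. The function name `group_norm`, the `eps` value, and the divisibility requirement are assumptions for this per-token variant.

```python
import numpy as np

def group_norm(x, gamma, beta, num_groups, eps=1e-5):
    """Normalise x ([B, S, D]) within each of num_groups feature groups, per token."""
    B, S, D = x.shape
    if D % num_groups != 0:
        raise ValueError("D must be divisible by num_groups")
    # Feature d lands in group d // (D // num_groups).
    xg = x.reshape(B, S, num_groups, D // num_groups)
    mu = xg.mean(axis=-1, keepdims=True)    # one mean per (b, s, group)
    var = xg.var(axis=-1, keepdims=True)
    x_hat = ((xg - mu) / np.sqrt(var + eps)).reshape(B, S, D)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8))
gamma, beta = np.ones(8), np.zeros(8)

y_g4 = group_norm(x, gamma, beta, num_groups=4)   # groups of 2 features
y_g1 = group_norm(x, gamma, beta, num_groups=1)   # one group: same as LayerNorm
```

With `num_groups=1` the single group spans all of D, so the result coincides with per-token LayerNorm.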
IN: Instance Normalization (Style transfer · generative models)
\hat{x}_{b,s,d} = \frac{x_{b,s,d} - \mu_{b,d}}{\sigma_{b,d}}, \quad \mu_{b,d} = \frac{1}{S}\sum_s x_{b,s,d}
normalise: over the sequence dimension S, giving one μ, σ per (batch, feature)
squash: B kept · S ← averaged · D kept
effect: removes per-instance style information, which is useful for style transfer where you want to inject a new style
in NLP: rarely used, since averaging over sequence positions mixes statistics across tokens at different positions
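The "removes per-instance style" effect can be illustrated with a small NumPy sketch. Here "style" is modelled as a per-feature scale and shift applied to one sample, which is an assumption made for the example, as are the function name `instance_norm` and the `eps` value.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalise x ([B, S, D]) over the sequence axis, per sample and feature."""
    mu = x.mean(axis=1, keepdims=True)    # shape [B, 1, D]
    var = x.var(axis=1, keepdims=True)    # shape [B, 1, D]
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
content = rng.normal(size=(1, 32, 4))

# Hypothetical "style": a positive per-feature scale and a per-feature shift.
style_scale = rng.uniform(0.5, 2.0, size=(1, 1, 4))
style_shift = rng.normal(size=(1, 1, 4))
styled = content * style_scale + style_shift

out_content = instance_norm(content)
out_styled = instance_norm(styled)
```

Up to the small effect of `eps`, `out_content` and `out_styled` coincide: the per-feature scale and shift are exactly what the per-instance statistics remove.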
CMP: Side-by-Side Comparison
| Method | Avg over | Stats per | B=1 safe? | Train=Eval? | Use in LLMs? |
|---|---|---|---|---|---|
| BatchNorm | B | (S,D) position | ✗ No | ✗ running stats | Rarely |
| LayerNorm | D | token (b,s) | ✓ Yes | ✓ same | Standard (GPT, BERT) |
| RMSNorm | D (no mean) | token (b,s) | ✓ Yes | ✓ same | ⭐ LLaMA, Mistral… |
| GroupNorm | D/G group | token × group | ✓ Yes | ✓ same | Rarely (vision models) |
| InstanceNorm | S | (b,d) pair | ✓ Yes | ✓ same | No (style transfer) |
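The "Avg over" column can be read as a choice of reduction axes on the same [B, S, D] tensor. The sketch below expresses four of the methods through one hypothetical helper, `normalize`, that differs only in its `axes` and `center` arguments. Learnable parameters and running statistics are left out, and GroupNorm is omitted because it needs a reshape of D first; all names here are illustrative assumptions.

```python
import numpy as np

def normalize(x, axes, center=True, eps=1e-5):
    """Normalise x over the given axes; center=False gives RMS-only scaling."""
    if center:
        x = x - x.mean(axis=axes, keepdims=True)
    # After centering, the mean of squares is the variance.
    scale = np.sqrt(np.mean(x ** 2, axis=axes, keepdims=True) + eps)
    return x / scale

# Axis 0 = B, axis 1 = S, axis 2 = D.
REDUCE_AXES = {
    "BatchNorm": (0,),
    "LayerNorm": (2,),
    "RMSNorm": (2,),
    "InstanceNorm": (1,),
}

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=(8, 4, 16))

outputs = {
    name: normalize(x, axes, center=(name != "RMSNorm"))
    for name, axes in REDUCE_AXES.items()
}
```

Each output has the same shape as `x`; what changes is which axis has been standardised.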