Normalization Methods — Visual Guide

BN Batch Normalization CNNs · fully-connected layers

\hat{x}_{b,s,d} = \frac{x_{b,s,d} - \mu_{s,d}}{\sigma_{s,d}}, \quad \mu_{s,d} = \frac{1}{B}\sum_b x_{b,s,d}

normalise: over the batch dimension B, giving one μ, σ per (S,D) position

B ← averaged · S kept · D kept

problem: batch statistics are noisy at small B and degenerate at B=1, where the batch variance is zero. Evaluation therefore uses a running mean/variance collected during training.

learnable: γ, β per (S,D) position, applied as scale and shift after normalisation
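
A minimal NumPy sketch of the BatchNorm card above, assuming an input of shape (B, S, D) and this guide's convention of one μ, σ per (S,D) position. The function name `batch_norm`, the `momentum` update rule, and `eps` are illustrative assumptions rather than part of the guide; real libraries differ in details (many also pool statistics over positions).

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training, momentum=0.1, eps=1e-5):
    """BatchNorm over axis 0 (B) of x with shape (B, S, D).

    gamma, beta, running_mean and running_var all have shape (S, D).
    Returns (y, running_mean, running_var).
    """
    if training:
        mu = x.mean(axis=0)   # (S, D): averaged over the batch
        var = x.var(axis=0)   # (S, D): biased batch variance
        # Assumed update rule: exponential moving average of batch stats.
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # Eval: reuse stored statistics, so a batch of one sample is fine.
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

rng = np.random.default_rng(0)
B, S, D = 8, 4, 3
x = rng.normal(loc=2.0, scale=3.0, size=(B, S, D))
gamma, beta = np.ones((S, D)), np.zeros((S, D))
rm, rv = np.zeros((S, D)), np.ones((S, D))

# Training step: normalise with batch statistics and update running stats.
y, rm, rv = batch_norm(x, gamma, beta, rm, rv, training=True)

# Eval step on a single sample: uses the running statistics instead.
y_single, _, _ = batch_norm(x[:1], gamma, beta, rm, rv, training=False)
```

In training mode each (S,D) position of `y` has mean close to 0 and variance close to 1 across the batch; in eval mode the output depends only on the stored statistics, not on the other samples in the batch.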

LN Layer Normalization Transformers · BERT · GPT

\hat{x}_{b,s} = \frac{x_{b,s} - \mu_{b,s}}{\sigma_{b,s}}, \quad \mu_{b,s} = \frac{1}{D}\sum_d x_{b,s,d}

normalise: over the embedding dimension D, giving one μ, σ per token (b,s)

B kept · S kept · D ← averaged

why LLMs: independent of B, so it works at B=1, needs no running stats, and behaves the same at train and eval

learnable: γ, β of size D, one scale and shift per embedding dimension
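
The LayerNorm card above can be sketched in a few lines of NumPy, assuming an input of shape (B, S, D). The name `layer_norm` and the `eps` stabiliser are assumptions added for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis (D) of x with shape (B, S, D)."""
    mu = x.mean(axis=-1, keepdims=True)   # (B, S, 1): one mean per token
    var = x.var(axis=-1, keepdims=True)   # (B, S, 1): one variance per token
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta           # gamma, beta have shape (D,)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))
gamma, beta = np.ones(16), np.zeros(16)

y = layer_norm(x, gamma, beta)

# The same sample normalised alone (B=1) gives the same result,
# because no statistic crosses the batch dimension.
y_single = layer_norm(x[:1], gamma, beta)
```

Because every statistic is computed inside one token, there is nothing to store between training and evaluation.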

RMS RMSNorm LLaMA · Mistral · Gemma · Qwen

\hat{x}_{b,s} = \frac{x_{b,s}}{\text{RMS}(x_{b,s})}, \quad \text{RMS}(x_{b,s}) = \sqrt{\frac{1}{D}\sum_d x_{b,s,d}^2}

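vs LayerNorm: no mean subtraction, only RMS scaling. Identical to LayerNorm (without β) when the input already has zero mean; otherwise the mean is left in, which is reported to cost little accuracy in practice.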

B kept · S kept · D → RMS only

why faster: skips the mean computation and subtraction for each token. The FLOP saving is negligible, but there is one fewer reduction on the critical path.

learnable: γ of size D only; no β (no re-centering)
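
A hedged NumPy sketch of the RMSNorm card above, again assuming shape (B, S, D). The helper `layer_norm_no_affine` exists only to check the relation to LayerNorm; both function names and `eps` are assumptions for illustration.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over the last axis (D): scale by the root mean square only."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)  # (B, S, 1)
    return gamma * x / rms                                       # no beta

def layer_norm_no_affine(x, eps=1e-6):
    """LayerNorm without gamma/beta, used here only for comparison."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))
gamma = np.ones(16)

y = rms_norm(x, gamma)

# On inputs that already have zero mean per token, the two methods agree.
x_centered = x - x.mean(axis=-1, keepdims=True)
same_on_centered = np.allclose(rms_norm(x_centered, gamma),
                               layer_norm_no_affine(x_centered))
```

With γ set to one, each output token has a root mean square close to 1, while its mean is whatever the input's mean scales to.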

GN Group Normalization Vision models · small batch size

\hat{x}_{b,s,d} = \frac{x_{b,s,d} - \mu_{b,s,g}}{\sigma_{b,s,g}}, \quad g = \lfloor d \cdot G / D \rfloor

normalise: D is split into G groups; normalise within each group, per token

B kept · S kept · D/G ← averaged

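B-free: like LayerNorm, statistics are computed within a single sample. G=1 → LayerNorm. G=D → InstanceNorm only in the vision formulation, where each group also pools over spatial positions; per token, G=D leaves one value per group, which normalises to zero.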

learnable: γ, β of size D
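
A small NumPy sketch of the per-token GroupNorm described above, assuming shape (B, S, D), D divisible by G, and contiguous groups so that feature d (counted from 0) lands in group ⌊d·G/D⌋. The name `group_norm`, the parameter `num_groups`, and `eps` are assumptions for illustration.

```python
import numpy as np

def group_norm(x, gamma, beta, num_groups, eps=1e-5):
    """Per-token GroupNorm for x of shape (B, S, D)."""
    B, S, D = x.shape
    assert D % num_groups == 0, "D must be divisible by the number of groups"
    # Split the feature axis into (G, D/G); groups are contiguous slices of D.
    xg = x.reshape(B, S, num_groups, D // num_groups)
    mu = xg.mean(axis=-1, keepdims=True)   # (B, S, G, 1)
    var = xg.var(axis=-1, keepdims=True)   # (B, S, G, 1)
    x_hat = ((xg - mu) / np.sqrt(var + eps)).reshape(B, S, D)
    return gamma * x_hat + beta            # gamma, beta have shape (D,)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))
gamma, beta = np.ones(16), np.zeros(16)

y = group_norm(x, gamma, beta, num_groups=4)

# With a single group the statistics span all of D, as in LayerNorm.
y_one_group = group_norm(x, gamma, beta, num_groups=1)
```

Setting `num_groups=1` reproduces the per-token statistics of LayerNorm, which is the G=1 case mentioned in the card.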

IN Instance Normalization Style transfer · generative models

\hat{x}_{b,s,d} = \frac{x_{b,s,d} - \mu_{b,d}}{\sigma_{b,d}}, \quad \mu_{b,d} = \frac{1}{S}\sum_s x_{b,s,d}

normalise: over the sequence dimension S, giving one μ, σ per (batch, feature) pair

B kept · S ← averaged · D kept

effect: removes per-instance style information, which is useful for style transfer where you want to inject a new style

in NLP: rarely used, because averaging over sequence positions mixes statistics from tokens with different positions and meanings
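
The "removes per-instance style" effect can be illustrated with a short NumPy sketch, assuming shape (B, S, D) and treating a per-feature scale and offset as a stand-in for "style". The function name `instance_norm`, the `eps` value, and the toy scale/offset are assumptions for illustration.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """InstanceNorm over axis 1 (S) of x with shape (B, S, D)."""
    mu = x.mean(axis=1, keepdims=True)   # (B, 1, D): one mean per (b, d)
    var = x.var(axis=1, keepdims=True)   # (B, 1, D): one variance per (b, d)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
content = rng.normal(size=(1, 32, 4))

# Same content with a different per-instance "style":
# every feature is rescaled by 3 and shifted by 5.
styled = 3.0 * content + 5.0

x = np.concatenate([content, styled], axis=0)   # shape (2, 32, 4)
y = instance_norm(x)

# After InstanceNorm the two instances are nearly indistinguishable.
style_removed = np.allclose(y[0], y[1], atol=1e-4)
```

The scale and offset disappear because they are constant along S for each (b, d) pair, which is exactly what the per-instance mean and variance absorb.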

CMP Side-by-Side Comparison

Method	Avg over	Stats per	B=1 safe?	Train=Eval?	Use in LLMs?
BatchNorm	B	(S,D) position	✗ No	✗ running stats	Rarely
LayerNorm	D	token (b,s)	✓ Yes	✓ same	Standard (GPT, BERT)
RMSNorm	D (no mean)	token (b,s)	✓ Yes	✓ same	⭐ LLaMA, Mistral…
GroupNorm	D/G group	token × group	✓ Yes	✓ same	Rarely (vision, diffusion U-Nets)
InstanceNorm	S	(b,d) pair	✓ Yes	✓ same	Style transfer only
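
The "B=1 safe?" column can be checked with a short NumPy sketch. It assumes shape (B, S, D), omits γ and β, and uses batch statistics for BatchNorm (training mode, no running stats). The helper `standardise` and the `AXIS` mapping are hypothetical names introduced here; RMSNorm and GroupNorm are left out because they reduce over D like LayerNorm.

```python
import numpy as np

def standardise(x, axis, eps=1e-5):
    """Subtract the mean and divide by the std along one axis."""
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Which axis of (B, S, D) each method averaged over in the table above.
AXIS = {"BatchNorm": 0, "LayerNorm": 2, "InstanceNorm": 1}

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8))

results = {}
for name, axis in AXIS.items():
    full = standardise(x, axis)        # first sample normalised within the batch
    alone = standardise(x[:1], axis)   # first sample normalised on its own (B=1)
    results[name] = bool(np.allclose(full[:1], alone))
    print(name, results[name])
# BatchNorm False
# LayerNorm True
# InstanceNorm True
```

With B=1 the BatchNorm output collapses to zeros, since each value equals its own batch mean; the other two methods never look across the batch, so their output is unchanged.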

Normalization in Deep Learning