Transformer internals and the ideas that make large language models work. Attention, positional encoding, normalization, distributed training, and mixture of experts — each post picks one mechanism and goes all the way down.
Nine sections covering every major decoding strategy: why greedy decoding fails on longer sequences, what temperature actually does to the …
Nine sections covering the full arc of positional encoding. Starts with why attention is a set operation and order simply doesn't exist …
Five normalization methods. One tensor shape. The only difference is which dimensions you average over, and that choice has large …
Eleven sections from self-attention through FlashAttention-3. The first half covers the mechanism: QKV projections, cross-attention, …
A walk through data parallel, tensor parallel, pipeline parallel, ZeRO, FSDP, and expert parallel, and how those axes compose into the 3D …
Many recent frontier models route each token through a small subset of experts instead of one dense MLP. This piece is about why that scales …