A Series · The Notebook

LLM

Transformer internals and the ideas that make large language models work: attention, positional encoding, normalization, distributed training, and mixture of experts. Each post picks one mechanism and goes all the way down.

6 Stories

~116 min Total Read

2026 Last Updated

Jun 17, 2026 18 min read

LLM Sampling

Nine sections covering every major decoding strategy: why greedy decoding fails on longer sequences, what temperature actually does to the …

Jun 11, 2026 25 min read

Positional Encoding

Nine sections covering the full arc of positional encoding. Starts with why attention is a set operation and order simply doesn't exist …

Jun 10, 2026 12 min read

Normalization Methods

Five normalization methods. One tensor shape. The difference is just which dimensions you average over, and that choice has large …

Jun 6, 2026 28 min read

Attention & FlashAttention

Eleven sections from self-attention through FlashAttention-3. The first half covers the mechanism: QKV projections, cross-attention, …

May 1, 2026 15 min read

LLM Parallelism

A walk through data parallel, tensor parallel, pipeline parallel, ZeRO, FSDP, and expert parallel, and how those axes compose into the 3D …

Apr 24, 2026 18 min read

Mixture of Experts

Most modern frontier models route each token through a small subset of experts instead of one dense MLP. This piece is about why that scales …

End of series.
