LLM Sampling: Every Decoding Strategy, With Numbers

The Loop, and Our Running Example

Generation is a loop: feed the context in, get one vector of logits out (one raw score per vocabulary token), turn it into a distribution with softmax, choose one token by some rule, append it, repeat. Every method in this blog is a different "choose one token by some rule".

p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\qquad\quad x_{t+1} \sim \text{rule}\big(p\big),\quad \text{append, repeat}

z = logits, p = the next token distribution. The model's job ends at p. The rule is ours.

Our fixed worked example for the whole blog. The context is The cat sat on the ___ and we pretend the vocabulary has six tokens. The model returns these logits, and softmax turns them into these probabilities:

One Forward Pass: Logits In, a Distribution Out

Logits are arbitrary real scores; softmax exponentiates and normalizes. mat dominates but does not monopolize: 59 percent. sofa and floor are sensible. moon is unlikely and banana is noise, but neither is exactly zero, and that tail is where degenerate outputs come from. Every method below is a different way of handling this one distribution.

Memorize the row; it returns in every section: mat .590, sofa .217, floor .132, bed .048, moon .011, banana .001 (each rounded to three decimals), from logits 6.0, 5.0, 4.5, 3.5, 2.0, 0.0.
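To make the row reproducible, here is a minimal softmax sketch in plain Python. The function name and the max-subtraction trick (a standard numerical stability step) are our additions, not something the model or any particular library prescribes.

```python
import math

def softmax(logits):
    """Exponentiate and normalize. Subtracting the max keeps exp() from overflowing."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["mat", "sofa", "floor", "bed", "moon", "banana"]
logits = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]
probs = softmax(logits)

for tok, p in zip(tokens, probs):
    print(f"{tok:>6}  {p:.3f}")
```

Shifting every logit by the same constant leaves the output unchanged, which is why subtracting the max is safe.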

Greedy Decoding, and Why the Obvious Thing Fails

The obvious rule: take the argmax. Here that means mat, every single time, deterministically. Greedy is fast, reproducible, and for short factual answers often fine. For open ended text it has a famous disease: repetition loops. Once a phrase appears, its presence in the context raises the probability of it appearing again; argmax takes it again, which raises the probability further, and the model spirals into "I'm sorry. I'm sorry. I'm sorry." This is a positive feedback loop between the context and a deterministic rule, and the model itself assigns the looped text high probability while any human rates it garbage.

The Greedy Spiral

Each repetition of the phrase raises its own probability in the next step, and argmax has no mechanism to ever step off the track. The distribution under the curve is healthy; the deterministic rule is what locks the trajectory. A single sampled token would have broken the loop.

BUT WAIT: Isn't the most likely text by definition the best text? Why would we ever want anything except the highest probability sequence?

This is the deepest question in decoding, and the answer is no, for two separate reasons.

Likelihood and quality are different objectives. The model was trained to predict human text, and human text is not maximally predictable. Natural language carries a steady rate of surprise: specific word choices, new information, turns of phrase. If you select for minimum surprise at every step, you get text that is more predictable than any text a human would write: safe, repetitive, hollow. The nucleus sampling paper measured exactly this: human written continuations live in a band of moderate per token probability, while maximization style decoding lives far above that band, in a probability zone humans simply do not occupy.

Greedy does not even find the most likely sequence. Argmax is locally optimal per step, but sequence probability is a product over steps, and a slightly worse token now can unlock a much better continuation later. Finding the true argmax sequence is intractable (the search tree is vocabulary to the power length), and beam search, the practical approximation in §5, shows that even when you search harder for high likelihood, open ended quality gets worse, not better. The objective itself is wrong for creativity, not just the optimizer.

The honest synthesis: maximization is right when there is one correct answer whose form matters (translation, transcription, structured extraction). Sampling is right when the task has many valid continuations and blandness is a failure mode. Most of this blog is about controlling sampling's randomness rather than eliminating it.

Temperature: One Knob for the Shape of Randomness

The gentlest intervention: divide the logits by T before softmax.

p_i(T) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}

T below 1 sharpens (amplifies logit gaps), T above 1 flattens (shrinks them). As T approaches 0 the rule becomes argmax; as T grows without bound the distribution becomes uniform.

Run the worked example through three temperatures, exact numbers:

T = 0.5: logits double to 12, 10, 9, 7, 4, 0. Result: mat .839, sofa .114, floor .042, bed .006. The distribution sharpens hard, the tail dies.
T = 1: the original row: mat .590, sofa .217, floor .132.
T = 2: logits halve to 3, 2.5, 2.25, 1.75, 1, 0. Result: mat .392, sofa .238, floor .185, bed .112, moon .053, banana .020. The tail wakes up: banana is now a 1 in 50 event per step, and a 200 token generation will roll those dice 200 times.
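The three rows above can be reproduced with a few lines of Python. This is a sketch: the function name is ours, and the guard against T ≤ 0 is an assumption about how you would want to handle the argmax limit.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for the T -> 0 limit")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["mat", "sofa", "floor", "bed", "moon", "banana"]
logits = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    row = ", ".join(f"{tok} {p:.3f}" for tok, p in zip(tokens, probs))
    print(f"T = {t}: {row}")
```

Note that the ranking of tokens never changes with T; only the gaps between them do.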

The Same Logits at Three Temperatures

Low temperature concentrates mass on the favorite, high temperature redistributes it toward the tail. The danger of high T is not the visible tokens, it is the long tail of thousands of banana grade tokens in a real vocabulary, each individually negligible, collectively a per step risk of derailment. This is exactly why temperature is almost always paired with a cutoff from the next section.

The Cutoff Family: Top-k, Top-p, Min-p

Temperature reshapes the distribution but never removes the tail. The cutoff family deletes it: zero out the junk, renormalize what survives, sample from that. Three rules for where to draw the line.

Top-k: keep the k best

Sort, keep the k highest, renormalize. With k = 3 on the worked example, the survivors are mat, sofa, floor with mass .939, and renormalizing gives mat .628, sofa .231, floor .141. bed, moon, banana are now impossible, at any temperature.
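A minimal top-k sketch on the worked example, in plain Python. The helper names are ours. Because it starts from the unrounded softmax rather than the three-decimal row, the printed values can differ from the figures quoted above in the third decimal.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero the rest, renormalize."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k most probable tokens (ties broken by position).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

tokens = ["mat", "sofa", "floor", "bed", "moon", "banana"]
logits = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]
probs = softmax(logits)
kept = top_k_filter(probs, 3)

for tok, p in zip(tokens, kept):
    print(f"{tok:>6}  {p:.3f}")
```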

Top-p (nucleus): keep the smallest set covering probability p

Sort, walk down accumulating probability, stop once the running sum reaches p, renormalize the kept set. With p = 0.9: cumulative .590, .807, .939, stop. Same three survivors here. The difference from top-k is invisible on one example and decisive across contexts, because the kept set size adapts to the model's confidence:

\text{keep the smallest } S:\ \sum_{i \in S} p_i \geq p \qquad\text{(tokens sorted descending)}

A confident step keeps 1 token, an open step keeps 50. k cannot do both with one value.
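Here is a sketch of the nucleus rule that also shows the adaptivity claim. The function name and the two extra toy distributions (one peaked, one flat) are illustrative assumptions, not data from a real model.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, running = [], 0.0
    for i in order:
        keep.append(i)
        running += probs[i]
        if running >= p:
            break
    kept = set(keep)
    # Renormalize by the mass that survived the cut.
    return [probs[i] / running if i in kept else 0.0 for i in range(len(probs))]

def kept_count(filtered):
    return sum(1 for q in filtered if q > 0)

probs = softmax([6.0, 5.0, 4.5, 3.5, 2.0, 0.0])   # the worked example
peaked = [0.97, 0.01, 0.01, 0.01]                  # hypothetical confident step
flat = [0.125] * 8                                 # hypothetical open step

for name, dist in [("worked", probs), ("peaked", peaked), ("flat", flat)]:
    print(name, "keeps", kept_count(top_p_filter(dist, 0.9)), "tokens")
```

The same p = 0.9 yields a different shortlist size for each shape, which is the property a single k cannot reproduce.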

Min-p: keep everything within a fraction of the best

The newest member: keep token i if p_i ≥ p_min × p_max. With p_min = 0.1: the bar is 0.1 × .590 = .059. mat, sofa, floor pass; bed at .048 just misses. The threshold scales with confidence automatically, like top-p, but it is anchored to the peak rather than the cumulative mass, which behaves better at high temperatures: heating flattens the peak, the bar drops with it, and diversity rises gracefully instead of suddenly admitting the deep tail.
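A small sketch of min-p, run on the worked logits at T = 1 and again at T = 2 to show how the bar moves with the peak. Function names are ours; the filter is applied after temperature, following the order this blog uses.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs, p_min):
    """Keep tokens whose probability is at least p_min times the peak, then renormalize."""
    threshold = p_min * max(probs)
    kept_mass = sum(q for q in probs if q >= threshold)
    return [q / kept_mass if q >= threshold else 0.0 for q in probs]

tokens = ["mat", "sofa", "floor", "bed", "moon", "banana"]
logits = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]

for t in (1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    filtered = min_p_filter(probs, 0.1)
    survivors = [tok for tok, q in zip(tokens, filtered) if q > 0]
    print(f"T = {t}: bar = {0.1 * max(probs):.3f}, survivors = {survivors}")
```

On these six tokens, heating from T = 1 to T = 2 widens the kept set while still excluding the least likely token.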

Three Rulers on the Same Bars

The worked distribution, sorted, with each rule's cut drawn. Top-k counts bars, top-p accumulates area, min-p sets a height bar relative to the tallest. On this distribution all three happen to keep the same three tokens. The next figure shows why they part ways.

Why Adaptive Beats Fixed: Two Extreme Contexts

Left, a peaked context: Paris is the capital of ___. The model puts 97 percent on one token. Top-p keeps exactly that one; a fixed k = 3 forcibly admits two junk tokens that should be impossible. Right, a flat creative context: She opened the door and saw ___, dozens of equally plausible continuations. Top-p keeps around 40 of them; k = 3 amputates the diversity that was the entire point. One fixed number cannot serve both shapes, an adaptive rule serves both automatically.

A production default you will see everywhere: temperature 0.7 to 1.0 combined with top-p 0.9 to 0.95, or increasingly min-p 0.05 to 0.1 for high temperature creative work. Order of operations matters, and a common convention is: temperature first, then the cutoff, then renormalize, then sample. Libraries differ in their default ordering, so check the one you use.
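The order temperature, cutoff, renormalize, sample can be sketched as one function. This is an illustration under our own assumptions (top-p as the cutoff, a function name and defaults we made up), not the code path of any specific inference library.

```python
import math
import random

def sample_next(logits, temperature=0.8, top_p=0.9, rng=random):
    """Temperature -> softmax -> top-p cutoff -> renormalize -> sample one token id."""
    # 1. Temperature on the logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Top-p cutoff on the tempered distribution.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, running = [], 0.0
    for i in order:
        kept.append(i)
        running += probs[i]
        if running >= top_p:
            break
    # 3 and 4. random.choices treats weights as relative, so it renormalizes for us.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

tokens = ["mat", "sofa", "floor", "bed", "moon", "banana"]
logits = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]
rng = random.Random(0)  # fixed seed only so the demo is repeatable
print([tokens[sample_next(logits, rng=rng)] for _ in range(10)])
```

Swapping steps 1 and 2 would apply the cutoff to the untempered distribution, which can keep a different set of tokens for the same settings.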

Beam Search: When You Actually Want the Argmax Sequence

Greedy is locally optimal and globally wrong: sequence probability is a product, and a slightly worse word now can buy a much better continuation. Beam search keeps the B best partial sequences alive at every step, extends them all, and keeps the best B of the extensions.

A two step worked example. Step one offers A .40 and B .35. Greedy grabs A. But A's best continuation is worth .50 while B's is worth .90:

P(A,\,\text{best}) = .40 \times .50 = .20 \qquad\quad P(B,\,\text{best}) = .35 \times .90 = .315

The greedy path loses to a path it never explored. Beam width 2 keeps both alive at step one and finds the .315 sequence.

The Tree Greedy Never Sees

Greedy commits to the locally best branch and inherits its mediocre subtree. Beam search carries two candidates one step further and lets the products decide. In practice scores are summed log probabilities, and a length penalty divides by length to the power α, because raw products systematically favor short sequences.
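The two step example fits in a toy beam search. Only the numbers .40, .35, .50 and .90 come from the text; the token names x, y, z, the third branch C, and the remaining probabilities are filler we invented so each row sums to 1. Scores are summed log probabilities; no length penalty is applied because every hypothesis here has the same length.

```python
import math

# Hypothetical toy "model": next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"A": 0.40, "B": 0.35, "C": 0.25},
    ("A",): {"x": 0.50, "y": 0.30, "z": 0.20},
    ("B",): {"x": 0.90, "y": 0.05, "z": 0.05},
    ("C",): {"x": 0.40, "y": 0.30, "z": 0.30},
}

def beam_search(model, steps, beam_width):
    """Keep the beam_width best prefixes at each step, scored by summed log-probability."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy = beam_search(TOY_MODEL, steps=2, beam_width=1)[0]  # width 1 is greedy
beam2 = beam_search(TOY_MODEL, steps=2, beam_width=2)[0]

for name, (seq, score) in [("greedy", greedy), ("beam 2", beam2)]:
    print(f"{name}: {' '.join(seq)}  P = {math.exp(score):.3f}")
```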

The interview nuance: beam search is the right tool exactly where the §2 collapsible said maximization is right, closed tasks with a correct answer: translation, speech recognition, summarization with tight fidelity needs. On open ended generation, larger beams make output worse: blander, more repetitive, the high probability boredom zone, searched more thoroughly. Knowing that beam quality degrades with beam size on creative tasks is a classic interview checkpoint.

Repetition Penalties: Editing the Logits by History

The cutoff family looks only at the current distribution. Penalties look at the history and push down tokens that already appeared. Two formulations dominate, and interviews love asking the difference.

Multiplicative repetition penalty (Hugging Face, CTRL): for every token already in the context, divide its logit by θ if positive, multiply if negative, with θ around 1.1 to 1.3:

z_i' = \begin{cases} z_i / \theta & z_i > 0\\ z_i \cdot \theta & z_i \le 0\end{cases}\qquad\text{for every } i \text{ already generated}

Asymmetric on purpose: both branches lower the logit. Positive logits shrink toward zero and negative logits move further below it, so a seen token always becomes less likely.

Additive frequency and presence penalties (OpenAI API): subtract from the logit a term per occurrence count, plus a flat term for having occurred at all:

z_i' = z_i \;-\; c_i \cdot \lambda_{freq} \;-\; \mathbf{1}[c_i > 0] \cdot \lambda_{pres}

c_i = how many times token i has been generated. Frequency scales with repetition, presence is one flat tax for ever appearing: the first nudges away from overuse, the second nudges toward new vocabulary.
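The additive formula translates almost symbol for symbol into code. This is a sketch of the equation above, not the implementation of any particular API; the function name, the example history, and the penalty values 0.5 and 0.3 are illustrative assumptions.

```python
from collections import Counter

def apply_frequency_presence(logits, generated_ids, freq_penalty, presence_penalty):
    """z_i' = z_i - c_i * lambda_freq - 1[c_i > 0] * lambda_pres"""
    counts = Counter(generated_ids)  # c_i; missing ids count as 0
    return [
        z - counts[i] * freq_penalty - (presence_penalty if counts[i] > 0 else 0.0)
        for i, z in enumerate(logits)
    ]

tokens = ["mat", "sofa", "floor", "bed", "moon", "banana"]
logits = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]
history = [0, 0, 1]  # hypothetical: mat generated twice, sofa once

penalized = apply_frequency_presence(logits, history, freq_penalty=0.5, presence_penalty=0.3)
for tok, before, after in zip(tokens, logits, penalized):
    print(f"{tok:>6}  {before:4.1f} -> {after:4.1f}")
```

Tokens that never appeared keep their logits unchanged; mat pays the frequency term twice and the presence term once.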

Worked example: suppose mat was already generated once and θ = 1.25. Its logit drops from 6.0 to 4.8, and recomputing softmax over the row gives a new leader:

One Penalty Application, Recomputed Exactly

Before: mat .590 dominates. After dividing its logit by 1.25: mat falls to .303 and sofa, whose logit is untouched, rises to .370 and takes the lead. A single multiplicative penalty flipped the argmax and broke the would be loop of §2. The risk of overdoing it: penalize too hard and the model starts avoiding necessary words, pronouns, code keywords, the character's own name.
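The recomputation can be checked with a short sketch of the multiplicative rule. The function names are ours; the rule itself is the divide-if-positive, multiply-otherwise formula given earlier in this section.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, seen_ids, theta):
    """Divide positive logits by theta, multiply non-positive ones, for seen tokens only."""
    out = list(logits)
    for i in set(seen_ids):
        out[i] = out[i] / theta if out[i] > 0 else out[i] * theta
    return out

tokens = ["mat", "sofa", "floor", "bed", "moon", "banana"]
logits = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]

penalized = apply_repetition_penalty(logits, seen_ids=[0], theta=1.25)  # mat seen once
before, after = softmax(logits), softmax(penalized)

for tok, b, a in zip(tokens, before, after):
    print(f"{tok:>6}  {b:.3f} -> {a:.3f}")
print("new argmax:", tokens[after.index(max(after))])
```

Unlike the additive frequency penalty, this rule does not grow with the number of repeats: a token seen once and a token seen ten times get the same discount.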

Speculative Decoding: Same Output, Several Times Faster

Everything so far changed what token gets picked. Speculative decoding changes how fast tokens arrive while provably leaving the distribution untouched. The pain it solves: a big model generates one token per forward pass, and each pass is slow and memory bound. The trick: let a tiny fast draft model guess several tokens ahead, then have the big target model verify all of them in a single parallel pass.

Why verification is one pass and not many: a prefix whose tokens are already known can be scored at every position at once, the same property that makes training parallel. So the big model checks a 5 token guess in roughly the time it would have taken to produce 1.

The acceptance rule is the elegant part, and it is exact. For each drafted token, compare the target probability q to the draft probability p:

\text{accept with probability } \min\!\Big(1,\ \frac{q(x)}{p(x)}\Big);\quad\text{on reject, resample from } \max(0,\ q - p)\text{ normalized}

This is rejection sampling tuned so the accepted stream is distributed exactly as the target model alone. Speed with zero quality cost, mathematically guaranteed.
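A sketch of the acceptance rule for a single drafted token, plus a small simulation that checks the claim empirically. The two toy distributions, the function names, and the trial count are assumptions for illustration; q is the target and p the draft, as in the formula above.

```python
import random

def speculative_accept(token, draft_probs, target_probs, rng=random):
    """Return (accepted, token): keep the drafted token, or resample from the residual."""
    p_draft = draft_probs[token]    # > 0 because the draft actually sampled this token
    q_target = target_probs[token]
    if rng.random() < min(1.0, q_target / p_draft):
        return True, token
    # Rejected: draw from max(0, q - p), normalized (random.choices normalizes weights).
    residual = [max(0.0, q - p) for q, p in zip(target_probs, draft_probs)]
    return False, rng.choices(range(len(residual)), weights=residual, k=1)[0]

def simulate(draft_probs, target_probs, trials, seed=0):
    """Empirical distribution of the final token over many draft-then-verify rounds."""
    rng = random.Random(seed)
    counts = [0] * len(target_probs)
    for _ in range(trials):
        drafted = rng.choices(range(len(draft_probs)), weights=draft_probs, k=1)[0]
        _, final = speculative_accept(drafted, draft_probs, target_probs, rng)
        counts[final] += 1
    return [c / trials for c in counts]

draft = [0.5, 0.3, 0.2]    # hypothetical draft distribution p
target = [0.2, 0.5, 0.3]   # hypothetical target distribution q
print(simulate(draft, target, trials=20000))
```

The printed frequencies should sit close to the target distribution rather than the draft, even though every proposal came from the draft.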

Worked example. After the prompt The, the draft proposes cat sat on the mat, 5 tokens, in 5 cheap passes. The target verifies in 1 pass and agrees with the first 4; on the 5th it wanted sofa, so it rejects mat there and substitutes. Net: 5 tokens produced in the time of roughly 1 big pass plus the cheap draft, instead of 5 big passes. Typical real speedups are 2 to 3 times, larger when the draft is accurate.

Draft Many Cheaply, Verify All at Once

Top lane: the small draft model races ahead proposing five tokens. Middle: the big target model scores all five positions in a single parallel pass. Bottom: the accept and reject ledger, four greens accepted, the fifth rejected and corrected by a draw from the target's residual. The output is distributed exactly as a sample from the target model, only produced in far fewer big passes. Variants: Medusa adds extra prediction heads instead of a separate draft model; EAGLE drafts in feature space; lookahead decoding drops the draft model entirely.

Where Sampling Meets Everything Else You Know

Decoding is not a sealed box at the end. It touches alignment, structured output, and RL directly, and interviewers probe these seams.

Sampling is the data source for RLHF and GRPO. When GRPO samples G responses per prompt, it samples them with temperature and top-p. Those knobs set the diversity of the group, which sets the spread of rewards, which sets the variance of the advantage estimate. Temperature too low and all G responses are near identical, the group has nothing to learn from; too high and they are noise. The decoding settings are a hyperparameter of training, not just inference.
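To see why a collapsed group teaches nothing, here is a sketch of the group-normalized advantage GRPO uses: each reward minus the group mean, divided by the group standard deviation. The reward values, the eps constant, and the choice of population standard deviation are assumptions; implementations differ on such details.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for G = 4 sampled responses to the same prompt.
collapsed = [1.0, 1.0, 1.0, 1.0]   # temperature too low: near identical responses
diverse = [0.0, 0.0, 1.0, 1.0]     # enough spread to tell good from bad

print("collapsed:", group_advantages(collapsed))
print("diverse:  ", group_advantages(diverse))
```

When every response earns the same reward, every advantage is zero and the group contributes no gradient signal.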

Constrained and structured decoding. To force valid JSON or a regex or a grammar, mask the logits: at each step set the logits of every token that would violate the grammar to negative infinity, then sample normally from what remains. The model proposes, the grammar disposes, and the output is guaranteed to follow the grammar as long as generation is not cut off before it completes. This is how strict function calling and tool schemas are typically enforced.
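The masking step itself is tiny. In this sketch the "grammar" is a stand-in: a hypothetical set of allowed token ids for the current step, which a real system would compute from a JSON schema, regex, or grammar state. Function names are ours.

```python
import math

def mask_logits(logits, allowed_ids):
    """Set every disallowed token's logit to -inf so softmax gives it probability 0."""
    allowed = set(allowed_ids)
    if not allowed:
        raise ValueError("the grammar must allow at least one token")
    return [z if i in allowed else -math.inf for i, z in enumerate(logits)]

def softmax(logits):
    m = max(logits)                            # finite, since one token is allowed
    exps = [math.exp(z - m) for z in logits]   # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["mat", "sofa", "floor", "bed", "moon", "banana"]
logits = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]
allowed_now = [1, 3]  # hypothetical: the grammar only permits sofa or bed here

probs = softmax(mask_logits(logits, allowed_now))
for tok, p in zip(tokens, probs):
    print(f"{tok:>6}  {p:.3f}")
```

The surviving tokens keep their relative odds from the model; the mask only removes options, it does not rerank them.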

Guidance steers the distribution before you sample. Classifier free guidance, the same idea from the diffusion and flow matching blogs, has a text analog: combine a conditional and unconditional logit vector to sharpen toward the prompt. Contrastive decoding subtracts a weak model's logits from a strong model's to amplify exactly what expertise adds. Both edit p before the rule of §1 ever runs.

One Distribution, Many Hands Reaching for It

The next token distribution sits in the center. Around it, every actor that touches it: temperature reshapes, cutoffs prune, penalties edit by history, grammar masks for validity, guidance sharpens toward intent, and RL training consumes whole sampled sequences as its data. Decoding is the junction where the model, the user's intent, and the training loop all meet on the same vector of numbers.

The Interview Cheat Sheet

Method	What it does	One line intuition	Use when
Greedy	argmax every step	most likely next token, deterministic	one right answer, short, reproducible
Beam search	keep B best partial sequences	search for the most likely sequence, not the most likely token	translation, ASR, closed tasks
Temperature	scale logits by 1/T	one knob for sharpness of randomness	always, as the base randomness control
Top-k	keep k highest, renormalize	fixed size shortlist	simple, predictable cap
Top-p (nucleus)	keep smallest set covering p	shortlist that adapts to confidence	the modern default for open text
Min-p	keep tokens above p_min × peak	height bar relative to the favorite	high temperature creative work
Repetition / freq / presence	push down seen tokens	edit logits by history	kill loops, broaden vocabulary
Speculative	draft then parallel verify	same distribution, fewer big passes	whenever latency matters; no quality cost
Constrained	mask invalid logits	grammar disposes, model proposes	JSON, tool calls, regex output

The four questions an interviewer is really testing

Do you know likelihood is not quality? The §2 collapsible: human text lives in a moderate probability band, maximization overshoots it into bland repetition, which is why we sample at all.

Do you understand adaptivity? Top-p and min-p adjust their kept set to the model's per step confidence; top-k cannot. The peaked capital of France versus the flat open door contrast is the canonical proof.

Do you know the order of operations? Temperature, then cutoff, then renormalize, then sample, with penalties applied to logits before softmax. Getting this wrong silently changes behavior.

Do you see the seams? Decoding is the data generator for RL, the enforcement point for structured output, and the target of guidance. The settings you pick at inference are also training hyperparameters.

The one paragraph summary

The model outputs a distribution; decoding is the rule that turns it into a token. Greedy and beam chase likelihood and suit closed tasks but go bland on open ones, because likelihood is not quality. Temperature sets the randomness, and a cutoff, top-k fixed, top-p and min-p adaptive, removes the dangerous tail so the randomness stays sane. Penalties edit logits by history to break loops. Speculative decoding makes it faster with zero quality cost by drafting and verifying. And the whole apparatus is the seam where the model, the user's intent through guidance and grammars, and the RL training loop all meet on one vector of numbers.

LLM Sampling: the Last Inch of Generation