The Loop, and Our Running Example
Generation is a loop: feed the context in, get one vector of logits out (one raw score per vocabulary token), turn it into a distribution with softmax, choose one token by some rule, append it, repeat. Every method in this blog is a different "choose one token by some rule".
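The loop can be sketched in a few lines. This is a minimal illustration, not any library's API: `toy_next_logits` is a hypothetical stand-in for a real model over a three token vocabulary, and `greedy` and `sample` are two example versions of "choose one token by some rule".

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities (subtract the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(next_logits, choose, context, max_new_tokens):
    """Decoding loop: context in, logits out, softmax, choose, append, repeat."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        probs = softmax(next_logits(tokens))
        tokens.append(choose(probs))
    return tokens

def toy_next_logits(tokens):
    """Hypothetical stand-in for a model over the vocabulary {0, 1, 2}.
    It strongly prefers the token that follows the last one in the cycle."""
    favorite = (tokens[-1] + 1) % 3
    return [4.0 if i == favorite else 0.0 for i in range(3)]

def greedy(probs):
    """Rule 1: take the argmax."""
    return max(range(len(probs)), key=lambda i: probs[i])

rng = random.Random(0)  # seeded so the sketch is reproducible

def sample(probs):
    """Rule 2: draw one token in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

greedy_out = generate(toy_next_logits, greedy, [0], 5)
sampled_out = generate(toy_next_logits, sample, [0], 5)
print(greedy_out)
print(sampled_out)
```

Only the `choose` argument changes between methods; the loop itself stays the same.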
Our fixed worked example for the whole blog: the context is "The cat sat on the ___" and we pretend the vocabulary has six tokens. The model returns these logits, and softmax turns them into these probabilities:
Memorize this row, because it returns in every section. Logits 6.0, 5.0, 4.5, 3.5, 2.0, 0.0 give mat .590, sofa .217, floor .132, bed .048, moon .011, banana .001 (rounded to three decimals).
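To check the row yourself, here is a small sketch that applies softmax to the six logits. The names `VOCAB`, `LOGITS`, and `softmax` are chosen for this illustration only.

```python
import math

VOCAB = ["mat", "sofa", "floor", "bed", "moon", "banana"]
LOGITS = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]

def softmax(logits):
    """exp of each logit divided by the sum of exps (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(LOGITS)
for token, p in zip(VOCAB, probs):
    print(f"{token:>6}  {p:.3f}")
```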
Greedy Decoding, and Why the Obvious Thing Fails
The obvious rule: take the argmax. Here that means mat, every single time, deterministically. Greedy is fast, reproducible, and for short factual answers often fine. For open ended text it has a famous disease: repetition loops. Once a phrase appears, its presence in the context raises the probability that it appears again, argmax takes it again, which raises the probability further, and the model spirals into "I'm sorry. I'm sorry. I'm sorry." This is a positive feedback loop between the context and a deterministic rule: the model assigns the looped text high probability while any human reader rates it as garbage.
BUT WAIT: Isn't the most likely text by definition the best text? Why would we ever want anything except the highest probability sequence?
This is the deepest question in decoding, and the answer is no, for two separate reasons.
Likelihood and quality are different objectives. The model was trained to predict human text, and human text is not maximally predictable. Natural language carries a steady rate of surprise: specific word choices, new information, turns of phrase. If you select for minimum surprise at every step, you get text that is more predictable than any text a human would write: safe, repetitive, hollow. The nucleus sampling paper measured exactly this: human written continuations live in a band of moderate per token probability, while maximization style decoding lives far above that band, in a probability zone humans simply do not occupy.
Greedy does not even find the most likely sequence. Argmax is locally optimal per step, but sequence probability is a product over steps, and a slightly worse token now can unlock a much better continuation later. Finding the true argmax sequence is intractable (the number of candidate sequences is the vocabulary size raised to the power of the sequence length), and beam search, the practical approximation in §5, shows that even when you search harder for high likelihood, open ended quality gets worse, not better. For creative text the objective itself is wrong, not just the optimizer.
The honest synthesis: maximization is right when there is one correct answer whose form matters (translation, transcription, structured extraction). Sampling is right when the task has many valid continuations and blandness is a failure mode. Most of this blog is about controlling sampling's randomness rather than eliminating it.
Temperature: One Knob for the Shape of Randomness
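The gentlest intervention: divide the logits by T before softmax, so p_i = exp(z_i / T) / Σ_j exp(z_j / T), where z_i is the logit of token i. T below 1 sharpens the distribution, T above 1 flattens it, and T = 1 leaves it unchanged.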
Run the worked example through three temperatures, exact numbers:
T = 0.5: logits double to 12, 10, 9, 7, 4, 0. Result: mat .839, sofa .114, floor .042, bed .006. The distribution sharpens hard, the tail dies.
T = 1: the original row: mat .590, sofa .217, floor .132.
T = 2: logits halve to 3, 2.5, 2.25, 1.75, 1, 0. Result: mat .392, sofa .238, floor .185, bed .112, moon .053, banana .020. The tail wakes up: banana is now a 1 in 50 event per step, and a 200 token generation will roll those dice 200 times.
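The three temperature rows can be reproduced with a short sketch. The function name `softmax_with_temperature` is an illustrative choice, and the last two lines assume, purely for illustration, that every one of the 200 steps sees this same T = 2 distribution, which a real generation would not.

```python
import math

VOCAB = ["mat", "sofa", "floor", "bed", "moon", "banana"]
LOGITS = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]

def softmax_with_temperature(logits, T):
    """Divide every logit by T, then apply an ordinary softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

rows = {T: softmax_with_temperature(LOGITS, T) for T in (0.5, 1.0, 2.0)}
for T, probs in rows.items():
    print(T, [round(p, 3) for p in probs])

# Illustrative assumption: the same T = 2 row at each of 200 steps.
p_banana = rows[2.0][VOCAB.index("banana")]
p_banana_in_200 = 1.0 - (1.0 - p_banana) ** 200
print(round(p_banana_in_200, 2))
```

Under that simplifying assumption, a rare tail token becomes very likely to show up at least once somewhere in a long generation, which is the motivation for the cutoff rules.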
The Cutoff Family: Top-k, Top-p, Min-p
Temperature reshapes the distribution but never removes the tail. The cutoff family deletes it: zero out the junk, renormalize what survives, sample from that. Three rules for where to draw the line.
Top-k: keep the k best
Sort, keep the k highest, renormalize. With k = 3 on the worked example, the survivors are mat, sofa, floor with mass .939, and renormalizing gives mat .628, sofa .231, floor .141. bed, moon, banana are now impossible, at any temperature.
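A minimal top-k sketch on the worked example follows. `top_k` is a name chosen for this illustration; it returns a full length probability list with zeros for the removed tokens.

```python
import math

VOCAB = ["mat", "sofa", "floor", "bed", "moon", "banana"]
LOGITS = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

kept = top_k(softmax(LOGITS), k=3)
for token, p in zip(VOCAB, kept):
    print(f"{token:>6}  {p:.3f}")
```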
Top-p (nucleus): keep the smallest set covering probability p
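Sort, walk down accumulating probability, stop once the running sum reaches p, renormalize the kept set. With p = 0.9: cumulative .590, .807, .939, stop. Same three survivors here. The difference from top-k is invisible on one example and decisive across contexts, because the kept set size adapts to the model's confidence. In a peaked context, such as naming the capital of France, one or two tokens already cover p and the kept set is tiny. In a flat context, such as an open door where almost anything can come next, many tokens are needed to reach p and the kept set grows. Top-k keeps exactly k tokens in both cases.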
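A sketch of the nucleus rule, plus a demonstration of the adaptivity. `top_p` is an illustrative name, and the `peaked` and `flat` distributions are made-up numbers, not model outputs.

```python
import math

LOGITS = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]  # mat, sofa, floor, bed, moon, banana

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, running = set(), 0.0
    for i in order:
        keep.add(i)
        running += probs[i]
        if running >= p:
            break
    return [probs[i] / running if i in keep else 0.0 for i in range(len(probs))]

def kept_size(probs):
    return sum(1 for q in probs if q > 0.0)

example = top_p(softmax(LOGITS), 0.9)

# Made-up distributions to show the kept set adapting to confidence.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.125] * 8
print(kept_size(example), kept_size(top_p(peaked, 0.9)), kept_size(top_p(flat, 0.9)))
```

With the same p, the peaked distribution keeps a single token and the flat one keeps all eight, while a fixed k would keep the same count in both.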
Min-p: keep everything within a fraction of the best
The newest member: keep token i if p_i ≥ p_min × p_max. With p_min = 0.1: the bar is 0.1 × .590 = .059. mat, sofa, floor pass; bed at .048 just misses. The threshold scales with confidence automatically, like top-p, but it is anchored to the peak rather than the cumulative mass, which behaves better at high temperatures: heating flattens the peak, the bar drops with it, and diversity rises gracefully instead of suddenly admitting the deep tail.
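The min-p rule is short enough to sketch directly. `min_p` is a name chosen for this illustration; the threshold is recomputed from whatever distribution it is given, so it follows the peak.

```python
import math

VOCAB = ["mat", "sofa", "floor", "bed", "moon", "banana"]
LOGITS = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def min_p(probs, p_min):
    """Keep token i if probs[i] >= p_min * max(probs), then renormalize."""
    bar = p_min * max(probs)
    kept = [q if q >= bar else 0.0 for q in probs]
    mass = sum(kept)
    return [q / mass for q in kept]

filtered = min_p(softmax(LOGITS), p_min=0.1)
survivors = [tok for tok, q in zip(VOCAB, filtered) if q > 0.0]
print(survivors)
```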
Production default you will see everywhere: temperature 0.7 to 1.0 combined with top-p 0.9 to 0.95, or increasingly min-p 0.05 to 0.1 for high temperature creative work. Order of operations matters and the convention is: temperature first, then the cutoff, then renormalize, then sample.
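The conventional order can be sketched as one function. This is an illustration under stated assumptions, not a specific library's sampler: `sample_next` is a hypothetical name, it uses top-p as the cutoff, and the settings 0.8 and 0.9 are arbitrary picks from the ranges above.

```python
import math
import random

VOCAB = ["mat", "sofa", "floor", "bed", "moon", "banana"]
LOGITS = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]

def sample_next(logits, temperature, top_p, rng):
    # 1. Temperature: scale the logits, then softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Cutoff: smallest set whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, running = [], 0.0
    for i in order:
        kept.append(i)
        running += probs[i]
        if running >= top_p:
            break
    # 3. Renormalize and 4. sample: rng.choices normalizes the weights itself.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded for reproducibility
draws = [VOCAB[sample_next(LOGITS, 0.8, 0.9, rng)] for _ in range(1000)]
print({tok: draws.count(tok) for tok in VOCAB})
```

Swapping steps 1 and 2 would pick the kept set from the unheated distribution, which changes which tokens survive at high temperature, so the order is part of the behavior.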
Beam Search: When You Actually Want the Argmax Sequence
Greedy is locally optimal and globally wrong: sequence probability is a product, and a slightly worse word now can buy a much better continuation. Beam search keeps the B best partial sequences alive at every step, extends them all, and keeps the best B of the extensions.
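A two step worked example. Step one offers A .40 and B .35. Greedy grabs A. But A's best continuation is worth .50 while B's is worth .90, so the best sequence through A scores .40 × .50 = .20 while the best sequence through B scores .35 × .90 = .315. A beam of width 2 keeps both A and B alive after step one and finds the B path that greedy discarded.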
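The two step example can be run through a small beam search sketch. `TABLE` is a toy lookup standing in for a model: the A .40, B .35, .50, and .90 entries come from the example, and every other number (token C and the remaining continuations) is made up so each row sums to 1.

```python
import math

# Toy conditional distributions keyed by prefix. Only A .40, B .35, and the
# .50 / .90 best continuations come from the text; the rest are filler.
TABLE = {
    (): {"A": 0.40, "B": 0.35, "C": 0.25},
    ("A",): {"x": 0.50, "y": 0.30, "z": 0.20},
    ("B",): {"x": 0.90, "y": 0.10},
    ("C",): {"x": 0.60, "y": 0.40},
}

def beam_search(table, beam_width, steps):
    """Keep the beam_width best partial sequences by total log probability."""
    beams = [((), 0.0)]  # (prefix, log probability)
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), logp + math.log(p))
            for prefix, logp in beams
            for tok, p in table[prefix].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy_path, greedy_logp = beam_search(TABLE, beam_width=1, steps=2)[0]
beam_path, beam_logp = beam_search(TABLE, beam_width=2, steps=2)[0]
print(greedy_path, round(math.exp(greedy_logp), 3))
print(beam_path, round(math.exp(beam_logp), 3))
```

Beam width 1 is exactly greedy decoding, which is why the same function covers both cases. Scores are summed in log space to avoid underflow on long sequences.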
The interview nuance: beam search is the right tool exactly where the §2 collapsible said maximization is right, closed tasks with a correct answer: translation, speech recognition, summarization with tight fidelity needs. On open ended generation, larger beams make output worse: blander, more repetitive, the high probability boredom zone, searched more thoroughly. Knowing that beam quality degrades with beam size on creative tasks is a classic interview checkpoint.
Repetition Penalties: Editing the Logits by History
The cutoff family looks only at the current distribution. Penalties look at the history and push down tokens that already appeared. Two formulations dominate, and interviews love asking the difference.
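Multiplicative repetition penalty (Hugging Face, CTRL): for every token already in the context, divide its logit by θ if positive, multiply if negative, with θ around 1.1 to 1.3. In symbols, for each seen token i: z_i becomes z_i / θ if z_i > 0, and z_i × θ otherwise. Either way the logit moves down.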
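Additive frequency and presence penalties (OpenAI API): subtract from the logit a term per occurrence count, plus a flat term for having occurred at all. In symbols: z_i becomes z_i − α_freq × count_i − α_pres × 1[count_i > 0], where count_i is how many times token i has already appeared.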
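A minimal sketch of the additive form on the worked example logits. The function name, the history, and the penalty values 0.5 and 0.5 are illustrative assumptions, not defaults of any API.

```python
from collections import Counter

LOGITS = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]  # mat, sofa, floor, bed, moon, banana

def apply_frequency_presence(logits, history, alpha_freq, alpha_pres):
    """Subtract alpha_freq per past occurrence plus alpha_pres once if seen at all."""
    counts = Counter(history)
    return [
        z - alpha_freq * counts[i] - alpha_pres * (1.0 if counts[i] > 0 else 0.0)
        for i, z in enumerate(logits)
    ]

# Hypothetical history of token ids: mat twice, sofa once.
penalized = apply_frequency_presence(LOGITS, history=[0, 0, 1],
                                     alpha_freq=0.5, alpha_pres=0.5)
print(penalized)
```

The frequency term grows with every repeat, while the presence term is paid once, so the first mainly fights loops and the second mainly pushes toward new vocabulary.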
Worked example: suppose mat was already generated once and θ = 1.25. Its logit drops from 6.0 to 4.8, and recomputing softmax over the row gives a new leader: sofa rises to about .370, ahead of mat at about .303 and floor at about .224.
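The multiplicative worked example can be checked with a short sketch. `apply_repetition_penalty` is a name chosen for this illustration and follows the divide if positive, multiply if negative rule described above.

```python
import math

VOCAB = ["mat", "sofa", "floor", "bed", "moon", "banana"]
LOGITS = [6.0, 5.0, 4.5, 3.5, 2.0, 0.0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, history, theta):
    """Push down every token id that appears in history."""
    seen = set(history)
    return [
        (z / theta if z > 0 else z * theta) if i in seen else z
        for i, z in enumerate(logits)
    ]

penalized = apply_repetition_penalty(LOGITS, history=[0], theta=1.25)  # mat seen once
probs = softmax(penalized)
leader = VOCAB[max(range(len(probs)), key=lambda i: probs[i])]
print(leader, [round(p, 3) for p in probs])
```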
Speculative Decoding: Same Output, Several Times Faster
Everything so far changed what token gets picked. Speculative decoding changes how fast tokens arrive while provably leaving the distribution untouched. The pain it solves: a big model generates one token per forward pass, and each pass is slow and memory bound. The trick: let a tiny fast draft model guess several tokens ahead, then have the big target model verify all of them in a single parallel pass.
Why verification is one pass and not many: when the tokens are already known, the model can score every position of the sequence at once, the same property that makes training parallel. So the big model checks a 5 token guess in roughly the time it would have taken to produce 1.
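The acceptance rule is the elegant part, and it is exact. For each drafted token, compare the target probability q to the draft probability p: accept the token with probability min(1, q / p). At the first rejection, discard the rest of the draft and resample that position from the residual distribution proportional to max(0, q − p). With this accept or resample rule, the output is distributed exactly as if the target model had sampled every token itself.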
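A sketch of the verification step, using the notation here (q for the target, p for the draft). `verify_draft` is a hypothetical name, the distributions are passed in as plain lists rather than computed by real models, and the bonus token that full implementations sample from the target when every draft token is accepted is left out for brevity.

```python
import random

def verify_draft(draft_tokens, draft_dists, target_dists, rng):
    """Accept or correct a drafted run of token ids.

    draft_dists[t] and target_dists[t] are the probability lists p and q
    at draft position t. Returns the tokens that end up in the output.
    """
    output = []
    for tok, p, q in zip(draft_tokens, draft_dists, target_dists):
        # Accept with probability min(1, q / p) for the drafted token.
        if rng.random() < min(1.0, q[tok] / p[tok]):
            output.append(tok)
            continue
        # First rejection: resample from the residual max(0, q - p),
        # then drop the rest of the draft.
        residual = [max(0.0, qi - pi) for qi, pi in zip(q, p)]
        output.append(rng.choices(range(len(q)), weights=residual, k=1)[0])
        break
    return output

# Empirical check on made-up numbers: the draft is overconfident in token 0,
# yet the verified output should follow the target q.
p, q = [0.8, 0.2], [0.5, 0.5]
rng = random.Random(0)
trials = 20000
zeros = 0
for _ in range(trials):
    drafted = rng.choices([0, 1], weights=p, k=1)[0]
    zeros += verify_draft([drafted], [p], [q], rng)[0] == 0
freq_zero = zeros / trials
print(round(freq_zero, 2))
```

The frequency of token 0 should land close to the target's .5 rather than the draft's .8, which is the sense in which the rule leaves the distribution untouched.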
Worked example. The draft proposes "cat sat on the mat", 5 tokens if each word is one token, in 5 cheap passes. The target verifies in 1 pass and agrees with the first 4; on the 5th it wanted sofa, so it rejects there and substitutes. Net: 5 tokens produced in the time of roughly 1 big pass plus the cheap draft, instead of 5 big passes. Typical real speedups are 2 to 3 times, larger when the draft is accurate.
Where Sampling Meets Everything Else You Know
Decoding is not a sealed box at the end. It touches alignment, structured output, and RL directly, and interviewers probe these seams.
Sampling is the data source for RLHF and GRPO. When GRPO samples G responses per prompt, it samples them with temperature and top-p. Those knobs set the diversity of the group, which sets the spread of rewards, which sets the variance of the advantage estimate. Temperature too low and all G responses are near identical, the group has nothing to learn from; too high and they are noise. The decoding settings are a hyperparameter of training, not just inference.
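The "nothing to learn from" case is easy to see in a sketch of group-relative advantages. This assumes the common formulation in which each reward is centered on the group mean and divided by the group standard deviation; the function name, the `eps` guard, and the reward values are illustrative.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Center each reward on the group mean and scale by the group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Temperature too low: G near identical responses, identical rewards.
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])
# Enough diversity: rewards spread out, advantages carry signal.
diverse = group_advantages([0.0, 1.0, 0.0, 1.0])
print(collapsed)
print([round(a, 2) for a in diverse])
```

When every response earns the same reward, every advantage is zero and the update for that prompt vanishes, so the sampling settings directly control how much signal each group provides.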
Constrained and structured decoding. To force valid JSON or a regex or a grammar, mask the logits: at each step set the logits of every token that would violate the grammar to negative infinity, then sample normally from what remains. The model proposes, the grammar disposes, and every emitted token is guaranteed to keep the output inside the grammar. This is the standard way to enforce function calling and tool schemas strictly.
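A minimal masking sketch. The vocabulary, the scores, and the `is_allowed` rule (digits only) are made up for illustration; a real grammar would decide the allowed set from the text generated so far, not from the token alone.

```python
import math

VOCAB = ["7", "42", "hello", "{", "3", "cat"]
LOGITS = [1.0, 0.5, 3.0, 2.0, 0.0, 2.5]  # made-up scores

def is_allowed(token_text):
    """Hypothetical grammar check: only tokens made of digits are valid."""
    return token_text.isdigit()

def mask_logits(logits, vocab, allowed):
    """Set the logit of every disallowed token to negative infinity."""
    masked = [z if allowed(tok) else float("-inf") for z, tok in zip(logits, vocab)]
    if all(z == float("-inf") for z in masked):
        raise ValueError("grammar allows no token at this step")
    return masked

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(mask_logits(LOGITS, VOCAB, is_allowed))
print({tok: round(p, 3) for tok, p in zip(VOCAB, probs)})
```

The model's favorite here, hello, ends up with probability exactly zero, and the remaining mass is shared among the valid tokens in their original proportions.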
Guidance steers the distribution before you sample. Classifier free guidance, the same idea from the diffusion and flow matching blogs, has a text analog: combine a conditional and unconditional logit vector to sharpen toward the prompt. Contrastive decoding subtracts a weak model's logits from a strong model's to amplify exactly what expertise adds. Both edit p before the rule of §1 ever runs.
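Both edits are simple arithmetic on logit vectors, sketched below. The guidance form used here, unconditional plus γ times the difference, is the usual classifier free guidance combination; the contrastive function follows the simplified description above, and published versions of contrastive decoding add details such as a plausibility filter that this sketch omits. All names and numbers are illustrative.

```python
def cfg_logits(cond, uncond, gamma):
    """Move from the unconditional logits toward (and past) the conditional ones."""
    return [u + gamma * (c - u) for c, u in zip(cond, uncond)]

def contrastive_scores(strong, weak):
    """Score each token by how much more the strong model favors it."""
    return [s - w for s, w in zip(strong, weak)]

cond = [6.0, 5.0, 4.5]    # made-up logits with the prompt
uncond = [4.0, 5.0, 5.0]  # made-up logits without the prompt

print(cfg_logits(cond, uncond, gamma=1.0))  # gamma 1 reproduces cond
print(cfg_logits(cond, uncond, gamma=1.5))  # gamma above 1 exaggerates the prompt's effect
print(contrastive_scores(cond, uncond))
```

The result of either function is then fed to the usual temperature, cutoff, and sampling steps.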
The Interview Cheat Sheet
| Method | What it does | One line intuition | Use when |
|---|---|---|---|
| Greedy | argmax every step | most likely next token, deterministic | one right answer, short, reproducible |
| Beam search | keep B best partial sequences | search for the likely sequence, not token | translation, ASR, closed tasks |
| Temperature | scale logits by 1/T | one knob for sharpness of randomness | always, as the base randomness control |
| Top-k | keep k highest, renormalize | fixed size shortlist | simple, predictable cap |
| Top-p (nucleus) | keep smallest set covering p | shortlist that adapts to confidence | the modern default for open text |
| Min-p | keep tokens above p_min × peak | height bar relative to the favorite | high temperature creative work |
| Repetition / freq / presence | push down seen tokens | edit logits by history | kill loops, broaden vocabulary |
| Speculative | draft then parallel verify | same distribution, fewer big passes | latency, always, free quality wise |
| Constrained | mask invalid logits | grammar disposes, model proposes | JSON, tool calls, regex output |
The four questions an interviewer is really testing
Do you know likelihood is not quality? The §2 collapsible: human text lives in a moderate probability band, maximization overshoots it into bland repetition, which is why we sample at all.
Do you understand adaptivity? Top-p and min-p adjust their kept set to the model's per step confidence; top-k cannot. The peaked capital of France versus the flat open door contrast is the canonical proof.
Do you know the order of operations? Temperature, then cutoff, then renormalize, then sample, with penalties applied to logits before softmax. Getting this wrong silently changes behavior.
Do you see the seams? Decoding is the data generator for RL, the enforcement point for structured output, and the target of guidance. The settings you pick at inference are also training hyperparameters.
The model outputs a distribution; decoding is the rule that turns it into a token. Greedy and beam chase likelihood and suit closed tasks but go bland on open ones, because likelihood is not quality. Temperature sets the randomness, and a cutoff (top-k fixed, top-p and min-p adaptive) removes the dangerous tail so the randomness stays sane. Penalties edit logits by history to break loops. Speculative decoding makes generation faster with zero quality cost by drafting and verifying. And the whole apparatus is the seam where the model, the user's intent through guidance and grammars, and the RL training loop all meet on one vector of numbers.