The Core Distinction: Outcome vs Process
A reward model scores model outputs. There are two places to put the score. An Outcome Reward Model (ORM) reads the whole solution and emits one number: right or wrong, good or bad. A Process Reward Model (PRM) reads the solution step by step and emits one number per step: was this line a correct, helpful move from here?
Why the difference matters, made concrete. A model solves a math problem in five steps, gets steps 1 to 4 perfectly right, makes an arithmetic slip in step 5, and lands on the wrong answer. The ORM sees a wrong answer and assigns reward 0 to the entire trajectory, punishing the four good steps along with the one bad one. The PRM assigns roughly +1, +1, +1, +1, then 0, isolating the blame to exactly where the reasoning broke.
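The five-step example can be written out as a tiny sketch. The 0/1 verdicts are the illustrative values from the paragraph above, not output from a real model, and the helper names `orm_rewards` and `prm_rewards` are hypothetical.

```python
# Step-level correctness for the five-step example: steps 1-4 right, step 5 wrong.
step_is_correct = [True, True, True, True, False]
final_answer_correct = False  # the slip in step 5 makes the final answer wrong

def orm_rewards(final_ok: bool, num_steps: int) -> list[float]:
    """Outcome reward: one scalar for the whole trajectory, shared by every step."""
    return [1.0 if final_ok else 0.0] * num_steps

def prm_rewards(step_ok: list[bool]) -> list[float]:
    """Process reward: one scalar per step."""
    return [1.0 if ok else 0.0 for ok in step_ok]

print(orm_rewards(final_answer_correct, len(step_is_correct)))  # [0.0, 0.0, 0.0, 0.0, 0.0]
print(prm_rewards(step_is_correct))                             # [1.0, 1.0, 1.0, 1.0, 0.0]
```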
This is the sparse versus dense reward distinction from the diffusion and RL world, now in language. ORM is the single end of sequence reward that made the PPO critic hard to train; PRM is the dense signal that PRIME tried to manufacture implicitly from the log-ratio between a trained model and a reference model (the same quantity that appears in the KL term). A PRM is a learned, explicit version of that dense signal, delivered per step rather than per token and trained to grade reasoning.
What Counts as a Step?
Before you can grade steps you must define one. A step is a contiguous chunk of reasoning that can be judged as a single move. The boundary is a design choice, and the right granularity depends on the domain.
In math, a step is usually one line of the derivation, split on newlines or on sentence boundaries. Consider this solution to "a train travels 60 miles in 1.5 hours, then 40 miles in 0.5 hours, what is its average speed?":
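A reconstruction of that solution as line-level steps, with a minimal newline splitter. The wording of the steps is illustrative (the numbers follow from the problem statement), and `split_steps` is a hypothetical helper rather than part of any library.

```python
solution = (
    "Step 1: total distance = 60 + 40 = 100 miles.\n"
    "Step 2: total time = 1.5 + 0.5 = 2.0 hours.\n"
    "Step 3: average speed = 100 / 2.0 = 50 mph."
)

def split_steps(text: str) -> list[str]:
    """Split a solution into steps on newlines, dropping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

steps = split_steps(solution)
for i, step in enumerate(steps, start=1):
    print(i, step)
```

Each of the three lines is a unit a grader can accept or reject on its own, which is the property the step boundary is meant to provide.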
In code, a step might be a logical block: a function, a loop body, a single transformation. Here a model builds a function to find the maximum of a list, and one step contains a classic bug:
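One way such a solution might look, broken into three steps. This is an illustrative reconstruction, and the bug chosen here (seeding the running maximum with 0, which fails on all-negative lists) is an assumption about which classic bug is meant.

```python
def find_max_buggy(values: list[int]) -> int:
    # Step 1: initialise the running maximum.
    best = 0  # BUG: wrong for lists whose elements are all negative
    # Step 2: scan the list, keeping the largest value seen.
    for v in values:
        if v > best:
            best = v
    # Step 3: return the result.
    return best

def find_max_fixed(values: list[int]) -> int:
    # Step 1 (repaired): seed with the first element instead of 0.
    best = values[0]
    for v in values[1:]:
        if v > best:
            best = v
    return best

print(find_max_buggy([-5, -2, -9]))  # 0, which is not even in the list
print(find_max_fixed([-5, -2, -9]))  # -2
```

A step-level grader should flag step 1 and leave steps 2 and 3 alone; an outcome-only check that runs a test with negative inputs learns only that the function as a whole failed.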
The interview nuance: too fine and steps are unjudgeable (a single token carries no verdict), too coarse and you are back to outcome grading (one step equals the whole solution). The sweet spot is the unit at which a reasonable grader can say "yes, this move was sound" or "no, here is where it went wrong". Most math PRMs use line level steps; the OpenAI PRM800K dataset uses solution lines as steps.
What Does a Step Label Even Mean?
Say a step is "correct". Correct how? There are two definitions, and conflating them is the most common conceptual error in the field.
Definition A, correctness. Is this step true and free of error in isolation? Step 2 above, total time = 2.0 hours, is simply correct arithmetic.
Definition B, value or potential. Does this step move us toward a correct final answer? This is the reinforcement learning definition: a step is good if a strong solver, continuing from here, is likely to reach the right answer. A step can be locally correct yet low value (a true but useless tangent) or locally surprising yet high value (a clever non obvious lemma).
The PRM800K human labels use a three way correctness scheme, positive, neutral, negative. Most automatically labeled PRMs use the value definition, because it is what you can measure without a human: roll out many completions from a step and see how often they succeed. That measured success rate is the step's value, and it connects PRMs straight back to the value functions of your RL blogs.
A PRM that learns value is not checking truth, it is forecasting success. Under this definition a step that is mathematically valid but leads down a dead end gets a low label, and a step that takes a known productive shortcut gets a high one even before the shortcut is justified. This is why PRMs trained on rollouts behave like critics, and why they can reward reasoning that looks unusual but reliably works.
Where Labels Come From Without an Army of Humans
The original PRM (OpenAI, 2023) was trained on 800,000 human step labels, expensive and unscalable. The breakthrough that made PRMs practical is automatic labeling by rollout, the Math-Shepherd method. The idea is pure Monte Carlo, straight from the foundations blog: to estimate a step's value, complete it many times and count successes.
Walk the procedure on a concrete step. The model is partway through a problem and has written steps 1 and 2; we want a label for the partial solution after step 2. Sample, say, 8 independent completions from that prefix and check each final answer against the known ground truth:
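A toy tally for that step. The eight final answers are made up for illustration (six match the ground truth); in practice they would come from sampling the solver from the prefix.

```python
ground_truth = "50"
# Hypothetical final answers extracted from 8 sampled completions of the prefix.
rollout_answers = ["50", "50", "66.7", "50", "50", "40", "50", "50"]

successes = sum(answer == ground_truth for answer in rollout_answers)
value_estimate = successes / len(rollout_answers)
print(successes, len(rollout_answers), value_estimate)  # 6 8 0.75
```

Six of eight rollouts succeed, so the estimated value of the partial solution after step 2 is 0.75.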
Two labeling conventions you will be asked to distinguish. Soft labels keep the fraction 0.75 as a regression target. Hard labels threshold it: any step from which at least one rollout succeeds is labeled good, which marks a step as good if the correct answer is still reachable from it. Math-Shepherd compared both; soft labels generally carry more information.
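The two conventions side by side, as a minimal sketch. The function names are hypothetical; the hard rule is the "at least one rollout succeeds" threshold described above.

```python
def soft_label(successes: int, k: int) -> float:
    """Soft estimation: the success fraction itself is the target."""
    return successes / k

def hard_label(successes: int, k: int) -> int:
    """Hard estimation: 1 if at least one of the k rollouts reached the correct answer."""
    return 1 if successes > 0 else 0

for wins in (6, 1, 0):
    print(wins, soft_label(wins, 8), hard_label(wins, 8))
# 6 0.75 1
# 1 0.125 1
# 0 0.0 0
```

The middle row shows what hard labels discard: a prefix that succeeds once in eight tries and one that succeeds six times receive the same label.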
The crucial efficiency point: you only need a verifier for the final answer, not for the steps. In math the verifier checks the boxed answer against ground truth. In code it runs the unit tests. The expensive thing, judging intermediate reasoning, is bootstrapped entirely from the cheap thing, checking the final outcome. This is how PRM training data is now generated at scale.
Rollout labeling is compute heavy: k completions per step, many steps, many problems. Refinements attack this. Some methods use a binary search over steps to find the first error in O(log n) rollouts instead of labeling every step. Others (the entropy and tree based successors) reuse a shared rollout tree so completions are amortized across steps. The principle stays: trade cheap final answer checks plus compute for expensive human judgment.
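A minimal sketch of the binary-search idea. It assumes a monotonic picture: every prefix before the first error can still reach the right answer, and no prefix that includes the error can. `can_succeed(n)` is a hypothetical callable standing in for "sample k rollouts from the first n steps and report whether any reaches the verified answer"; it is not part of any library.

```python
from typing import Callable, Optional

def first_error_step(num_steps: int, can_succeed: Callable[[int], bool]) -> Optional[int]:
    """Return the 1-based index of the first bad step, or None if the full solution is viable.

    can_succeed(n) is True when rollouts from the first n steps still reach the answer.
    """
    if can_succeed(num_steps):
        return None  # the full solution is still on a viable path
    lo, hi = 0, num_steps  # invariant: prefix of length lo is viable, length hi is not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if can_succeed(mid):
            lo = mid
        else:
            hi = mid
    return hi

# Toy oracle: the first error is at step 6 of 10, so prefixes of length <= 5 are viable.
calls = []
def toy_oracle(n: int) -> bool:
    calls.append(n)
    return n <= 5

print(first_error_step(10, toy_oracle), calls)  # 6 [10, 5, 7, 6]
```

Four prefix evaluations locate the error in a ten-step solution, instead of ten. With noisy rollouts the monotonicity assumption can fail, so this should be read as the idealised version of the technique.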
Training the Grader
With labeled steps in hand, the PRM is usually the base language model with its prediction head replaced by a tiny scalar head. At a designated marker token after each step (a special token, or a fixed delimiter), the model outputs one number, the predicted step value. Training is per step regression or classification:
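For the classification case, the loss can be sketched as a per-step binary cross-entropy, -[y log p + (1 - y) log(1 - p)], averaged over the steps, where p is the sigmoid of the scalar head's output at a step marker and y is that step's label. This is the standard BCE form, offered as an assumption about the intended objective rather than a quote from a specific paper; `step_bce` is a hypothetical name.

```python
import math

def step_bce(logits: list[float], labels: list[float]) -> float:
    """Mean binary cross-entropy over steps.

    logits: raw scalar-head outputs, one per step marker.
    labels: targets in [0, 1], hard (0/1) or soft (success fraction).
    """
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability that the step is good
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(logits)

# Two steps scored confidently good and labeled good, one scored bad and labeled bad.
print(step_bce([2.0, 2.0, -1.0], [1.0, 1.0, 0.0]))
```

Swapping the loss for squared error against the soft label gives the regression variant.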
One design decision interviewers probe: the PRM must score a step using only the prefix up to and including that step, never the future. It is a forward looking judge, like a value function, not a retrospective one. If it could peek at later steps it would be grading outcomes again. The marker token placement enforces this causally: the score at step t sees tokens up to t and no further, exactly the causal masking from the attention blog doing useful work.
At inference the PRM reads a solution and emits a vector of step scores. These must be aggregated into one solution score for ranking. Three standard reductions, each with a different temperament:
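The reductions can be sketched as follows. Which three the sentence above intends is an assumption here: min, product and mean are shown, plus the last-step variant that the cheat sheet also lists. The comments give one reading of each reduction's temperament.

```python
import math

def aggregate(step_scores: list[float], how: str = "min") -> float:
    """Reduce per-step scores (probabilities in [0, 1]) to one solution score."""
    if how == "min":    # strict: a solution is only as good as its weakest step
        return min(step_scores)
    if how == "prod":   # treats steps as independent; penalises long solutions
        return math.prod(step_scores)
    if how == "mean":   # forgiving: one bad step can be averaged away
        return sum(step_scores) / len(step_scores)
    if how == "last":   # trusts the final step's score as the summary
        return step_scores[-1]
    raise ValueError(f"unknown reduction: {how}")

scores = [0.9, 0.2, 0.9]
for how in ("min", "prod", "mean", "last"):
    print(how, round(aggregate(scores, how), 3))
# min 0.2
# prod 0.162
# mean 0.667
# last 0.9
```

On a solution with one weak step, min and product both drag the score down while mean and last largely hide the problem, which is why the choice of reduction matters for reranking.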
Using a PRM at Inference: Search and Reranking
The simplest and most popular use needs no RL at all. Generate many candidate solutions, score each with the PRM, keep the best. This is best-of-N with PRM reranking, and it reliably beats both greedy decoding and majority vote on hard reasoning, because the PRM can reward a correct but rare solution that majority vote would drown out.
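A toy comparison of majority vote and PRM reranking. The candidates and their step scores are made up to show the "correct but rare" case; the min reduction used for ranking is one common choice, not the only one.

```python
from collections import Counter

# Hypothetical candidates: (final answer, per-step PRM scores). Values are made up.
candidates = [
    ("42", [0.9, 0.3, 0.8]),
    ("42", [0.8, 0.2, 0.9]),
    ("42", [0.7, 0.4, 0.6]),
    ("17", [0.9, 0.9, 0.95]),  # rare answer, but every step scores well
]

def majority_vote(cands):
    """Pick the most frequent final answer, ignoring the reasoning."""
    return Counter(ans for ans, _ in cands).most_common(1)[0][0]

def best_of_n(cands):
    """Pick the answer of the candidate whose weakest step scores highest."""
    return max(cands, key=lambda c: min(c[1]))[0]

print(majority_vote(candidates))  # 42
print(best_of_n(candidates))      # 17
```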
But the real power is guiding generation as it unfolds, turning the PRM into the value function of a search. In step level beam search, at each step you expand several candidate next steps, score them all with the PRM, and keep only the top few partial solutions to extend, pruning doomed branches before wasting tokens on them:
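A minimal sketch of step-level beam search. `propose`, `score` and `is_done` are hypothetical callables standing in for the policy (sample candidate next steps), the PRM (score the newest step given the prefix) and a stop check; the toy versions at the bottom exist only to make the sketch runnable.

```python
from typing import Callable

def step_beam_search(
    problem: str,
    propose: Callable[[str, list[str]], list[str]],  # candidate next steps
    score: Callable[[str, list[str]], float],        # PRM score of the newest step
    is_done: Callable[[list[str]], bool],
    beam_width: int = 2,
    max_steps: int = 8,
) -> list[str]:
    """Keep the beam_width best partial solutions at every step, ranked by PRM score."""
    beams: list[tuple[float, list[str]]] = [(1.0, [])]
    for _ in range(max_steps):
        expanded = []
        for prev_score, steps in beams:
            if steps and is_done(steps):
                expanded.append((prev_score, steps))  # carry finished solutions forward
                continue
            for nxt in propose(problem, steps):
                cand = steps + [nxt]
                # min reduction: a partial solution is as good as its weakest step
                expanded.append((min(prev_score, score(problem, cand)), cand))
        if not expanded:
            break
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
        if all(is_done(s) for _, s in beams):
            break
    return beams[0][1]

# Toy stand-ins: two ordinary steps, then an answer step.
def toy_propose(problem, steps):
    return ["answer"] if len(steps) >= 2 else ["good", "bad"]

def toy_score(problem, steps):
    return {"good": 0.9, "bad": 0.1, "answer": 0.95}[steps[-1]]

def toy_done(steps):
    return steps[-1] == "answer"

print(step_beam_search("toy", toy_propose, toy_score, toy_done))
# ['good', 'good', 'answer']
```

Branches that take a "bad" step are capped at 0.1 by the min reduction and fall out of the beam as soon as a better sibling exists, which is the pruning behaviour described above.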
Push this further and you get PRM guided Monte Carlo Tree Search, where the PRM provides the value estimate that MCTS uses to decide which branches to explore, the same MCTS from the agentic RL blog, now with a learned reasoning critic instead of a game simulator. This is the family of methods behind the strong test time compute scaling results: spend more inference compute exploring the tree, let the PRM steer, and accuracy climbs.
PRMs Inside RL Training
Reranking and search use a frozen PRM at inference. The other use puts the PRM inside the training loop as the reward signal, and this is where it connects to everything from your PPO and GRPO blogs. Recall the central pain there: the reward arrives only at the final token, so 499 of 500 tokens get no task signal and credit assignment is slow. A PRM fixes this directly by paying a reward at every step.
Plug this dense reward into PPO or GRPO and intermediate good steps are reinforced immediately, without waiting for the value function to slowly propagate credit from the end. The advantage estimate at step t now reflects whether this step helped, not just whether the whole trajectory eventually succeeded. Math-Shepherd (with step-by-step PPO) and PRIME (with online implicit process rewards) both reported process rewards lifting reasoning performance in this kind of policy-gradient training.
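To see the difference in the learning signal, compare the undiscounted reward-to-go at each step under the two reward schemes. The numbers reuse the five-step example from the first section; subtracting a baseline (the value function in PPO, the group mean in GRPO) would turn these returns into advantages and is omitted here to keep the sketch small.

```python
def reward_to_go(rewards: list[float]) -> list[float]:
    """Undiscounted return from each step onward: G_t = r_t + r_{t+1} + ... + r_T."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running += r
        returns.append(running)
    return returns[::-1]

sparse = [0.0, 0.0, 0.0, 0.0, 0.0]  # ORM: wrong final answer, reward only at the end
dense = [1.0, 1.0, 1.0, 1.0, 0.0]   # PRM: steps 1-4 good, step 5 bad

print(reward_to_go(sparse))  # [0.0, 0.0, 0.0, 0.0, 0.0]
print(reward_to_go(dense))   # [4.0, 3.0, 2.0, 1.0, 0.0]
```

Under the sparse reward every step of the failed trajectory looks identical; under the dense reward the signal differs step by step, and the zero sits exactly at the step that broke.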
But the danger scales with the density. A dense learned reward is a dense attack surface for reward hacking: the policy will find steps that the PRM scores highly but that are not actually good reasoning, the same gaming that the KL penalty guards against for the outcome reward model. The standard defenses transfer directly: keep a KL anchor to a reference policy, refresh or retrain the PRM as the policy shifts (an online PRM, the PRIME approach), and blend the dense PRM signal with a sparse but trustworthy outcome verifier so the final ground truth always has a vote.
The Pitfalls That Get Asked About
PRMs are powerful and brittle in specific, predictable ways. Knowing the failure modes is what separates a surface answer from a deep one.
Reward hacking under the value definition. Because auto labels measure success probability, a PRM can over reward steps that correlate with success rather than cause it: confident phrasing, common formatting, steps that resemble those in easy problems. The policy then learns the surface features, not the reasoning.
Label noise from the verifier. Rollout labeling trusts the final answer checker. If the checker is loose, a wrong path that stumbles into the right numeric answer gets labeled good (false positive), and a correct path that the checker cannot parse gets labeled bad (false negative). Garbage final verification poisons every step label built on it.
Distribution shift during RL. A PRM trained on one policy's rollouts becomes miscalibrated as RL pushes the policy into new regions of reasoning space the PRM never saw, the same staleness that forces target networks and online reward models elsewhere. A frozen PRM slowly stops measuring what it was trained to measure.
The fundamental tension, process versus outcome correctness. The deepest issue. A step can be a valid logical move that happens to lead to a dead end, or an invalid move that luckily reaches the right answer. Process labels and outcome labels genuinely disagree on these cases, and no labeling scheme resolves it perfectly. PRMs that optimize for value can mark sound but unlucky reasoning as bad, subtly teaching the model to avoid legitimate exploration.
The PRMs People Actually Use
The landscape is no longer research demos. A handful of open PRMs are downloadable today and serve as both rerankers and reward sources. They cluster into three families, each a direct descendant of a method from the earlier sections.
Family 1: Monte Carlo labeled, discriminative (the workhorses)
These follow the §4 rollout recipe to make labels, then train a scalar head as in §5. They are the default choice for best-of-N and search.
Qwen2.5-Math-PRM-7B and -72B
Among the strongest open math PRMs at the time of writing. Built on Qwen2.5-Math-Instruct and, per the Qwen team's report, trained on labels produced by consensus filtering: a step label is kept only when Monte Carlo rollout estimation and an LLM-as-a-judge agree on it. (The separately released Qwen2.5-Math-7B-PRM800K is the variant trained on the human PRM800K labels.) The 7B is the practical sweet spot; the 72B has led the ProcessBench leaderboard for error localization.
Skywork-o1-Open-PRM (1.5B and 7B)
Also Qwen2.5-Math based, tuned for both math and code, and notable for strong performance at the tiny 1.5B size, which makes it cheap enough to run as an inline search guide. Ships with vLLM server support out of the box for high throughput best-of-N@64 scoring.
Math-Shepherd-PRM-7B and RLHFlow-PRM (Llama-3.1-8B, Mistral / Deepseek data)
The original automatic-label PRMs. Math-Shepherd defined the rollout labeling recipe; RLHFlow reimplemented it on Llama-3.1 bases with different solution generators. Still common baselines and perfectly usable.
Family 2: Implicit PRMs (no step labels at all)
EurusPRM (Implicit PRM / PRIME lineage)
The cleverest family. Recall the PRIME idea mentioned in §1 and §7: the per-step log-ratio log[π(a|s)/π_ref(a|s)] is itself a reward. Implicit PRM trains only on final outcome labels with a tweaked objective, and the per-step reward falls out for free as that log-ratio, with no step annotation and no rollout labeling needed. EurusPRM-Stage1/2 are the released checkpoints. This is the cheapest PRM to train because it needs the same data an ORM does.
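A sketch of how a per-step reward falls out of token log-probabilities. It assumes you already have, for each response token, its log-probability under the trained model and under the reference model, plus the index where each step ends; the scale `beta`, the function name and all the numbers are illustrative.

```python
def implicit_step_rewards(
    logp_model: list[float],
    logp_ref: list[float],
    step_ends: list[int],
    beta: float = 0.05,
) -> list[float]:
    """Sum beta * (log pi - log pi_ref) over the tokens of each step.

    step_ends holds exclusive end indices, e.g. [3, 5] means tokens 0-2 and 3-4.
    """
    rewards, start = [], 0
    for end in step_ends:
        ratio_sum = sum(m - r for m, r in zip(logp_model[start:end], logp_ref[start:end]))
        rewards.append(beta * ratio_sum)
        start = end
    return rewards

logp_model = [-0.1, -0.2, -0.1, -1.5, -2.0]
logp_ref = [-0.3, -0.4, -0.2, -0.5, -0.6]
rewards = implicit_step_rewards(logp_model, logp_ref, [3, 5], beta=1.0)
print([round(r, 3) for r in rewards])  # [0.5, -2.4]
```

The first step, where the trained model is more confident than the reference, gets a positive reward; the second, where it is less confident, gets a negative one.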
Family 3: Generative / reasoning PRMs (the frontier)
Generative PRMs (R-PRM, PRM-CoT and successors)
Instead of emitting a bare scalar, these write a short critique of each step and then judge it, so the verdict comes with a rationale. They turn step grading into a reasoning task the model is already good at, tend to generalize better off-distribution, and can be sampled multiple times and majority-voted. The cost is inference latency: you generate text to score text.
Two benchmarks dominate. ProcessBench measures whether a PRM can locate the first wrong step in a solution. PRMBench stress-tests finer abilities like catching redundant-but-correct steps and resisting confident-sounding errors. A recurring finding worth remembering: a high best-of-N score does not guarantee good error localization, and vice versa, so pick the model by the axis your application needs.
Training and Running One in Practice
This section is the hands-on part: the exact data shape, a Monte Carlo labeling sketch, a training run with Hugging Face TRL, and inference with the Qwen step-token convention plus a vLLM note. Everything here runs on real, current tooling.
Step 1: the data format
A PRM training example is a problem, a list of step strings, and a parallel list of step labels. TRL's PRMTrainer expects exactly three columns: prompt, completions (the list of steps), and labels (one per step). The documented label type is bool; if you have soft labels such as 0.75 from rollouts, check how your TRL version handles floats before relying on them, since the stock trainer treats the head as a two-class classifier and you may need to threshold first. A single real-shaped row:
{
  "prompt": "Weng earns $12/hr babysitting. Yesterday she did 50 min. How much did she earn?",
  "completions": [
    "Step 1: 50 minutes is 50/60 = 5/6 of an hour.",
    "Step 2: She earns 12 * (5/6) = 10 dollars.",
    "Step 3: So Weng earned $10. The answer is 10."
  ],
  "labels": [true, true, true]   # one label per step
}
A negative example has the error step flagged. Human annotation in the PRM800K style stops labeling at the first error, because later steps are too ill-defined to judge; in the parallel-list format used here every step needs a label, so the steps after the error are marked false as well:
{
  "prompt": "A train goes 60 mi in 1.5 h then 40 mi in 0.5 h. Average speed?",
  "completions": [
    "Step 1: total distance = 60 + 40 = 100 miles.",
    "Step 2: total time = 1.5 - 0.5 = 1.0 hours.",   # wrong: should be +
    "Step 3: average = 100 / 1.0 = 100 mph."
  ],
  "labels": [true, false, false]
}
Step 2: generate labels by Monte Carlo rollout
If you do not have human labels, manufacture them with the §4 recipe: from each step prefix, sample completions and score the value as the fraction that reach the verified answer. The core loop, vLLM for fast sampling:
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Math-7B-Instruct")  # the completer / solver
params = SamplingParams(n=8, temperature=0.8, max_tokens=512)

def label_steps(problem, steps, ground_truth):
    """Soft-label every step prefix by the fraction of rollouts that verify.

    verify(text, ground_truth) -> bool is a final-answer checker you supply.
    """
    labels = []
    for i in range(len(steps)):
        prefix = problem + "\n" + "\n".join(steps[:i + 1])
        outs = llm.generate([prefix], params)[0].outputs  # 8 rollouts from this prefix
        wins = sum(verify(o.text, ground_truth) for o in outs)
        labels.append(wins / len(outs))  # soft label = success fraction, e.g. 6/8 = 0.75
    return labels  # or threshold to hard bools
Only verify(), a final-answer checker (boxed-answer match for math, unit tests for code), is needed. The expensive step judgment is bootstrapped from the cheap outcome check, exactly the §4 efficiency point in code.
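The labeling loop leaves `verify` undefined. One minimal version for math is sketched below, assuming the solver wraps its final answer in `\boxed{...}` and that an exact string match after whitespace stripping is acceptable. Real checkers usually normalise fractions, units and LaTeX more carefully; `extract_boxed` is a hypothetical helper.

```python
import re
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text (no nested braces), else None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verify(completion: str, ground_truth: str) -> bool:
    """True when the last boxed answer equals the ground truth as a string."""
    answer = extract_boxed(completion)
    return answer is not None and answer == ground_truth.strip()

print(verify("... so the speed is \\boxed{50}.", "50"))   # True
print(verify("... so the speed is \\boxed{100}.", "50"))  # False
print(verify("the answer is 50", "50"))                   # False: nothing boxed
```

The third call is the false-negative case from the pitfalls section: a correct path that the checker cannot parse is counted as a failure, and that error flows into every step label built on it.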
Step 3: train the PRM with TRL
With a labeled dataset in the three-column format, training is a few lines. TRL inserts a separator token after each step and applies the per-step BCE loss from §5 automatically:
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
# Import path may differ by TRL version; check the PRMTrainer docs for yours.
from trl import PRMTrainer, PRMConfig

model_id = "Qwen/Qwen2.5-Math-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
# 2 labels = {bad step, good step}; the head is a per-token classifier
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=2)
dataset = load_dataset("trl-lib/math_shepherd")  # prompt / completions / labels

cfg = PRMConfig(
    output_dir="my-math-prm",
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    num_train_epochs=1,
    step_separator="\n",  # where a step ends; a reward token is placed here
    max_completion_length=1024,
)
trainer = PRMTrainer(
    model=model,
    args=cfg,
    train_dataset=dataset["train"],
    processing_class=tok,
)
trainer.train()
Under the hood TRL tokenizes each step, appends a reward token, masks the loss so only those reward-token positions contribute, and minimizes BCE against your labels. The same recipe runs from the command line on the bundled script:
accelerate launch examples/scripts/prm.py \
    --model_name_or_path Qwen/Qwen2-0.5B \
    --dataset_name trl-lib/math_shepherd \
    --num_train_epochs 1 \
    --output_dir Qwen2-0.5B-Reward-Math-Shepherd
Step 4: run it for scoring
Inference convention differs by model, and the one to know is Qwen's, because it is the strongest and the most copied. Steps are joined by a special <extra_0> token, and the reward for each step is the probability that its <extra_0> position is classified positive:
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "Qwen/Qwen2.5-Math-PRM-7B"
tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval()

steps = [
    "Step 1: 50 minutes is 5/6 of an hour.",
    "Step 2: 12 * 5/6 = 10 dollars.",
    "Step 3: The answer is \\boxed{10}.",
]
messages = [
    {"role": "system", "content": "Please reason step by step..."},
    {"role": "user", "content": "Weng earns $12/hr..."},
    # every step, including the last, is followed by the <extra_0> separator
    {"role": "assistant", "content": "<extra_0>".join(steps) + "<extra_0>"},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
ids = tok.encode(text, return_tensors="pt").to(model.device)

with torch.no_grad():  # scoring only, no gradients needed
    logits = model(input_ids=ids)[0]  # (1, seq_len, 2): {negative, positive}

sep_id = tok.encode("<extra_0>")[0]
mask = ids == sep_id  # True at each step separator
rewards = F.softmax(logits, dim=-1)[mask][:, 1]  # P(positive) at each separator
print(rewards.tolist())  # one probability in [0, 1] per step
That output vector, one number per step, is the dense signal everything else consumes: aggregate it by min for best-of-N reranking (§6), feed it branch by branch into a beam or MCTS search (§6), or pipe it as the per-step reward into a GRPO loop (§7). For production scoring at scale, serve the PRM behind a vLLM server and batch hundreds of candidate solutions per call; Skywork and Qwen PRMs both run this way, which is how best-of-64 stays affordable.
Match the step separator to how the solutions were generated (Qwen-Math uses double newlines between steps). Keep the PRM's base model close to the policy's distribution or scores drift. And if you are doing implicit-PRM style training, you do not run any of Step 2 at all: you train on outcome labels only and read the per-step reward off the log-ratio, the cheapest path when step annotation is infeasible.
The Interview Cheat Sheet
| Question | The crisp answer |
|---|---|
| ORM vs PRM in one line | ORM scores the final answer; PRM scores every step, giving dense local credit assignment |
| What is a step | a checkable chunk of reasoning, usually a line in math or a block in code; fine enough to localize blame, coarse enough to carry a verdict |
| What does a label mean | either correctness (true in isolation) or value (probability of eventually reaching the right answer); auto labeling uses value |
| Where labels come from | rollout (Math-Shepherd): complete each prefix k times, label by fraction that reach the verified correct answer; only a final answer checker needed |
| Soft vs hard labels | soft keeps the success fraction as a regression target; hard thresholds it (good if any rollout succeeds) |
| How it is trained | base LLM with a scalar head, per step BCE or MSE against labels, scored causally at a marker token using only the prefix |
| Aggregation | min (strict, the usual default), product, mean, or last step |
| Inference uses | best-of-N reranking, step level beam search, PRM guided MCTS; the basis of test time compute scaling |
| Training use | dense per step reward inside PPO or GRPO, fixing the sparse terminal reward problem |
| Main risks | reward hacking, verifier label noise, distribution shift during RL, process versus outcome disagreement |
The three threads that connect it all
A PRM is a value function. Its auto labels are literally the probability of eventual success from a partial solution, the V(s) of the RL blogs. That is why it can serve as a search heuristic and a dense reward interchangeably.
Its labels come from Monte Carlo. Rollout labeling is exactly the wait-for-the-ending averaging of the foundations blog, applied to a prefix instead of a full episode.
It fixes the sparse reward problem. Everything painful about the single end of sequence reward in PPO and GRPO, slow credit assignment, untrainable critic, is what the dense PRM signal directly addresses, at the price of a richer reward to hack.
A process reward model grades reasoning step by step instead of judging only the final answer, turning a sparse terminal reward into dense local feedback. A step is a checkable chunk; a step label is best understood as its value, the probability that continuing from here reaches the correct answer, which is why labels can be generated automatically by rolling out each prefix many times and counting verified successes, no human step annotation required. The PRM is then a base model with a scalar head trained to predict those labels causally, and at inference it reranks candidates, guides beam search and MCTS, or feeds a dense reward into PPO and GRPO. It is, underneath, a learned value function for reasoning, with all the power and all the reward hacking risk that implies.