Process Reward Models: Grading the Reasoning, Not Just the Answer

The Core Distinction: Outcome vs Process

A reward model scores model outputs. There are two places to put the score. An Outcome Reward Model (ORM) reads the whole solution and emits one number: right or wrong, good or bad. A Process Reward Model (PRM) reads the solution step by step and emits one number per step: was this line a correct, helpful move from here?

Why the difference matters, made concrete. A model solves a math problem in five steps, gets steps 1 to 4 perfectly right, makes an arithmetic slip in step 5, and lands on the wrong answer. The ORM sees a wrong answer and assigns reward 0 to the entire trajectory, punishing the four good steps along with the one bad one. The PRM assigns roughly +1, +1, +1, +1, then 0, isolating the blame to exactly where the reasoning broke.
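The five-step example can be sketched in a few lines. This is an illustrative toy, not a trained model: the per-step verdicts are hard-coded, and the function names orm_rewards and prm_rewards are hypothetical.

```python
def orm_rewards(step_ok, answer_correct):
    """Outcome grading: one terminal verdict applied to the whole trajectory."""
    return [1.0 if answer_correct else 0.0] * len(step_ok)

def prm_rewards(step_ok):
    """Process grading: one verdict per step."""
    return [1.0 if ok else 0.0 for ok in step_ok]

# Steps 1 to 4 are sound, step 5 contains the arithmetic slip.
step_ok = [True, True, True, True, False]

orm = orm_rewards(step_ok, answer_correct=False)
prm = prm_rewards(step_ok)
first_bad = prm.index(0.0) + 1  # 1-based index of the first failing step

print(orm)        # [0.0, 0.0, 0.0, 0.0, 0.0]
print(prm)        # [1.0, 1.0, 1.0, 1.0, 0.0]
print(first_bad)  # 5
```

The ORM vector carries no information about where the failure happened; the PRM vector does.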

Same Solution, Two Graders

A five step derivation with one bad step. The ORM collapses everything to a single terminal verdict, so the credit assignment problem is brutal: the learner cannot tell which step to fix. The PRM hands back a dense vector of per step verdicts, pinpointing the failure at step 5 and certifying steps 1 to 4 as good. Dense, local feedback is the entire value proposition.

Where you have seen this exact shape before

This is the sparse versus dense reward distinction from the diffusion and RL world, now in language. ORM is the single end of sequence reward that made the PPO critic hard to train; PRM is the dense per token signal that PRIME tried to manufacture from the KL term. A PRM is a learned, explicit version of that dense signal, trained to grade reasoning.

What Counts as a Step?

Before you can grade steps you must define one. A step is a contiguous chunk of reasoning that can be judged as a single move. The boundary is a design choice, and the right granularity depends on the domain.

In math, a step is usually one line of the derivation, split on newlines or on sentence boundaries. Consider this solution to "a train travels 60 miles in 1.5 hours, then 40 miles in 0.5 hours, what is its average speed?":

\begin{aligned}&\text{step 1: total distance} = 60 + 40 = 100 \text{ miles}\\&\text{step 2: total time} = 1.5 + 0.5 = 2.0 \text{ hours}\\&\text{step 3: average speed} = 100 / 2.0 = 50 \text{ mph}\end{aligned}

Three steps, each a self contained, checkable claim. A PRM scores each line given the question and the lines above it.
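A minimal sketch of line-level step splitting, assuming one step per non-empty line. The helper names split_steps and scoring_inputs are hypothetical; the point is only that the input for step t contains the question and steps 1 to t.

```python
def split_steps(solution):
    """Split a solution into line-level steps, dropping blank lines."""
    return [line.strip() for line in solution.splitlines() if line.strip()]

def scoring_inputs(question, steps):
    """For step t, the grader sees the question plus steps 1..t and nothing later."""
    return [(question, steps[: t + 1]) for t in range(len(steps))]

question = ("A train travels 60 miles in 1.5 hours, then 40 miles in 0.5 hours. "
            "What is its average speed?")
solution = """total distance = 60 + 40 = 100 miles
total time = 1.5 + 0.5 = 2.0 hours
average speed = 100 / 2.0 = 50 mph"""

steps = split_steps(solution)
inputs = scoring_inputs(question, steps)
for _, prefix in inputs:
    print(len(prefix), "step(s) visible; grading:", prefix[-1])
```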

In code, a step might be a logical block: a function, a loop body, a single transformation. Here a model builds a function to find the maximum of a list, and one step contains a classic bug:

\texttt{step 1: def find\_max(xs): m = xs[0]}\;\;\checkmark\\ \texttt{step 2: for x in xs[1:]: if x > m: m = x}\;\;\checkmark\\ \texttt{step 3: return m if xs else None}\;\;\;\text{(dead check: xs[0] already crashed on empty)}

Steps 1 and 2 are correct. Step 3's guard is logically dead because step 1 already indexes xs[0]. A PRM trained on code can flag step 3 as a non-improving or buggy move.
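Written out as runnable Python, the three steps behave as described. The fixed variant is one possible repair, added here for illustration only.

```python
def find_max_buggy(xs):
    m = xs[0]                      # step 1: raises IndexError on an empty list
    for x in xs[1:]:               # step 2: standard running maximum
        if x > m:
            m = x
    return m if xs else None       # step 3: the None branch can never be reached

def find_max_fixed(xs):
    if not xs:                     # guard moved ahead of the first index
        return None
    m = xs[0]
    for x in xs[1:]:
        if x > m:
            m = x
    return m

try:
    find_max_buggy([])
    crashed = False
except IndexError:
    crashed = True

print(crashed)                     # True
print(find_max_buggy([3, 9, 2]))   # 9
print(find_max_fixed([]))          # None
```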

The interview nuance: too fine and steps are unjudgeable (a single token carries no verdict), too coarse and you are back to outcome grading (one step equals the whole solution). The sweet spot is the unit at which a reasonable grader can say "yes, this move was sound" or "no, here is where it went wrong". Most math PRMs use line level steps; the OpenAI PRM800K dataset uses solution lines as steps.

Granularity: the Same Solution Cut Three Ways

Token level (top): too fine, each piece is meaningless on its own. Step level (middle): the useful unit, each chunk is a checkable claim. Whole solution (bottom): too coarse, this is just an ORM. The middle cut is what makes process supervision possible: chunks big enough to carry a verdict, small enough to localize blame.

What Does a Step Label Even Mean?

Say a step is "correct". Correct how? There are two definitions, and conflating them is the most common conceptual error in the field.

Definition A, correctness. Is this step true and free of error in isolation? Step 2 above, total time = 2.0 hours, is simply correct arithmetic.

Definition B, value or potential. Does this step move us toward a correct final answer? This is the reinforcement learning definition: a step is good if a strong solver, continuing from here, is likely to reach the right answer. A step can be locally correct yet low value (a true but useless tangent) or locally surprising yet high value (a clever non obvious lemma).

The PRM800K human labels use a three way correctness scheme, positive, neutral, negative. Most automatically labeled PRMs use the value definition, because it is what you can measure without a human: roll out many completions from a step and see how often they succeed. That measured success rate is the step's value, and it connects PRMs straight back to the value functions of your RL blogs.

\text{label}(s_t) \;\approx\; V(s_t) \;=\; \mathbb{P}\big(\text{reach correct answer} \mid \text{prefix up to step } t\big)

The value definition of a step label is exactly a state value from the RL blogs: the probability of eventual success from this partial solution.

The subtlety that trips people up

A PRM that learns value is not checking truth, it is forecasting success. Under this definition a step that is mathematically valid but leads down a dead end gets a low label, and a step that takes a known productive shortcut gets a high one even before the shortcut is justified. This is why PRMs trained on rollouts behave like critics, and why they can reward reasoning that looks unusual but reliably works.

Where Labels Come From Without an Army of Humans

The original PRM (OpenAI, 2023) was trained on 800,000 human step labels, expensive and unscalable. The breakthrough that made PRMs practical is automatic labeling by rollout, the Math-Shepherd method. The idea is pure Monte Carlo, straight from the foundations blog: to estimate a step's value, complete it many times and count successes.

Walk the procedure on a concrete step. The model is partway through a problem and has written steps 1 and 2; we want a label for the partial solution after step 2. Sample, say, 8 independent completions from that prefix and check each final answer against the known ground truth:

\text{8 rollouts from prefix} \to \{\checkmark,\checkmark,\times,\checkmark,\checkmark,\times,\checkmark,\checkmark\} \;\Rightarrow\; \text{label} = \tfrac{6}{8} = 0.75

Soft estimation (SE): the step's value is the fraction of completions that reach the correct answer. No human needed, only a verifier for the final answer.

Two labeling conventions you will be asked to distinguish. Soft labels (soft estimation, SE) keep the fraction 0.75 as a regression target. Hard labels (hard estimation, HE) threshold it: any step from which at least one rollout succeeds is labeled good, which marks a step as good if the correct answer is still reachable from it. Math-Shepherd compared both; soft labels carry more information per step, while hard labels let the PRM train with a plain cross entropy loss.
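Both conventions reduce to one line each once the rollout outcomes are in hand. A small sketch, with the eight outcomes from the example above hard-coded and hypothetical function names:

```python
def soft_label(outcomes):
    """Fraction of rollouts that reached the verified answer."""
    return sum(outcomes) / len(outcomes)

def hard_label(outcomes):
    """Good if at least one rollout reached the verified answer."""
    return any(outcomes)

# The eight rollouts from the worked example: six successes, two failures.
outcomes = [True, True, False, True, True, False, True, True]
dead_end = [False] * 8   # a prefix from which no rollout recovers

print(soft_label(outcomes))   # 0.75
print(hard_label(outcomes))   # True
print(soft_label(dead_end))   # 0.0
print(hard_label(dead_end))   # False
```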

The crucial efficiency point: you only need a verifier for the final answer, not for the steps. In math the verifier checks the boxed answer against ground truth. In code it runs the unit tests. The expensive thing, judging intermediate reasoning, is bootstrapped entirely from the cheap thing, checking the final outcome. This is how PRM training data is now generated at scale.

Auto-Labeling by Rollout: Monte Carlo on a Prefix

From the prefix after step 2, eight completions branch out, each run to a final answer and checked by the verifier. Six reach the correct answer, two fail. The step's value label is 6/8 = 0.75. Repeat this at every step of many solutions and you have a fully labeled PRM training set, built from nothing but a final answer checker. The more sharply this fraction changes from one step to the next, the more decisive that step was.

The honest cost and its fixes

Rollout labeling is compute heavy: k completions per step, many steps, many problems. Refinements attack this. Some methods use a binary search over steps to find the first error in O(log n) rollouts instead of labeling every step. Others (the entropy and tree based successors) reuse a shared rollout tree so completions are amortized across steps. The principle stays: trade cheap final answer checks plus compute for expensive human judgment.
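The binary search idea can be sketched as follows. It rests on an assumption worth stating: prefixes before the first error can still succeed and prefixes containing it cannot, so the pass/fail pattern is monotone. The oracle prefix_ok is hypothetical; in practice it would be a batch of rollouts plus the final answer verifier, and noisy rollouts can break the monotonicity.

```python
def first_error_step(num_steps, prefix_ok):
    """Return the 1-based index of the first bad step, or None if every prefix passes.

    prefix_ok(t) -> bool reports whether the prefix ending at step t can still
    reach the correct answer. Assumed monotone: True ... True False ... False.
    """
    lo, hi = 1, num_steps
    first_bad = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if prefix_ok(mid):
            lo = mid + 1          # any error lies after mid
        else:
            first_bad = mid       # mid is bad; look for an earlier bad step
            hi = mid - 1
    return first_bad

calls = []

def fake_oracle(t, true_first_error=6):
    """Stand-in for rollouts: prefixes before step 6 succeed, the rest fail."""
    calls.append(t)
    return t < true_first_error

print(first_error_step(10, fake_oracle))  # 6
print(calls)                              # [5, 8, 6]
```

Three oracle calls locate the error in a ten step solution, against ten calls for labeling every step.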

Training the Grader

With labeled steps in hand, the PRM is usually the base language model with its prediction head replaced by a tiny scalar head. At a designated marker token after each step (a special token, or a fixed delimiter), the model outputs one number, the predicted step value. Training is per step regression or classification:

\mathcal{L}_{PRM} = \sum_t \text{BCE}\big(r_\theta(q, s_{1:t}),\; y_t\big),\qquad r_\theta \in [0,1]

q is the question, s_{1:t} the steps so far, y_t the label from §4. Binary cross entropy for hard labels, MSE for soft. The model reads the prefix and grades the latest step.
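To make the loss concrete, here is a pure Python sketch of the per step BCE sum on one three step solution. The scores and labels are made-up numbers; a real implementation would compute this on tensors inside the training framework.

```python
import math

def bce(pred, target, eps=1e-7):
    """Binary cross entropy for one step score in [0, 1]."""
    pred = min(max(pred, eps), 1 - eps)   # clamp to avoid log(0)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

def prm_loss(step_scores, step_labels):
    """Sum of per step BCE terms, matching the formula above."""
    return sum(bce(r, y) for r, y in zip(step_scores, step_labels))

labels = [1.0, 1.0, 0.0]                           # steps 1 and 2 good, step 3 bad
loss_good = prm_loss([0.9, 0.8, 0.3], labels)      # grader doubts step 3
loss_bad = prm_loss([0.9, 0.8, 0.9], labels)       # grader trusts the bad step

print(loss_bad > loss_good)   # True
```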

One design decision interviewers probe: the PRM must score a step using only the prefix up to and including that step, never the future. It is a forward looking judge, like a value function, not a retrospective one. If it could peek at later steps it would be grading outcomes again. The marker token placement enforces this causally: the score at step t sees tokens up to t and no further, exactly the causal masking from the attention blog doing useful work.

At inference the PRM reads a solution and emits a vector of step scores. These must be aggregated into one solution score for ranking. Three standard reductions, each with a different temperament:

\text{score}(\text{sol}) \in \Big\{\;\underbrace{\min_t r_t}_{\text{any bad step ruins it}},\;\; \underbrace{\prod_t r_t}_{\text{compounding doubt}},\;\; \underbrace{r_T \text{ or } \tfrac{1}{T}\textstyle\sum_t r_t}_{\text{last / mean}}\;\Big\}

min is the popular default: a solution is only as trustworthy as its weakest step, which matches the intuition that one fatal error sinks a proof.

From Step Scores to One Number

Two candidate solutions to the same problem, each a row of per step PRM scores. Solution A is uniformly strong; solution B is excellent until a single weak step. Under the min reduction, B's one weak step drags its solution score below A, correctly preferring the consistently sound derivation. The choice of reduction is a knob: min is strict about the single worst step, mean is forgiving, and product compounds every imperfect step, so it is never higher than min and it penalizes long solutions.
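A short sketch of the reductions on two made-up score rows shaped like solutions A and B. The numbers are illustrative only; the point is that the choice of reduction can flip the ranking.

```python
import math

def aggregate(scores, how="min"):
    """Reduce a list of per step scores to one solution score."""
    if how == "min":
        return min(scores)
    if how == "prod":
        return math.prod(scores)
    if how == "mean":
        return sum(scores) / len(scores)
    if how == "last":
        return scores[-1]
    raise ValueError(f"unknown reduction: {how}")

sol_a = [0.85, 0.85, 0.85, 0.85]   # uniformly strong
sol_b = [0.99, 0.99, 0.30, 0.99]   # excellent except for one weak step

winners = {}
for how in ("min", "prod", "mean", "last"):
    a, b = aggregate(sol_a, how), aggregate(sol_b, how)
    winners[how] = "A" if a > b else "B"

print(winners)   # {'min': 'A', 'prod': 'A', 'mean': 'A', 'last': 'B'}
```

Only the last step reduction prefers B here, because it ignores the weak third step entirely.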

Using a PRM at Inference: Search and Reranking

The simplest and most popular use needs no RL at all. Generate many candidate solutions, score each with the PRM, keep the best. This is best-of-N with PRM reranking, and with a well calibrated PRM it often beats both greedy decoding and majority vote on hard reasoning, because the PRM can reward a correct but rare solution that majority vote would drown out.
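A toy comparison of majority vote and PRM reranking. The candidates and their step scores are invented for illustration, and the min reduction is one choice among several.

```python
from collections import Counter

def majority_vote(candidates):
    """Pick the most frequent final answer, ignoring the reasoning."""
    counts = Counter(c["answer"] for c in candidates)
    return counts.most_common(1)[0][0]

def best_of_n(candidates):
    """Pick the answer of the candidate whose weakest step scores highest."""
    best = max(candidates, key=lambda c: min(c["step_scores"]))
    return best["answer"]

# Two candidates agree on a wrong answer; one rare candidate is right.
candidates = [
    {"answer": "48", "step_scores": [0.90, 0.40, 0.50]},
    {"answer": "48", "step_scores": [0.80, 0.30, 0.60]},
    {"answer": "50", "step_scores": [0.95, 0.90, 0.92]},
]

print(majority_vote(candidates))  # 48
print(best_of_n(candidates))      # 50
```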

But the real power is guiding generation as it unfolds, turning the PRM into the value function of a search. In step level beam search, at each step you expand several candidate next steps, score them all with the PRM, and keep only the top few partial solutions to extend, pruning doomed branches before wasting tokens on them:

\text{at each step: expand } k \text{ candidates} \to \text{PRM scores} \to \text{keep top } b \to \text{extend}

The PRM is the heuristic that ranks partial solutions, exactly the role a value estimate plays in any guided tree search.
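The loop above can be sketched with stand-ins for the two model calls. Here propose and score are hypothetical placeholders: in a real system propose would sample next steps from the policy and score would call the PRM on the partial solution.

```python
def beam_search(propose, score, num_steps, expand_k=3, beam_b=2):
    """Step level beam search: expand, score partial solutions, keep the top b."""
    beams = [[]]                                   # start from the empty partial solution
    for _ in range(num_steps):
        candidates = []
        for prefix in beams:
            for step in propose(prefix)[:expand_k]:
                candidates.append(prefix + [step])
        candidates.sort(key=score, reverse=True)   # highest score first
        beams = candidates[:beam_b]                # prune everything else
    return beams[0]

def toy_propose(prefix):
    """Stand-in for the policy: always offers the same three next steps."""
    return [1, 2, 3]

def toy_score(prefix):
    """Stand-in for the PRM: fraction of steps equal to 2."""
    return sum(1.0 for s in prefix if s == 2) / len(prefix)

best = beam_search(toy_propose, toy_score, num_steps=3)
print(best)  # [2, 2, 2]
```

With expand_k = 3 and beam_b = 2, each round scores at most six partial solutions instead of the full tree.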

Push this further and you get PRM guided Monte Carlo Tree Search, where the PRM provides the value estimate that MCTS uses to decide which branches to explore, the same MCTS from the agentic RL blog, now with a learned reasoning critic instead of a game simulator. This is the family of methods behind the strong test time compute scaling results: spend more inference compute exploring the tree, let the PRM steer, and accuracy climbs.

PRM as the Compass of a Search Tree

A reasoning tree growing step by step. At each frontier the PRM scores candidate next steps; high scoring branches (green) are expanded, low scoring ones (red) are pruned before they waste compute. The bold path is the surviving best line of reasoning. Without the PRM the tree explodes uniformly; with it, compute flows toward the branches most likely to succeed. More search plus a good PRM equals higher accuracy, the test time scaling story.

PRMs Inside RL Training

Reranking and search use a frozen PRM at inference. The other use puts the PRM inside the training loop as the reward signal, and this is where it connects to everything from your PPO and GRPO blogs. Recall the central pain there: the reward arrives only at the final token, so 499 of 500 tokens get no task signal and credit assignment is slow. A PRM fixes this directly by paying a reward at every step.

\tilde r_t = \underbrace{r_\theta^{PRM}(q, s_{1:t})}_{\text{dense, per step}} \quad\text{instead of}\quad \tilde r_t = \underbrace{R(\text{answer})\cdot \mathbf{1}[t = T]}_{\text{sparse, terminal only}}

The PRM turns a sparse terminal reward into a dense per step reward, which is exactly what PRIME approximated and what the value function bootstrap struggled to spread backward.

Plug this dense reward into PPO or GRPO and intermediate good steps are reinforced immediately, without waiting for the value function to slowly propagate credit from the end. The advantage estimate at step t now reflects whether this step helped, not just whether the whole trajectory eventually succeeded. Math-Shepherd (with step by step PPO) and PRIME (with an implicit PRM updated online) both reported process rewards lifting reasoning performance in RL training.

But the danger scales with the density. A dense learned reward is a dense attack surface for reward hacking: the policy will find steps that the PRM scores highly but that are not actually good reasoning, the same gaming that the KL penalty guards against for the outcome reward model. The standard defenses transfer directly: keep a KL anchor to a reference policy, refresh or retrain the PRM as the policy shifts (an online PRM, the PRIME approach), and blend the dense PRM signal with a sparse but trustworthy outcome verifier so the final ground truth always has a vote.
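One way to picture the blending defense is below. The additive form and the prm_weight value are assumptions made for this sketch, not the formula of any particular paper; the KL anchor is left out for brevity.

```python
def blended_rewards(prm_scores, outcome_correct, prm_weight=0.5):
    """Dense PRM reward at every step, plus a verifier bonus on the last step."""
    rewards = [prm_weight * r for r in prm_scores]
    rewards[-1] += 1.0 if outcome_correct else 0.0
    return rewards

def returns_to_go(rewards, gamma=1.0):
    """Discounted sum of future rewards from each step onward."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

prm_scores = [0.9, 0.8, 0.2]                       # PRM doubts the last step
wrong = blended_rewards(prm_scores, outcome_correct=False)
right = blended_rewards(prm_scores, outcome_correct=True)

print(returns_to_go(wrong))
print(returns_to_go(right))
```

Because the verifier bonus sits on the final step, a trajectory that games the PRM but misses the answer still earns less than one that reaches it.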

Sparse Outcome Reward vs Dense PRM Reward in Training

Top: the ORM world, one reward at the end, every intermediate step learning only through slow value bootstrapping. Bottom: the PRM world, a reward attached to every step, so the gradient at each step reflects that step's own contribution. Faster, sharper credit assignment, at the cost of a richer reward to hack, which is why the KL anchor and an outcome verifier stay in the loop.

The Pitfalls That Get Asked About

PRMs are powerful and brittle in specific, predictable ways. Knowing the failure modes is what separates a surface answer from a deep one.

Reward hacking under the value definition. Because auto labels measure success probability, a PRM can over reward steps that correlate with success rather than cause it: confident phrasing, common formatting, steps that resemble those in easy problems. The policy then learns the surface features, not the reasoning.

Label noise from the verifier. Rollout labeling trusts the final answer checker. If the checker is loose, a wrong path that stumbles into the right numeric answer gets labeled good (false positive), and a correct path that the checker cannot parse gets labeled bad (false negative). Garbage final verification poisons every step label built on it.

Distribution shift during RL. A PRM trained on one policy's rollouts becomes miscalibrated as RL pushes the policy into new regions of reasoning space the PRM never saw, the same staleness that forces target networks and online reward models elsewhere. A frozen PRM slowly stops measuring what it was trained to measure.

The fundamental tension, process versus outcome correctness. The deepest issue. A step can be a valid logical move that happens to lead to a dead end, or an invalid move that luckily reaches the right answer. Process labels and outcome labels genuinely disagree on these cases, and no labeling scheme resolves it perfectly. PRMs that optimize for value can mark sound but unlucky reasoning as bad, subtly teaching the model to avoid legitimate exploration.

When Process and Outcome Disagree

Four solutions sorted into a two by two grid: process correct or not, outcome correct or not. The easy diagonal (both right, both wrong) is uncontroversial. The off diagonal is where PRMs struggle: a logically sound path that dead ends, and a flawed path that gets lucky. Value based labels reward the flawed path that got lucky and punish the sound path that dead ended, the exact tension no automatic labeler escapes.

The PRMs People Actually Use

The landscape is no longer research demos. A handful of open PRMs are downloadable today and serve as both rerankers and reward sources. They cluster into three families, each a direct descendant of a method from the earlier sections.

Family 1: Monte Carlo labeled, discriminative (the workhorses)

These follow the §4 rollout recipe to make labels, then train a scalar head as in §5. They are the default choice for best-of-N and search.

Qwen2.5-Math-PRM-7B and -72B

Among the strongest open math PRMs at the time of writing. Built on Qwen2.5-Math-Instruct and trained on Monte Carlo rollout labels that are cross-checked by an LLM judge, keeping only examples on which the two sources agree (consensus filtering). The 7B is the practical sweet spot; the 72B reported the strongest open PRM results on ProcessBench for error localization.

Skywork-o1-Open-PRM (1.5B and 7B)

Also Qwen2.5-Math based, tuned for both math and code, and notable for strong performance at the tiny 1.5B size, which makes it cheap enough to run as an inline search guide. Ships with vLLM server support out of the box for high throughput best-of-N@64 scoring.

Math-Shepherd-PRM-7B and RLHFlow-PRM (Llama-3.1-8B, Mistral / Deepseek data)

The original automatic-label PRMs. Math-Shepherd defined the rollout labeling recipe; RLHFlow reimplemented it on Llama-3.1-8B bases, with training solutions generated by Mistral and Deepseek models (hence the checkpoint names). Still common baselines and perfectly usable.

Family 2: Implicit PRMs (no step labels at all)

EurusPRM (Implicit PRM / PRIME lineage)

The cleverest family. Recall the PRIME idea from §7: the per-step log-ratio log[π(a|s)/π_ref(a|s)], scaled by a coefficient β, is itself a reward. Implicit PRM trains only on final outcome labels, with the outcome reward parameterized as that log-ratio, and the per-step reward then falls out for free: no step annotation and no rollout labeling needed. EurusPRM-Stage1/2 are the released checkpoints. This is the cheapest PRM to train because it needs the same data an ORM does.
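A sketch of how the implicit reward is read off, assuming you already have per-token log-probabilities from the trained model and the reference model. The numbers, the step boundaries, and the beta value are made up; the scaling coefficient β follows the implicit PRM formulation, and the function name is hypothetical.

```python
def implicit_step_rewards(logp_policy, logp_ref, step_ends, beta=0.1):
    """Sum beta * (log pi - log pi_ref) over the tokens of each step.

    step_ends holds the exclusive end index of each step in the token sequence.
    """
    rewards, start = [], 0
    for end in step_ends:
        log_ratio = sum(p - q for p, q in
                        zip(logp_policy[start:end], logp_ref[start:end]))
        rewards.append(beta * log_ratio)
        start = end
    return rewards

# Five tokens, two steps: tokens 0-1 form step 1, tokens 2-4 form step 2.
logp_policy = [-0.5, -1.0, -0.2, -2.0, -0.3]
logp_ref    = [-0.7, -1.5, -0.4, -1.0, -0.3]

rewards = implicit_step_rewards(logp_policy, logp_ref, step_ends=[2, 5])
print([round(r, 3) for r in rewards])  # [0.07, -0.08]
```

A positive reward means the trained model finds the step more likely than the reference does; a negative one means less likely.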

Family 3: Generative / reasoning PRMs (the frontier)

Generative PRMs (R-PRM, PRM-CoT and successors)

Instead of emitting a bare scalar, these write a short critique of each step and then judge it, so the verdict comes with a rationale. They turn step grading into a reasoning task the model is already good at, tend to generalize better off-distribution, and can be sampled multiple times and majority-voted. The cost is inference latency: you generate text to score text.

How they are judged

Two benchmarks dominate. ProcessBench measures whether a PRM can locate the first wrong step in a solution. PRMBench stress-tests finer abilities like catching redundant-but-correct steps and resisting confident-sounding errors. A recurring finding worth remembering: a high best-of-N score does not guarantee good error localization, and vice versa, so pick the model by the axis your application needs.
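Error localization can be sketched as a thresholding rule over step scores. This is a simplified illustration, not the official scoring code of either benchmark: the 0.5 threshold, the use of -1 for "no error found", and the plain accuracy metric are all assumptions of the sketch.

```python
def predict_first_error(step_scores, threshold=0.5):
    """Index of the first step scored below threshold, or -1 if none is."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return -1

def localization_accuracy(examples, threshold=0.5):
    """Fraction of examples where the predicted first error matches the label."""
    hits = sum(predict_first_error(ex["scores"], threshold) == ex["first_error"]
               for ex in examples)
    return hits / len(examples)

examples = [
    {"scores": [0.9, 0.8, 0.2, 0.6], "first_error": 2},    # located correctly
    {"scores": [0.9, 0.95, 0.9],     "first_error": -1},   # correct solution, no alarm
    {"scores": [0.7, 0.4, 0.3],      "first_error": 2},    # alarm fires one step early
]

print(localization_accuracy(examples))
```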

Training and Running One in Practice

This section is the hands-on part: the exact data shape, a Monte Carlo labeling sketch, a training run with Hugging Face TRL, and inference with the Qwen step-token convention plus a vLLM note. The snippets target tooling that exists today (TRL, Transformers, vLLM), but these APIs move quickly, so check them against the versions you have installed.

Step 1: the data format

A PRM training example is a problem, a list of step strings, and a parallel list of step labels. TRL's PRMTrainer expects exactly three columns: prompt, completions (the list of steps), and labels (one entry per step; TRL's documented stepwise-supervision format uses booleans, so threshold soft fractions such as 0.75 before training). A single real-shaped row:

one PRM training row (Math-Shepherd style, JSON)

{
  "prompt": "Weng earns $12/hr babysitting. Yesterday she did 50 min. How much did she earn?",
  "completions": [
    "Step 1: 50 minutes is 50/60 = 5/6 of an hour.",
    "Step 2: She earns 12 * (5/6) = 10 dollars.",
    "Step 3: So Weng earned $10. The answer is 10."
  ],
  "labels": [true, true, true]
}

A negative example flags the error step. By a common convention (PRM800K supervises only up to the first incorrect step), annotation stops at the first error because later steps are too ill-defined to judge. In this flat format every step still needs a label, so steps after the first error are marked false as well:

a row with a mistake at step 2 (1.5 - 0.5 should be 1.5 + 0.5)

{
  "prompt": "A train goes 60 mi in 1.5 h then 40 mi in 0.5 h. Average speed?",
  "completions": [
    "Step 1: total distance = 60 + 40 = 100 miles.",
    "Step 2: total time = 1.5 - 0.5 = 1.0 hours.",
    "Step 3: average = 100 / 1.0 = 100 mph."
  ],
  "labels": [true, false, false]
}

Step 2: generate labels by Monte Carlo rollout

If you do not have human labels, manufacture them with the §4 recipe: from each step prefix, sample completions and score the value as the fraction that reach the verified answer. The core loop, vLLM for fast sampling:

auto_label.py : Monte Carlo step labeling

import re
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Math-7B-Instruct")   # the completer / solver
params = SamplingParams(n=8, temperature=0.8, max_tokens=512)

def verify(text, ground_truth):
    # Illustrative final-answer checker: exact match on the last \boxed{...} in the completion
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return bool(boxed) and boxed[-1].strip() == str(ground_truth).strip()

def label_steps(problem, steps, ground_truth):
    labels = []
    for i in range(len(steps)):
        prefix = problem + "\n" + "\n".join(steps[:i+1])   # question plus steps 1..i+1
        outs = llm.generate([prefix], params)[0].outputs   # 8 rollouts from this prefix
        wins = sum(verify(o.text, ground_truth) for o in outs)
        labels.append(wins / len(outs))             # soft label = success fraction, e.g. 6/8 = 0.75
    return labels                                  # or threshold to hard bools

Only verify(), a final-answer checker (boxed-answer match for math, unit tests for code), is needed. The expensive step judgment is bootstrapped from the cheap outcome check, exactly the §4 efficiency point in code.
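If the trainer you use expects boolean step labels, the soft fractions need a thresholding pass before they become a training row. A hedged sketch: the helper names are hypothetical, the threshold of 0.0 mirrors the "good if any rollout succeeds" rule, and marking every step after the first bad one as bad is one convention among several.

```python
def to_hard_labels(soft_labels, threshold=0.0):
    """Threshold success fractions, then mark everything after the first bad step as bad."""
    hard, seen_bad = [], False
    for value in soft_labels:
        seen_bad = seen_bad or not (value > threshold)
        hard.append(not seen_bad)
    return hard

def to_training_row(prompt, steps, soft_labels, threshold=0.0):
    """Build one prompt / completions / labels row from rollout fractions."""
    if len(steps) != len(soft_labels):
        raise ValueError("need exactly one label per step")
    return {
        "prompt": prompt,
        "completions": list(steps),
        "labels": to_hard_labels(soft_labels, threshold),
    }

row = to_training_row(
    "A train goes 60 mi in 1.5 h then 40 mi in 0.5 h. Average speed?",
    ["Step 1: total distance = 60 + 40 = 100 miles.",
     "Step 2: total time = 1.5 - 0.5 = 1.0 hours.",
     "Step 3: average = 100 / 1.0 = 100 mph."],
    soft_labels=[0.75, 0.0, 0.0],
)
print(row["labels"])  # [True, False, False]
```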

Step 3: train the PRM with TRL

With a labeled dataset in the three-column format, training is a few lines. TRL inserts a separator token after each step and applies the per-step BCE loss from §5 automatically:

train_prm.py : Hugging Face TRL PRMTrainer

from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMTrainer, PRMConfig

model_id = "Qwen/Qwen2.5-Math-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
# 2 labels = {bad step, good step}; the head is a per-token classifier
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=2)

dataset = load_dataset("trl-lib/math_shepherd")   # prompt / completions / labels

cfg = PRMConfig(
    output_dir="my-math-prm",
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    num_train_epochs=1,
    step_separator="\n",        # where a step ends; a reward token is placed here
    max_completion_length=1024,
)

trainer = PRMTrainer(model=model, args=cfg, train_dataset=dataset["train"], processing_class=tok)
trainer.train()

Under the hood TRL tokenizes each step, appends a reward token, masks the loss so only those reward-token positions contribute, and minimizes BCE against your labels. The same recipe runs from the command line on the bundled script:

one-liner with the TRL example script

accelerate launch examples/scripts/prm.py \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/math_shepherd \
  --num_train_epochs 1 \
  --output_dir Qwen2-0.5B-Reward-Math-Shepherd

Step 4: run it for scoring

Inference convention differs by model, and the one to know is Qwen's, because it is the strongest and the most copied. Steps are joined by a special <extra_0> token, and the reward for each step is the probability that its <extra_0> position is classified positive:

score_with_qwen_prm.py : the official inference shape

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "Qwen/Qwen2.5-Math-PRM-7B"
tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval()

steps = [
    "Step 1: 50 minutes is 5/6 of an hour.",
    "Step 2: 12 * 5/6 = 10 dollars.",
    "Step 3: The answer is \\boxed{10}.",
]
messages = [
    {"role": "system", "content": "Please reason step by step..."},
    {"role": "user", "content": "Weng earns $12/hr..."},
    {"role": "assistant", "content": "<extra_0>".join(steps) + "<extra_0>"},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
ids = tok.encode(text, return_tensors="pt").to(model.device)

with torch.no_grad():                        # scoring only, no gradients needed
    logits = model(input_ids=ids)[0]         # (batch, seq_len, 2): bad / good per token
sep_id = tok.encode("<extra_0>")[0]
mask = (ids == sep_id)                       # True at each step-separator position
probs = F.softmax(logits.float(), dim=-1)
rewards = probs[mask][:, 1]                  # P(good) read at every <extra_0>
print(rewards.tolist())                      # one reward in [0, 1] per step

That output vector, one number per step, is the dense signal everything else consumes: aggregate it by min for best-of-N reranking (§6), feed it branch by branch into a beam or MCTS search (§6), or pipe it as the per-step reward into a GRPO loop (§7). For production scoring at scale, serve the PRM behind a vLLM server and batch hundreds of candidate solutions per call; Skywork and Qwen PRMs both run this way, which is how best-of-64 stays affordable.

Practical gotchas worth stating in an interview

Match the step separator to how the solutions were generated (Qwen-Math uses double newlines between steps). Keep the PRM's base model close to the policy's distribution or scores drift. And if you are doing implicit-PRM style training, you do not run any of Step 2 at all: you train on outcome labels only and read the per-step reward off the log-ratio, the cheapest path when step annotation is infeasible.

The Interview Cheat Sheet

Question	The crisp answer
ORM vs PRM in one line	ORM scores the final answer; PRM scores every step, giving dense local credit assignment
What is a step	a checkable chunk of reasoning, usually a line in math or a block in code; fine enough to localize blame, coarse enough to carry a verdict
What does a label mean	either correctness (true in isolation) or value (probability of eventually reaching the right answer); auto labeling uses value
Where labels come from	rollout (Math-Shepherd): complete each prefix k times, label by fraction that reach the verified correct answer; only a final answer checker needed
Soft vs hard labels	soft keeps the success fraction as a regression target; hard thresholds it (good if any rollout succeeds)
How it is trained	base LLM with a scalar head, per step BCE or MSE against labels, scored causally at a marker token using only the prefix
Aggregation	min (strict, the usual default), product, mean, or last step
Inference uses	best-of-N reranking, step level beam search, PRM guided MCTS; the basis of test time compute scaling
Training use	dense per step reward inside PPO or GRPO, fixing the sparse terminal reward problem
Main risks	reward hacking, verifier label noise, distribution shift during RL, process versus outcome disagreement

The three threads that connect it all

A PRM is a value function. Its auto labels are literally the probability of eventual success from a partial solution, the V(s) of the RL blogs. That is why it can serve as a search heuristic and a dense reward interchangeably.

Its labels come from Monte Carlo. Rollout labeling is exactly the wait-for-the-ending averaging of the foundations blog, applied to a prefix instead of a full episode.

It fixes the sparse reward problem. Everything painful about the single end of sequence reward in PPO and GRPO, slow credit assignment, untrainable critic, is what the dense PRM signal directly addresses, at the price of a richer reward to hack.

The one paragraph summary

A process reward model grades reasoning step by step instead of judging only the final answer, turning a sparse terminal reward into dense local feedback. A step is a checkable chunk; a step label is best understood as its value, the probability that continuing from here reaches the correct answer, which is why labels can be generated automatically by rolling out each prefix many times and counting verified successes, no human step annotation required. The PRM is then a base model with a scalar head trained to predict those labels causally, and at inference it reranks candidates, guides beam search and MCTS, or feeds a dense reward into PPO and GRPO. It is, underneath, a learned value function for reasoning, with all the power and all the reward hacking risk that implies.