Continues from the PPO Deep Dive

Beyond PPO

Every algorithm that came after PPO. What was broken, what changed, and what it means for training your language model. Same notation as the PPO guide, with natural-language prompts and responses throughout.

🔁 Picks up from PPO Deep Dive ✂️ 8 algorithms 🎬 6 animations 🔧 LLM-specific

The Story So Far

In the PPO guide we built up from REINFORCE → Actor-Critic → PPO. We applied it to an LLM and understood every symbol and every gradient. Here's what PPO gave us, and what it didn't.

💬 PPO gave us…

A stable RL training loop with clipped updates (prevents catastrophic steps), a value function baseline (low-variance advantages via GAE), and a KL penalty against a reference policy (prevents reward hacking). Four models: policy πθ, reference πref, reward model rφ, critic Vψ.

Six Problems PPO Did Not Solve

4 models in VRAM simultaneously: about 56 GB of weights alone if all four are 7B models in 16-bit precision, before optimizer state and activations

Critic is hard to train from sparse end-of-sequence rewards

Learned reward model can be gamed (reward hacking)

Symmetric ε-clip kills entropy: rare but valid tokens get suppressed

Single end-reward: in a 500-token response, 499 tokens get no direct signal, so credit assignment is slow

All-correct or all-wrong batches waste compute entirely

Each algorithm below targets one or more of these. The notation stays identical to the PPO guide — new symbols are added for each method and listed explicitly.

Notation Additions (new symbols only)

New notation added in this guide — all PPO symbols carry over unchanged

Symbol	Name	Meaning
G	Group size	Number of responses sampled per prompt for advantage estimation. Replaces the critic. Typical: 8–16.
μ_r, σ_r	Group reward stats	Mean and std of rewards within a group of G responses to the same prompt.
ε_low, ε_high	Asymmetric clip bounds	DAPO replaces PPO's symmetric ε with decoupled lower/upper bounds. ε_high > ε_low allows low-prob tokens to grow.
s_i(θ)	Sequence-level ratio	GSPO's single importance ratio per complete response, computed as the geometric mean of per-token ratios. Replaces the per-token ratios r_t(θ) in the clipped objective.
IPR_t	Implicit process reward	PRIME's dense per-token reward: log[πθ(a_t|s_t) / π_ref(a_t|s_t)]. Already in our notation as a KL component; PRIME repurposes it as a reward signal.
o_i	Response i in a group	One of G generated responses for the same prompt x. Used in GRPO and all variants.

The Evolution Timeline

2017: PPO

The baseline. Stable, broadly applicable. Requires 4 models.

2024: GRPO + RLOO + ReMax

Eliminate the critic. Group-based or leave-one-out baselines replace Vψ. Down to 2–3 models.

Jan 2025: DeepSeek-R1 + RLVR breakthrough

GRPO + rule-based rewards achieves SOTA reasoning. No learned reward model at all. Transforms the field.

2025 Q1: DAPO + REINFORCE++ + Dr. GRPO

Fix GRPO's failure modes: entropy collapse, normalization bias, zero-gradient groups.

2025 Q2–Q3: GPG + GSPO

GSPO moves to sequence-level ratio (fixes MoE instability). GPG proves you can strip everything.

2025: PRIME

Dense process rewards without human annotation. Every step of the response sequence gets a reward signal.

GRPO — Kill the Critic

GRPO's single insight: for LLMs, you can replace the expensive value function V_ψ(s_t) with the group mean reward. Sample G responses to the same prompt; each response's advantage is simply how far its reward deviates from the group's average.

Changes from PPO

PPO: requires V_ψ(s_t), a full-size critic model trained alongside the policy

PPO: GAE advantage Â_t = δ_t + (γλ)δ_{t+1} + ..., which needs V at every token

PPO: 4 models in memory: πθ, πref, rφ, Vψ

GRPO: group advantage Â_i = (r_i − μ_r) / σ_r; no model needed, just G reward evaluations

GRPO: 2 models in memory, πθ (trainable) and πref (frozen), plus rφ if the reward is learned rather than rule-based

Unchanged: PPO-CLIP objective with clipping at [1−ε, 1+ε]

Changed placement: the KL penalty against πref is kept, but added to the objective as an explicit term instead of being folded into the per-token reward

\hat{A}_i = \frac{r_i - \mu_r}{\sigma_r}, \quad \mu_r = \frac{1}{G}\sum_{j=1}^G r_j, \quad \sigma_r = \sqrt{\frac{1}{G}\sum_j (r_j - \mu_r)^2}

GRPO advantage — the group is the critic. No Vψ needed. Positive means above group average.

\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{x,\{o_i\}}\!\left[\frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\left(r_t(\theta)\hat{A}_i,\,\text{clip}(r_t(\theta),1-\varepsilon,1+\varepsilon)\hat{A}_i\right) - \beta\,D_{KL}(\pi_\theta\|\pi_{ref})\right]

Full GRPO objective — identical to PPO-CLIP in structure, but Â_i comes from group normalization, not Vψ

🔧 Worked example — a group of G=8 answers to one prompt

Prompt: Explain why the sky appears blue during the day.
The model samples 8 answers. A judge model scores each on factual accuracy and clarity:
r = [0.91, 0.83, 0.78, 0.94, 0.61, 0.88, 0.55, 0.82] → μ_r=0.79, σ_r=0.13
Response 4 (r=0.94): Â_4 = (0.94 − 0.79)/0.13 = +1.15 (well above average, push its token probabilities up)
Response 5 (r=0.61): Â_5 = (0.61 − 0.79)/0.13 = −1.38 (well below average, push its token probabilities down)
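The arithmetic in this worked example can be reproduced with a few lines of plain Python. This is a minimal sketch, not a reference implementation: the function name `group_advantages` and the small `eps` guard are assumptions added for illustration, and the standard deviation divides by G to match the formula above.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Return (mean, std, advantages) for one group of rewards."""
    g = len(rewards)
    mu = sum(rewards) / g
    # Population standard deviation (divide by G), as in the advantage formula.
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / g)
    # eps (an assumption) avoids division by zero when all rewards are equal.
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    return mu, sigma, advantages

rewards = [0.91, 0.83, 0.78, 0.94, 0.61, 0.88, 0.55, 0.82]
mu, sigma, adv = group_advantages(rewards)
print(f"mu={mu:.2f} sigma={sigma:.2f}")
print(f"A_4={adv[3]:+.2f} A_5={adv[4]:+.2f}")
```

With unrounded σ the two advantages come out near +1.14 and −1.37; the worked example rounds σ to 0.13 before dividing, which gives +1.15 and −1.38.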

Figure 1 — GRPO: The Group of Sampled Answers Is the Critic

Each bar is one answer sampled for the same prompt. The amber dashed line is the group mean, which acts as the free critic. Green bars (above the mean) get positive advantages and their token probabilities increase next step. Red bars (below the mean) get negative advantages and decrease.

DAPO — Four Fixes to GRPO

ByteDance identified four specific failure modes in GRPO when training at scale on long response sequences. Each has a targeted fix.

Changes from GRPO

GRPO: symmetric clip clip(r_t, 1−ε, 1+ε), which suppresses low-probability tokens

GRPO: loss normalized per response (divide by |o_i|), which biases toward short sequences

GRPO: KL divergence penalty β·D_KL, which can destabilize long-sequence training

GRPO: all groups used, so all-correct or all-wrong groups waste compute

DAPO: asymmetric clip clip(r_t, 1−ε_low, 1+ε_high) with ε_high > ε_low, so low-prob tokens can grow

DAPO: token-level normalization (divide by Σ_i|o_i|), so all tokens contribute equally regardless of sequence length

DAPO: KL term removed; stability comes from asymmetric clip + dynamic sampling instead

DAPO: dynamic sampling filters out groups where all G responses are correct or all G are wrong (zero-advantage groups)

\mathcal{J}_{DAPO}(\theta) = \mathbb{E}\!\left[\frac{1}{\sum_i|o_i|}\sum_{i=1}^G\sum_t \min\!\left(r_t(\theta)\hat{A}_i,\;\text{clip}(r_t(\theta),1-\varepsilon_{low},1+\varepsilon_{high})\hat{A}_i\right)\right]

DAPO objective — note: normalized by total tokens (not per-response), asymmetric clip, no KL term, filtered batches only

Fix 1 — Clip-Higher: Preventing Entropy Collapse

In GRPO/PPO, the symmetric clip clip(r_t, 0.8, 1.2) bounds probability increases and decreases by the same ratio. In effect this is asymmetric: for a rare token (think a low-frequency word like "albeit" or a specialized term like "Rayleigh"), the allowed absolute increase is tiny. With ε=0.2, a token at probability 0.01 can rise only to 0.012 before the ratio hits 1.2 and the clip zeroes its gradient, while a token at probability 0.9 has no effective ceiling. The model struggles to learn these words even when they would be perfect for the answer.

DAPO sets ε_low=0.2, ε_high=0.28. The clip for increasing probability is looser. Rare-but-valid tokens get room to grow during training.
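To see what the looser upper bound buys, here is a small sketch of the per-token clipped term and of the highest probability a token can reach before the upper clip engages. The helper names `clipped_surrogate` and `max_unclipped_prob` are hypothetical, chosen for this illustration only.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Per-token term: min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A)."""
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped * advantage)

def max_unclipped_prob(p_old, eps_high):
    """Largest new probability whose ratio is still inside the upper bound."""
    return min(1.0, p_old * (1.0 + eps_high))

for p_old in (0.01, 0.9):
    sym = max_unclipped_prob(p_old, 0.20)   # symmetric PPO/GRPO bound
    asym = max_unclipped_prob(p_old, 0.28)  # DAPO's eps_high
    print(f"p_old={p_old}: symmetric cap={sym:.4f}, clip-higher cap={asym:.4f}")

# A ratio of 1.25 with a positive advantage is clipped under the symmetric
# bound but passes through unchanged with eps_high=0.28.
print(clipped_surrogate(1.25, 1.0, 0.2, 0.2), clipped_surrogate(1.25, 1.0, 0.2, 0.28))
```

The rare token's ceiling moves from 0.012 to 0.0128 per update, a small absolute change that compounds over many steps; the common token is capped at 1.0 either way.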

Figure 2 — Clip-Higher: GRPO vs DAPO Token Entropy

Probability mass for three vocabulary buckets over 2000 training steps. "the" is a common token; "scattering" is a moderate-frequency word for this prompt; "Rayleigh" is rare but the technically correct term. GRPO (left) clips the "Rayleigh" gradient aggressively, so its probability decays to zero and the word is never used again. DAPO (right) with ε_high=0.28 lets it keep a viable probability throughout training.

Fix 2 — Dynamic Sampling: Skip Zero-Gradient Batches

When all G responses in a group receive the same reward, whether all good (reward ≈ 1) or all bad (reward ≈ 0), the numerator r_i − μ_r of the group advantage is zero for every sample, so the group contributes essentially no gradient. DAPO filters these groups out and keeps sampling until the batch is filled with mixed-outcome groups (at least one good and one bad answer).
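The filtering rule itself is short. The sketch below assumes each group is a dict with a `rewards` list, which is a hypothetical data layout; the resampling loop that refills the batch is left out.

```python
def has_signal(rewards, tol=1e-8):
    """True if the group has mixed outcomes, i.e. rewards are not all equal."""
    return max(rewards) - min(rewards) > tol

def filter_groups(groups):
    """Keep only groups whose advantages would be non-zero."""
    return [g for g in groups if has_signal(g["rewards"])]

groups = [
    {"prompt": "p1", "rewards": [1, 1, 1, 1]},  # all accepted: dropped
    {"prompt": "p2", "rewards": [1, 0, 0, 1]},  # mixed: kept
    {"prompt": "p3", "rewards": [0, 0, 0, 0]},  # all rejected: dropped
]
kept = filter_groups(groups)
print([g["prompt"] for g in kept])
```

Checking the reward spread rather than the advantages avoids the 0/0 that the σ_r division would produce on a uniform group.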

Figure 3 — Dynamic Sampling: Filter Zero-Gradient Batches

Incoming training batches. Each box is one of G=6 sampled answers (green = the judge accepted it, red = rejected). All-accepted and all-rejected batches have zero advantage everywhere — no gradient signal. DAPO discards them and only trains on mixed batches, which speeds up effective learning and avoids wasted compute.

Fix 3 — Token-Level Normalization: Fix Length Bias

GRPO normalizes the loss by each response's length: sum_t / |o_i|. A correct 4-token answer gets a gradient 3× larger per token than a correct 12-token answer with equal reward. This pushes the model to prefer terse responses even when a more detailed one is genuinely better.

DAPO divides by the total tokens across all G responses: sum_i sum_t / (sum_i |o_i|). Every token contributes equally, regardless of which response it sits in.

🔧 Length bias, worked out

Group of 2 answering "Why is the sky blue?":
Response A = "Sunlight scatters in air." (4 tokens, r=0.85)
Response B = "Shorter blue wavelengths scatter more than longer red wavelengths in the atmosphere." (12 tokens, r=0.85)
Same reward, but very different sentence quality.
GRPO: A gradient weight = 1/4 = 0.25 per token. B gradient weight = 1/12 = 0.083 per token. A gets 3× more gradient pull per token despite the longer answer being more informative.
DAPO: both normalized by 4+12=16 total tokens. A weight = 1/16. B weight = 1/16. Equal treatment.
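The two normalizations can be compared directly by computing the weight each token receives in the loss. This is an illustrative sketch with hypothetical function names; unlike the worked example, the sample-level version includes the 1/G factor from the objective, which rescales both responses equally and leaves the 3× ratio intact.

```python
def sample_level_weights(lengths):
    """Per-token weight under per-response normalization: 1 / (G * |o_i|)."""
    g = len(lengths)
    return [1.0 / (g * n) for n in lengths]

def token_level_weights(lengths):
    """Per-token weight under token-level normalization: 1 / sum_i |o_i|."""
    total = sum(lengths)
    return [1.0 / total for _ in lengths]

lengths = [4, 12]  # Response A and Response B from the example
sample_w = sample_level_weights(lengths)
token_w = token_level_weights(lengths)
print("sample-level:", sample_w, "ratio A/B =", sample_w[0] / sample_w[1])
print("token-level: ", token_w)
```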

REINFORCE++ — PPO Stability, No Critic

REINFORCE++ takes the opposite design philosophy to DAPO: instead of stripping things from GRPO, it adds PPO's stability tricks back to a clean REINFORCE baseline. The result is a method that matches PPO's training behavior without needing V_ψ.

Changes from vanilla REINFORCE

REINFORCE: no baseline; the full reward is the advantage signal, so variance is very high

REINFORCE: no update constraint; large steps can destroy the policy

REINFORCE++: group mean baseline (from GRPO), Â_i = r_i − μ_r, for variance reduction

REINFORCE++: per-token KL penalty β·log(πθ/πref), from PPO, prevents reward hacking

REINFORCE++: clipped importance ratio r_t(θ) at [1−ε, 1+ε], from PPO, prevents catastrophic steps

REINFORCE++: advantage normalization across the whole batch rather than per group, stabilizes gradient scale

REINFORCE++: no critic Vψ, same as GRPO

\hat{A}_i^{R++} = r_i - \mu_r \quad \text{(no } \div\sigma_r\text{, no critic)}

REINFORCE++ advantage: mean-centered only within the group. No division by the group standard deviation σ_r (avoids the bias Dr. GRPO later identifies); gradient scale is handled by the batch-level normalization instead.

The KL penalty is the same per-token formulation from PPO: β·log[πθ(a_t|s_t)/πref(a_t|s_t)] subtracted from the reward at each token. This was removed in DAPO; REINFORCE++ keeps it, providing a softer anchor to the reference policy.
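A minimal sketch of that per-token reward shaping, assuming the per-token log-probabilities under the policy and the reference are already available as lists. The function name and the value β = 0.1 are illustrative assumptions, not settings from the REINFORCE++ paper.

```python
def kl_shaped_rewards(logp_policy, logp_ref, final_reward, beta=0.1):
    """Per-token reward: -beta * log(pi_theta / pi_ref), plus the scalar
    sequence reward added at the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += final_reward
    return rewards

logp_policy = [-1.0, -0.5, -2.0]  # hypothetical log pi_theta(a_t | s_t)
logp_ref = [-1.2, -0.5, -1.5]     # hypothetical log pi_ref(a_t | s_t)
shaped = kl_shaped_rewards(logp_policy, logp_ref, final_reward=1.0)
print([round(r, 3) for r in shaped])
```

Tokens the policy now prefers more than the reference did (first token) are taxed; tokens it prefers less (last token) get a small bonus, which is what keeps the policy near πref.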

🔧 Why the KL term matters in practice

Without a KL anchor to the reference policy, the model often collapses onto whatever generic phrasing happened to score well early. You see the same hedging openers (think "It is important to note that..." or "In summary,...") on every prompt, because those phrases never lose points and the policy slowly stops exploring. REINFORCE++ keeps the KL anchor while still avoiding the expensive critic. A good middle ground when you want more conservative training than DAPO.

RLOO — Leave-One-Out Baseline

RLOO (REINFORCE Leave-One-Out) makes one targeted change to GRPO's advantage estimate: the baseline for response i should not include response i itself. GRPO's group mean includes the response being evaluated, which introduces a small bias: each mean-centered advantage is shrunk by a factor of (G−1)/G relative to its leave-one-out counterpart.

Changes from GRPO

GRPO: baseline = μ_r = (1/G)Σ_{j=1..G} r_j, which includes r_i in its own baseline

RLOO: baseline = (1/(G−1))Σ_{j≠i} r_j (leave-one-out: r_i is excluded from its own baseline)

Unchanged: PPO-CLIP objective

Unchanged: no critic. The KL term is optional in RLOO

\hat{A}_i^{RLOO} = r_i - \frac{1}{G-1}\sum_{j \neq i} r_j

RLOO advantage — unbiased estimator. Response i is not used in its own baseline. Provably lower bias than GRPO at small G.

💡 When does this bias actually bite?

Most noticeable at small G (4–6). With G=4, r_i makes up 25% of its own baseline, so the mean-centered advantage is only 75% of the leave-one-out one. RLOO's baseline is instead built from the 3 other samples. At G=16 the factor is 15/16 and the difference is negligible. If you're memory-constrained to G=4–6, RLOO is worth the switch.
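The relationship between the two baselines is easy to check numerically. The sketch below uses made-up rewards and hypothetical function names, and compares mean-centered advantages (without the σ_r division) to leave-one-out advantages.

```python
def rloo_advantages(rewards):
    """r_i minus the mean of the other G-1 rewards."""
    g = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]

def mean_centered_advantages(rewards):
    """r_i minus the mean of all G rewards (r_i included)."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

rewards = [0.9, 0.6, 0.7, 0.4]  # illustrative values, G = 4
loo = rloo_advantages(rewards)
centered = mean_centered_advantages(rewards)
for a_loo, a_mean in zip(loo, centered):
    print(f"RLOO={a_loo:+.4f}  mean-centered={a_mean:+.4f}  ratio={a_mean / a_loo:.2f}")
```

Every ratio equals (G−1)/G = 0.75, so for mean-centered advantages the two estimators differ only by a constant scale that shrinks toward 1 as G grows.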

Dr. GRPO — Remove Normalization Bias

Dr. GRPO identifies a subtle but consequential bug: dividing by σ_r (the group standard deviation) introduces a systematic bias that over-weights low-variance prompts. When the answers to a prompt all score almost the same (typically very easy or very hard prompts), σ_r is tiny, so negligible reward gaps are blown up into full-size advantages. Prompts with a wide, informative spread of rewards get no such boost, so their real signal is weighted no more heavily than noise elsewhere.

Changes from GRPO

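GRPO: Â_i = (r_i − μ_r) / σ_r; division by σ_r inflates advantages on low-variance prompts, where all answers score nearly the same

Dr. GRPO: Â_i = r_i − μ_r; mean-centering only, no σ normalization

Unchanged: the rest of the GRPO objective as presented here (the Dr. GRPO paper additionally drops the per-response 1/|o_i| length normalization)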

\hat{A}_i^{DrGRPO} = r_i - \mu_r \quad \text{(no division by } \sigma_r\text{)}

Dr. GRPO advantage: mean-centered but not scaled. Advantage magnitude tracks the actual reward gap, so low-variance prompts are no longer inflated.

⚠️ The σ bias, made concrete

Easy prompt (factual recall, e.g. "What is the capital of France?"): rewards [0.92, 0.90, 0.91, 0.89] → μ=0.905, σ≈0.011. GRPO: Â_1 = (0.92−0.905)/0.011 ≈ +1.34
Hard prompt (open-ended reasoning, e.g. "Argue both sides of remote work."): rewards [0.94, 0.61, 0.78, 0.55] → μ=0.72, σ≈0.15. GRPO: Â_1 = (0.94−0.72)/0.15 ≈ +1.44
GRPO treats these advantages as roughly equal, but the easy prompt's raw reward gap is tiny (0.015) compared to the hard prompt's (0.22). Dividing by σ artificially magnifies the easy prompt's signal. Dr. GRPO removes that distortion.
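The comparison can be reproduced with the sketch below, which computes both advantage definitions for the first response of each prompt. Function names and the `eps` guard are assumptions for illustration; the standard deviation divides by G, as in the GRPO formula.

```python
import math

def centered(rewards):
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

def population_std(rewards):
    return math.sqrt(sum(c ** 2 for c in centered(rewards)) / len(rewards))

def grpo_advantages(rewards, eps=1e-8):
    """Mean-centered and divided by the group std."""
    sigma = population_std(rewards)
    return [c / (sigma + eps) for c in centered(rewards)]

def dr_grpo_advantages(rewards):
    """Mean-centered only."""
    return centered(rewards)

easy = [0.92, 0.90, 0.91, 0.89]
hard = [0.94, 0.61, 0.78, 0.55]
for name, rewards in (("easy", easy), ("hard", hard)):
    print(
        name,
        f"sigma={population_std(rewards):.3f}",
        f"GRPO A_1={grpo_advantages(rewards)[0]:+.2f}",
        f"Dr.GRPO A_1={dr_grpo_advantages(rewards)[0]:+.3f}",
    )
```

Under GRPO the two advantages land within about 0.1 of each other; under Dr. GRPO they differ by more than a factor of ten, matching the raw reward gaps.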

GSPO — Sequence-Level Importance Ratio

GSPO makes a deeper structural argument: the probability ratio r_t(θ) = πθ(a_t|s_t)/πold(a_t|s_t) is at the wrong granularity. We apply a token-level correction to a sequence-level reward. The mismatch compounds over long responses, creating high variance — especially in MoE architectures where expert routing can differ between numerator and denominator.

Changes from GRPO

GRPO: token-level ratio r_t(θ) = πθ(a_t|s_t) / πold(a_t|s_t), i.e. T separate ratios per response

GRPO: per-token clip applied independently to each r_t

GSPO: sequence-level ratio s_i(θ) = exp((1/|o_i|) Σ_t log[πθ/πold]), one ratio per full response

GSPO: single clip applied to s_i, which aligns the optimization unit with the reward unit

Unchanged: group advantage Â_i, same as GRPO

KL penalty: omitted from the GSPO objective as written in this section; it can be added back as in GRPO

s_i(\theta) = \exp\!\left(\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\log\frac{\pi_\theta(a_t|s_t)}{\pi_{old}(a_t|s_t)}\right)

GSPO sequence-level ratio: the geometric mean of the per-token ratios (exp of the mean log-ratio). One stable number per response instead of T noisy ones.
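The sketch below computes the per-token ratios and the sequence-level ratio from the same pair of log-probability lists. The log-probabilities are made-up numbers for a 7-token response, and the function names are hypothetical.

```python
import math

def token_ratios(logp_new, logp_old):
    """One importance ratio per token: pi_theta / pi_old."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_ratio(logp_new, logp_old):
    """exp of the mean per-token log-ratio, i.e. the geometric mean of the
    token ratios."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

# Hypothetical per-token log-probabilities for one 7-token response.
logp_old = [-1.2, -0.7, -2.3, -0.4, -1.6, -0.9, -0.1]
logp_new = [-0.9, -0.8, -2.6, -0.3, -1.1, -1.2, -0.1]

ratios = token_ratios(logp_new, logp_old)
s = sequence_ratio(logp_new, logp_old)
print("token ratios:", [round(r, 2) for r in ratios])
print("sequence ratio:", round(s, 3))
```

In this example individual token ratios fall both below 0.8 and above 1.2, while the sequence-level ratio stays close to 1 and inside a [0.8, 1.2] clip range.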

\mathcal{J}_{GSPO}(\theta)=\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^G \min\!\left(s_i(\theta)\hat{A}_i,\;\text{clip}(s_i(\theta),1-\varepsilon,1+\varepsilon)\hat{A}_i\right)\right]

GSPO objective — identical structure to GRPO/PPO but uses s_i (one ratio per response) instead of r_t (one ratio per token)

Figure 4 — Token-Level vs Sequence-Level Ratio: Variance Comparison

A 7-token answer "Dropout randomly zeroes units during training." Left: GRPO computes 7 separate token ratios with large variance: some land at ratio 1.6 (too high, gets clipped), some at 0.5 (too low). Right: GSPO collapses them into a single geometric mean ≈ 1.04, which sits comfortably inside the [0.8, 1.2] clip range. One stable update signal per response.

GPG — The Bare Minimum

GPG asks a provocative question: once the critic is gone (GRPO) and the KL penalty is gone (DAPO), what actually remains to strip? How much of PPO's remaining machinery, the clipped surrogate and the importance ratio, is truly necessary for verifiable rewards?

The answer: just the policy gradient with a group mean baseline. No surrogate objective, no clipping, no KL term, no reference model.

Changes from GRPO

Removed: min(r_t·Â, clip(r_t,1±ε)·Â), the clipped surrogate objective

Removed: KL divergence penalty β·D_KL(πθ||πref)

Removed: πref, since no reference model is needed

GPG: direct log-probability gradient Σ_t log πθ(a_t|s_t) · Â_i

GPG: no clipping, full gradient at every step

Unchanged: group mean baseline Â_i = r_i − μ_r, from GRPO

\mathcal{J}_{GPG}(\theta)=\mathbb{E}_{x,\{o_i\}}\!\left[\frac{1}{G}\sum_{i=1}^G\sum_{t=1}^{|o_i|}\log\pi_\theta(a_t|s_t)\cdot\hat{A}_i\right]

GPG objective — classical REINFORCE with group mean baseline. No surrogate loss, no clip, no KL, no reference model.
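As a rough illustration, the objective value for one group can be computed from per-token log-probabilities and rewards alone. This sketch uses made-up numbers and a hypothetical function name; in a real trainer the same expression would be handed to an autograd framework, with the advantages treated as constants.

```python
def gpg_objective(token_logps, rewards):
    """(1/G) * sum_i A_i * sum_t log pi_theta(a_t | s_t), with A_i = r_i - mu_r."""
    g = len(rewards)
    mu = sum(rewards) / g
    total = 0.0
    for logps, r in zip(token_logps, rewards):
        total += (r - mu) * sum(logps)
    return total / g

# Two responses to one prompt: hypothetical log-probs and binary rewards.
token_logps = [[-0.5, -1.0], [-2.0]]
rewards = [1.0, 0.0]
value = gpg_objective(token_logps, rewards)
print(value)
```

Maximizing this quantity raises the log-probability of tokens in above-average responses and lowers it for below-average ones, with no ratio, clip, or reference model involved.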

⚠️ When to reach for GPG

GPG is competitive with GRPO on math and code benchmarks when rewards are verifiable and dense enough. With a smooth scalar reward and a small learning rate, the missing clip may not hurt much in practice. But with binary 0/1 rewards (the answer is either right or wrong), the lack of clipping can produce unstable updates. DAPO is the safer default; treat GPG as a useful ablation baseline.

PRIME — Dense Process Rewards

Every method so far inherits PPO's fundamental reward structure: a single scalar at the end of the sequence. For a 500-token model response, 499 tokens receive zero reward signal. PRIME breaks this entirely. It assigns a dense reward to every reasoning step, with no human step-level annotation and no PRM trained on step labels.

The Key Insight: The Log-Ratio Is Already a Reward

From the PPO deep dive, we defined the implicit reward from the optimal policy derivation:

r(x,y) = \beta\log\frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta\log Z(x)

Recall from Section 2 of the PPO guide — the reward is implicit in the policy ratio. This is how DPO derived its loss.

PRIME extends this to the token level. At each step t, define the Implicit Process Reward (IPR):

IPR_t = \log\frac{\pi_\theta(a_t|s_t)}{\pi_{ref}(a_t|s_t)}

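PRIME implicit process reward: the same per-token log-ratio that every PPO/GRPO update already computes for the KL penalty. PRIME reuses it as a dense reward signal. (In the PRIME paper the numerator comes from an implicit PRM that is initialized from the SFT model and updated online on outcome labels; this guide writes it with πθ.)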

Changes from GRPO

GRPO: sparse reward r̃_t = r(x,y) only at final token t=T, zero elsewhere

Conventional PRM: requires expensive human step-level annotations to train

PRIME: dense reward r̃_t = γ·IPR_t + (1−γ)·r(x,y)·[t==T] at every token

PRIME: PRM updated online using only policy rollouts + outcome labels, no human annotations

Unchanged: group sampling and advantage estimation from GRPO

Unchanged: PPO-CLIP objective structure

\tilde{r}_t^{PRIME} = \underbrace{\gamma \cdot \text{IPR}_t}_{\text{dense process reward}} + \underbrace{(1-\gamma)\cdot r(x,y)\cdot\mathbf{1}[t=T]}_{\text{outcome reward at final step}}

PRIME's modified reward at token t. Here γ is a mixing weight in [0, 1] that balances the two terms, not the discount factor γ used in GAE. The IPR term is cheap: it is the log-ratio we already compute for the KL penalty.
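The formula above translates into a few lines of Python. This sketch follows this guide's simplified mixing formula rather than the full PRIME training procedure; the function name, the log-probabilities, and γ = 0.5 are illustrative assumptions.

```python
def prime_rewards(logp_policy, logp_ref, outcome_reward, gamma=0.5):
    """Per-token reward: gamma * IPR_t, plus (1 - gamma) * r(x, y) at t = T."""
    ipr = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    rewards = [gamma * x for x in ipr]
    rewards[-1] += (1.0 - gamma) * outcome_reward
    return rewards

logp_policy = [-1.0, -0.8, -0.3]  # hypothetical log pi_theta(a_t | s_t)
logp_ref = [-1.5, -0.8, -0.1]     # hypothetical log pi_ref(a_t | s_t)
dense = prime_rewards(logp_policy, logp_ref, outcome_reward=1.0)
print([round(r, 3) for r in dense])
```

Every position now carries a non-trivial reward: the first token gets a positive signal from its log-ratio alone, and only the last token also receives the outcome term.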

Figure 5 — PRIME vs GRPO: Dense vs Sparse Rewards Across the Answer

An answer being written step by step: topic sentence → first claim → supporting citation → second claim → another citation → conclusion. GRPO (bottom row) hands out zero reward to every step except the very last, so the model gets no feedback on whether the middle of the answer was reasonable. PRIME (top row) uses the implicit process reward IPR_t at every step, telling the model immediately whether each new sentence moved it closer to a strong final answer.

🔧 Why dense rewards help a language model

Long-form answers are sequential and compositional. Each sentence builds on the previous ones, and the conclusion is only as good as the claims that led to it. With GRPO's sparse reward, the model only learns "the final answer scored 0.83" — it never finds out whether the topic sentence, the first claim, or the citation was the weak link.

PRIME gives a signal at every step. If the next token is one the trained policy now favors more than the reference policy did, IPR > 0 (we are moving in a direction the reward signal endorses). If it is one the policy is now less confident about, IPR < 0. This step-level feedback dramatically improves credit assignment for tokens early in the response.

Full Comparison

Method	Removes	Adds / Changes	Models	Reach for it when…
PPO	—	Baseline	4×	You have a learned reward model and plenty of VRAM
GRPO	Critic Vψ	Group mean baseline	2×	Good default when rewards are verifiable
DAPO	KL term	Asymmetric clip, dynamic sampling, token-norm	2×	⭐ Recommended default — fixes entropy collapse on rare words
REINFORCE++	Critic Vψ	Adds KL + clip back to REINFORCE	2×	You want conservative training with a KL anchor
RLOO	Critic Vψ	Leave-one-out baseline (unbiased)	2×	Small group size G≤6, want an unbiased advantage
Dr. GRPO	σ_r normalization	Mean-only centering	2×	Training on a mix of easy and hard prompts
GSPO	Token-level ratio	Sequence-level geometric-mean ratio	2×	Long responses (T>500 tokens) or MoE architectures
GPG	Clip, KL, πref	Pure REINFORCE + group mean	1×	Memory-extremely-constrained, smooth reward function
PRIME	Sparse reward	Dense IPR at every step	2×	⭐ Highly recommended — step-level feedback for long answers

A Practical Recipe

Figure 6 — Algorithm Decision Flow for LLM Post-Training

Decision tree for choosing your algorithm. The key branching points: do you have a verifiable reward? (yes = RLVR family) Can you define a step-level reward? (yes = PRIME) Are you training on long responses or MoE models? (yes = GSPO). The starred recommendation combines DAPO with PRIME's dense reward signal.

References

Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO). arXiv:2402.03300
DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948
Yu, Q. et al. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476
Hu, J. (2025). REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models. arXiv:2501.03262
Ahmadian, A. et al. (2024). Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback (RLOO). arXiv:2402.14740
Liu, Z. et al. (2025). Understanding R1-Zero-Like Training: A Critical Perspective (Dr. GRPO). COLM 2025. arXiv:2503.20783
Zheng, C. et al. (2025). GSPO: Group Sequence Policy Optimization. arXiv:2507.18071
Chu, X. et al. (2025). GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning. arXiv:2504.02546
Cui, G. et al. (2025). Process Reinforcement through Implicit Rewards (PRIME). arXiv:2502.01456