Continues from the PPO Deep Dive

Beyond PPO

Every algorithm that came after PPO. What was broken, what changed, and what it means for training your language model. Same notation as the PPO guide, with natural-language prompts and responses throughout.

00

The Story So Far

In the PPO guide we built up from REINFORCE → Actor-Critic → PPO. We applied it to an LLM and understood every symbol and every gradient. Here's what PPO gave us, and what it didn't.

💬 PPO gave us…

A stable RL training loop with clipped updates (prevents catastrophic steps), a value function baseline (low-variance advantages via GAE), and a KL penalty against a reference policy (prevents reward hacking). Four models: policy πθ, reference πref, reward model rφ, critic Vψ.

Six Problems PPO Did Not Solve

4 models in VRAM simultaneously — 28GB+ for a 7B model
Critic is hard to train from sparse end-of-sequence rewards
Learned reward model can be gamed (reward hacking)
Symmetric ε-clip kills entropy: diverse token choices get suppressed
Single end-reward: 499 tokens get no signal, slow credit assignment
All-correct or all-wrong batches waste compute entirely

Each algorithm below targets one or more of these. The notation stays identical to the PPO guide — new symbols are added for each method and listed explicitly.

Notation Additions (new symbols only)

New notation added in this guide — all PPO symbols carry over unchanged
G (Group size): Number of responses sampled per prompt for advantage estimation. Replaces the critic. Typical: 8–16.
μ_r, σ_r (Group reward stats): Mean and std of rewards within a group of G responses to the same prompt.
ε_low, ε_high (Asymmetric clip bounds): DAPO replaces PPO's symmetric ε with decoupled lower/upper bounds. ε_high > ε_low allows low-prob tokens to grow.
s_i(θ) (Sequence-level ratio): GSPO's single importance ratio per complete response, computed as the geometric mean of per-token ratios. Replaces the T separate token-level ratios.
IPR_t (Implicit process reward): PRIME's dense per-step reward: log[πθ(a_t|s_t) / πref(a_t|s_t)]. Already in our notation as a KL component; PRIME repurposes it as a reward signal.
o_i (Response i in a group): One of G generated responses for the same prompt x. Used in GRPO and all variants.

The Evolution Timeline

2017: PPO
The baseline. Stable, broadly applicable. Requires 4 models.
2024: GRPO + RLOO + ReMax
Eliminate the critic. Group-based or leave-one-out baselines replace Vψ. Down to 2–3 models.
Jan 2025: DeepSeek-R1 + RLVR breakthrough
GRPO + rule-based rewards achieves SOTA reasoning. No learned reward model at all. Transforms the field.
2025 Q1: DAPO + REINFORCE++ + Dr. GRPO
Fix GRPO's failure modes: entropy collapse, normalization bias, zero-gradient batches.
2025 Q2–Q3: GSPO + GPG
GSPO moves to sequence-level ratio (fixes MoE instability). GPG proves you can strip everything.
2025: PRIME
Dense process rewards without human annotation. Every step of the response sequence gets a reward signal.
01

GRPO — Kill the Critic

GRPO's single insight: for LLMs, you can replace the expensive value function Vψ(st) with the group mean reward. Sample G responses to the same prompt; each response's advantage is simply how far its reward deviates from the group's average.

Changes from PPO
Before: Requires Vψ(s_t), a full-size critic model trained simultaneously
Before: GAE advantage Â_t = δ_t + (γλ)δ_{t+1} + ..., which requires V at every token
Before: 4 models in memory: πθ, πref, rφ, Vψ
After: Group advantage Â_i = (r_i − μ_r) / σ_r; no model needed, just G reward evaluations
After: 2 models in memory: πθ (trainable), πref (frozen)
Unchanged: PPO-CLIP objective with clipping at [1−ε, 1+ε]
Unchanged: KL penalty against πref (added to objective sum)
\hat{A}_i = \frac{r_i - \mu_r}{\sigma_r}, \quad \mu_r = \frac{1}{G}\sum_{j=1}^G r_j, \quad \sigma_r = \sqrt{\frac{1}{G}\sum_j (r_j - \mu_r)^2}
GRPO advantage — the group is the critic. No Vψ needed. Positive means above group average.
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{x,\{o_i\}}\!\left[\frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\left(r_t(\theta)\hat{A}_i,\,\text{clip}(r_t(\theta),1-\varepsilon,1+\varepsilon)\hat{A}_i\right) - \beta\,D_{KL}(\pi_\theta\|\pi_{ref})\right]
Full GRPO objective — identical to PPO-CLIP in structure, but Â_i comes from group normalization, not Vψ
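A minimal sketch of the clipped group objective in plain Python, for intuition only. It assumes per-token log-probabilities are already available as lists, applies the per-response length normalization (divide by |o_i|) that the DAPO section attributes to GRPO, and leaves out the KL term. The function names and the ε = 0.2 default are illustrative, not taken from any library.

```python
import math

def clipped_token_term(ratio, advantage, eps=0.2):
    """PPO-CLIP term for one token: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def grpo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate averaged over a group (KL term omitted).

    logp_new, logp_old: one list of token log-probs per response.
    advantages: one scalar per response, shared by all of its tokens.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        terms = [
            clipped_token_term(math.exp(n - o), adv, eps)
            for n, o in zip(lp_new, lp_old)
        ]
        total += sum(terms) / len(terms)  # per-response length normalization
    return total / len(advantages)        # average over the G responses
```

With a ratio of 1.5 and a positive advantage the term is capped at 1.2·Â. With a negative advantage the unclipped value is kept, because `min` always picks the more pessimistic of the two.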
🔧 Worked example — a group of G=8 answers to one prompt

Prompt: Explain why the sky appears blue during the day.
The model samples 8 answers. A judge model scores each on factual accuracy and clarity:
r = [0.91, 0.83, 0.78, 0.94, 0.61, 0.88, 0.55, 0.82] → μr=0.79, σr=0.13
Response 4 (r=0.94): Â_4 = (0.94 − 0.79)/0.13 = +1.15 (well above average, push its token probabilities up)
Response 5 (r=0.61): Â_5 = (0.61 − 0.79)/0.13 = −1.38 (well below average, push its token probabilities down)
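The same arithmetic as a short Python sketch, using the eight rewards from the example. The function name and the small `eps` guard against a zero standard deviation are assumptions added here. Because the code uses the unrounded σ, its results differ from the hand-rounded figures above in the second decimal place.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std, matching the 1/G in the formula
    return [(r - mu) / (sigma + eps) for r in rewards]

rewards = [0.91, 0.83, 0.78, 0.94, 0.61, 0.88, 0.55, 0.82]
adv = group_advantages(rewards)

for i, (r, a) in enumerate(zip(rewards, adv), start=1):
    print(f"response {i}: r={r:.2f}  advantage={a:+.2f}")
```

Advantages within a group always sum to zero, so every update pushes some responses up and others down.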

Figure 1 — GRPO: The Group of Sampled Answers Is the Critic
Each bar is one answer sampled for the same prompt. The amber dashed line is the group mean, which acts as the free critic. Green bars (above the mean) get positive advantages and their token probabilities increase next step. Red bars (below the mean) get negative advantages and decrease.
02

DAPO — Four Fixes to GRPO

ByteDance identified four specific failure modes in GRPO when training at scale on long response sequences. Each has a targeted fix.

Changes from GRPO
Before: Symmetric clip clip(r_t, 1−ε, 1+ε), which starves low-probability tokens
Before: Loss normalized per-response (divide by |o_i|), which biases toward short sequences
Before: KL divergence penalty β·D_KL, which can destabilize long-sequence training
Before: All batches used, so all-correct or all-wrong groups waste compute
After: Asymmetric clip clip(r_t, 1−ε_low, 1+ε_high) with ε_high > ε_low, so low-prob tokens can grow
After: Token-level normalization (divide by Σ_i |o_i|), so all tokens contribute equally regardless of sequence length
After: KL term removed; stability comes from asymmetric clip + dynamic sampling instead
After: Dynamic sampling, which filters out groups where all G responses are correct or all G are wrong (zero-advantage groups)
\mathcal{J}_{DAPO}(\theta) = \mathbb{E}\!\left[\frac{1}{\sum_i|o_i|}\sum_{i=1}^G\sum_t \min\!\left(r_t(\theta)\hat{A}_i,\;\text{clip}(r_t(\theta),1-\varepsilon_{low},1+\varepsilon_{high})\hat{A}_i\right)\right]
DAPO objective — note: normalized by total tokens (not per-response), asymmetric clip, no KL term, filtered batches only

Fix 1 — Clip-Higher: Preventing Entropy Collapse

In GRPO/PPO, the symmetric clip clip(r_t, 0.8, 1.2) puts the same relative bound on probability increases and decreases. In absolute terms the effect is lopsided. For a rare token (think a low-frequency word like "albeit" or a specialized term like "Rayleigh") sitting at probability 0.01, the upper bound r_t(θ) ≤ 1.2 allows a rise to at most 0.012 before the clip zeroes the gradient, while a common token at 0.9 has effectively no ceiling. The model therefore struggles to learn these words even when they would be perfect for the answer.

DAPO sets εlow=0.2, εhigh=0.28. The clip for increasing probability is looser. Rare-but-valid tokens get room to grow during training.
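To see how much headroom the upper clip bound leaves, the sketch below computes the largest probability a token can reach (relative to the old policy) before a positive-advantage gradient is clipped. The starting probabilities 0.90 and 0.01 are made-up illustrative values, and the function name is hypothetical.

```python
def max_prob_before_clip(p_old, eps_high):
    """Largest new probability that still receives gradient when A > 0.

    The ratio p_new / p_old is clipped at 1 + eps_high, and a
    probability can never exceed 1.
    """
    return min(1.0, p_old * (1.0 + eps_high))

for token, p_old in [("the", 0.90), ("Rayleigh", 0.01)]:
    symmetric = max_prob_before_clip(p_old, 0.20)    # PPO/GRPO upper bound
    clip_higher = max_prob_before_clip(p_old, 0.28)  # DAPO upper bound
    print(f"{token!r}: p_old={p_old}  eps=0.20 -> {symmetric:.4f}  "
          f"eps_high=0.28 -> {clip_higher:.4f}")
```

The common token is limited only by the probability ceiling of 1, while the rare token's absolute gain per round of updates is a small fraction of a percent. Raising ε_high widens that window without touching the lower bound.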

Figure 2 — Clip-Higher: GRPO vs DAPO Token Entropy
Probability mass for three vocabulary buckets over 2000 training steps. "the" is a common token; "scattering" is a moderate-frequency word for this prompt; "Rayleigh" is rare but the technically correct term. GRPO (left) clips the "Rayleigh" gradient aggressively, so its probability decays to zero and the word is never used again. DAPO (right) with εhigh=0.28 lets it keep a viable probability throughout training.

Fix 2 — Dynamic Sampling: Skip Zero-Gradient Batches

When every response in a group earns the same reward, whether all good (reward ≈ 1) or all bad (reward ≈ 0), each r_i equals μ_r, so the group advantage Â_i = (r_i − μ_r)/σ_r is zero for every sample and the group contributes essentially no gradient. DAPO filters these groups out entirely and keeps sampling until the batch is filled. Only groups with mixed outcomes (at least one good and one bad answer) are used for training.
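The filter itself is a few lines. This sketch assumes each group is a dict holding a `rewards` list, which is a hypothetical layout chosen for illustration; a real pipeline would also resample until the batch is full again.

```python
def has_learning_signal(rewards, tol=1e-6):
    """True when rewards in the group differ, so some advantage is nonzero."""
    return max(rewards) - min(rewards) > tol

def filter_groups(groups, tol=1e-6):
    """Keep only groups with mixed outcomes."""
    return [g for g in groups if has_learning_signal(g["rewards"], tol)]

groups = [
    {"prompt": "p1", "rewards": [1, 1, 1, 1]},  # all accepted: dropped
    {"prompt": "p2", "rewards": [0, 0, 0, 0]},  # all rejected: dropped
    {"prompt": "p3", "rewards": [1, 0, 1, 0]},  # mixed: kept
]
kept = filter_groups(groups)
print([g["prompt"] for g in kept])
```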

Figure 3 — Dynamic Sampling: Filter Zero-Gradient Batches
Incoming training batches. Each box is one of G=6 sampled answers (green = the judge accepted it, red = rejected). All-accepted and all-rejected batches have zero advantage everywhere — no gradient signal. DAPO discards them and only trains on mixed batches, which speeds up effective learning and avoids wasted compute.

Fix 3 — Token-Level Normalization: Fix Length Bias

GRPO normalizes the loss by each response's length: sum_t / |o_i|. A correct 4-token answer gets a gradient 3× larger per token than a correct 12-token answer with equal reward. This pushes the model to prefer terse responses even when a more detailed one is genuinely better.

DAPO divides by the total tokens across all G responses: sum_i sum_t / (sum_i |o_i|). Every token contributes equally, regardless of which response it sits in.

🔧 Length bias, worked out

Group of 2 answering "Why is the sky blue?":
Response A = "Sunlight scatters in air." (4 tokens, r=0.85)
Response B = "Shorter blue wavelengths scatter more than longer red wavelengths in the atmosphere." (12 tokens, r=0.85)
Same reward, but very different sentence quality.
GRPO: A gradient weight = 1/4 = 0.25 per token. B gradient weight = 1/12 = 0.083 per token. A gets 3× more gradient pull per token despite the longer answer being more informative.
DAPO: both normalized by 4+12=16 total tokens. A weight = 1/16. B weight = 1/16. Equal treatment.
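The two normalizations can be compared directly by computing the weight each token receives. This is a sketch with hypothetical names; unlike the worked example it includes the 1/G factor for the per-response case, which rescales both weights but leaves the 3× ratio between them unchanged.

```python
def per_token_weights(lengths, mode):
    """Weight applied to each token's loss term, one value per response.

    mode="sequence": (1/G) * (1/|o_i|)   per-response normalization
    mode="token":    1 / sum_i |o_i|     token-level normalization
    """
    G = len(lengths)
    if mode == "sequence":
        return [1.0 / (G * n) for n in lengths]
    if mode == "token":
        total = sum(lengths)
        return [1.0 / total for _ in lengths]
    raise ValueError(f"unknown mode: {mode}")

lengths = [4, 12]  # Response A and Response B from the example
seq = per_token_weights(lengths, "sequence")
tok = per_token_weights(lengths, "token")
print("per-response:", seq, "ratio A/B =", seq[0] / seq[1])
print("token-level: ", tok, "ratio A/B =", tok[0] / tok[1])
```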

03

REINFORCE++ — PPO Stability, No Critic

REINFORCE++ takes the opposite design philosophy to DAPO: instead of stripping things from GRPO, it adds PPO's stability tricks back to a clean REINFORCE baseline. The result is a method that matches PPO's training behavior without needing Vψ.

Changes from vanilla REINFORCE
Before: No baseline; the full reward is the advantage signal, so variance is very high
Before: No update constraint; large steps can destroy the policy
After: Group mean baseline (from GRPO), Â_i = r_i − μ_r, for variance reduction
After: Per-token KL penalty β·log(πθ/πref), from PPO, prevents reward hacking
After: Clipped importance ratio r_t(θ) at [1−ε, 1+ε], from PPO, prevents catastrophic steps
After: Advantage normalization across the whole batch rather than within each group, stabilizes gradient scale
Unchanged: No critic Vψ, same as GRPO
\hat{A}_i^{R++} = r_i - \mu_r \quad \text{(no } \div\sigma_r\text{, no critic)}
REINFORCE++ advantage — mean-centered only. No division by standard deviation (avoids the bias Dr. GRPO later identifies).

The KL penalty is the same per-token formulation from PPO: β·log[πθ(a_t|s_t)/πref(a_t|s_t)] subtracted from the reward at each token. This was removed in DAPO; REINFORCE++ keeps it, providing a softer anchor to the reference policy.
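A sketch of that per-token reward shaping, assuming the outcome reward arrives only on the final token. The function name and the value β = 0.1 in the usage line are illustrative choices, not settings from the paper.

```python
def kl_shaped_rewards(outcome_reward, logp_policy, logp_ref, beta):
    """Per-token rewards: -beta * log(pi_theta / pi_ref) at every token,
    plus the scalar outcome reward added on the last token."""
    shaped = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    shaped[-1] += outcome_reward
    return shaped

# Token 1: the policy is more confident than the reference, so it pays a penalty.
# Token 2: identical log-probs, so no penalty, and it carries the outcome reward.
rewards = kl_shaped_rewards(1.0, [-1.0, -2.0], [-1.5, -2.0], beta=0.1)
print(rewards)
```

Tokens where the policy has drifted furthest from the reference pay the largest penalty, which is what makes the anchor soft rather than a hard constraint.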

🔧 Why the KL term matters in practice

Without a KL anchor to the reference policy, the model often collapses onto whatever generic phrasing happened to score well early. You see the same hedging openers (think "It is important to note that..." or "In summary,...") on every prompt, because those phrases never lose points and the policy slowly stops exploring. REINFORCE++ keeps the KL anchor while still avoiding the expensive critic. A good middle ground when you want more conservative training than DAPO.

04

RLOO — Leave-One-Out Baseline

RLOO (REINFORCE Leave-One-Out) makes one targeted fix to GRPO's advantage estimate: the baseline for response i should not include response i itself. GRPO's group mean includes the response being evaluated, introducing a small but measurable bias.

Changes from GRPO
Before: Baseline = μ_r = (1/G) Σ_{j=1..G} r_j, which includes r_i in its own baseline
After: Baseline = (1/(G−1)) Σ_{j≠i} r_j, leave-one-out: r_i is excluded from its own baseline
Unchanged: PPO-CLIP objective
Unchanged: No critic. The KL term is optional.
\hat{A}_i^{RLOO} = r_i - \frac{1}{G-1}\sum_{j \neq i} r_j
RLOO advantage — unbiased estimator. Response i is not used in its own baseline. Provably lower bias than GRPO at small G.
💡 When does this bias actually bite?

Most noticeable at small G (4–6). With G=4, GRPO's mean includes 25% of r_i itself, which is a noisy estimate. RLOO's leave-one-out baseline is based on 3 independent samples. At G=16 the difference is negligible. If you're memory-constrained to G=4–6, RLOO is worth the switch.
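A short sketch comparing the two baselines, with hypothetical function names. It also checks an algebraic identity that follows from the definitions: subtracting the full group mean gives exactly (G−1)/G times the leave-one-out advantage, so without σ normalization the effect of including r_i is a uniform shrinkage that fades as G grows.

```python
def rloo_advantages(rewards):
    """r_i minus the mean of the other G-1 rewards."""
    G = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (G - 1) for r in rewards]

def mean_baseline_advantages(rewards):
    """r_i minus the full group mean (includes r_i itself)."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]  # illustrative binary rewards, G = 4
loo = rloo_advantages(rewards)
full = mean_baseline_advantages(rewards)
G = len(rewards)
for a_loo, a_full in zip(loo, full):
    # Identity: full-mean advantage = (G-1)/G * leave-one-out advantage
    print(f"LOO={a_loo:+.3f}  full-mean={a_full:+.3f}  "
          f"scaled LOO={(G - 1) / G * a_loo:+.3f}")
```

At G=4 the shrinkage factor is 0.75; at G=16 it is about 0.94, which matches the observation that the difference becomes negligible for larger groups.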

05

Dr. GRPO — Remove Normalization Bias

Dr. GRPO identifies a subtle but consequential bug: dividing by σ_r (the group standard deviation) introduces a systematic bias that over-weights low-variance prompts, where nearly every answer scores the same (very easy or very hard), and under-weights prompts with a wide spread of answer quality. In practice, a tiny reward gap on an easy prompt is scaled up to the same advantage magnitude as a large gap on a hard one, so the size of the update no longer reflects the size of the actual difference in reward.

Changes from GRPO
Before: Â_i = (r_i − μ_r) / σ_r, where division by σ inflates advantages on low-variance prompts (very easy or very hard)
After: Â_i = r_i − μ_r, mean-centering only, no σ normalization
Unchanged: All other components identical to GRPO
\hat{A}_i^{DrGRPO} = r_i - \mu_r \quad \text{(no division by } \sigma_r\text{)}
Dr. GRPO advantage: mean-centered but not scaled. Advantage magnitudes track the raw reward gaps, so low-variance prompts are no longer inflated.
⚠️ The σ bias, made concrete

Easy prompt (factual recall, e.g. "What is the capital of France?"): rewards [0.92, 0.90, 0.91, 0.89] → μ=0.905, σ≈0.0112. GRPO: Â_1 = (0.92−0.905)/0.0112 ≈ +1.34
Hard prompt (open-ended reasoning, e.g. "Argue both sides of remote work."): rewards [0.94, 0.61, 0.78, 0.55] → μ=0.72, σ≈0.1525. GRPO: Â_1 = (0.94−0.72)/0.1525 ≈ +1.44
GRPO treats these advantages as roughly equal, but the easy prompt's raw reward gap is tiny (0.015) compared to the hard prompt's (0.22). Dividing by σ artificially magnifies the easy prompt's signal. Dr. GRPO removes that distortion.
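The comparison can be reproduced with the two reward lists from the example. The sketch uses the population standard deviation, matching the 1/G in the GRPO definition; the function names are hypothetical, and the exact figures depend on how σ is rounded.

```python
from statistics import fmean, pstdev

def centered(rewards):
    """Mean-centered advantages only (Dr. GRPO style)."""
    mu = fmean(rewards)
    return [r - mu for r in rewards]

def standardized(rewards):
    """Mean-centered and divided by the group std (GRPO style)."""
    sigma = pstdev(rewards)
    return [a / sigma for a in centered(rewards)]

easy = [0.92, 0.90, 0.91, 0.89]
hard = [0.94, 0.61, 0.78, 0.55]

for name, rewards in [("easy", easy), ("hard", hard)]:
    print(f"{name}: centered={centered(rewards)[0]:+.3f}  "
          f"standardized={standardized(rewards)[0]:+.2f}")
```

The centered values differ by more than an order of magnitude, while the standardized values land close together. That gap is the distortion described above.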

06

GSPO — Sequence-Level Importance Ratio

GSPO makes a deeper structural argument: the probability ratio r_t(θ) = πθ(a_t|s_t)/πold(a_t|s_t) is at the wrong granularity. We apply a token-level correction to a sequence-level reward. The mismatch compounds over long responses, creating high variance — especially in MoE architectures where expert routing can differ between numerator and denominator.

Changes from GRPO
Before: Token-level ratio r_t(θ) = πθ(a_t|s_t) / πold(a_t|s_t), giving T separate ratios per response
Before: Per-token clip applied independently to each r_t
After: Sequence-level ratio s_i(θ) = exp((1/|o_i|) Σ_t log[πθ/πold]), one ratio per full response
After: Single clip applied to s_i, which aligns the optimization unit with the reward unit
Unchanged: Group advantage Â_i, same as GRPO
Unchanged: KL penalty
s_i(\theta) = \exp\!\left(\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\log\frac{\pi_\theta(a_t|s_t)}{\pi_{old}(a_t|s_t)}\right)
GSPO sequence-level ratio: the geometric mean of per-token ratios. One stable number per response instead of T noisy ones.
\mathcal{J}_{GSPO}(\theta)=\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^G \min\!\left(s_i(\theta)\hat{A}_i,\;\text{clip}(s_i(\theta),1-\varepsilon,1+\varepsilon)\hat{A}_i\right)\right]
GSPO objective — identical structure to GRPO/PPO but uses s_i (one ratio per response) instead of r_t (one ratio per token)
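A minimal sketch of the sequence-level ratio, assuming per-token log-probabilities under the new and old policies are available. The token probabilities and ε = 0.2 are made-up values for illustration, and the function names are hypothetical.

```python
import math

def token_ratios(logp_new, logp_old):
    """One importance ratio per token (the GRPO view)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_ratio(logp_new, logp_old):
    """exp of the mean log-ratio: the geometric mean of the token ratios."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

def clipped_sequence_term(s, advantage, eps=0.2):
    """Clipped surrogate applied once per response."""
    clipped = max(1.0 - eps, min(1.0 + eps, s))
    return min(s * advantage, clipped * advantage)

# Illustrative token probabilities for a 3-token response.
logp_old = [math.log(p) for p in (0.20, 0.50, 0.10)]
logp_new = [math.log(p) for p in (0.32, 0.45, 0.11)]

ratios = token_ratios(logp_new, logp_old)  # roughly 1.6, 0.9, 1.1
s = sequence_ratio(logp_new, logp_old)
print("token ratios:", [round(r, 3) for r in ratios])
print("sequence ratio:", round(s, 3))
```

In this example one token ratio sits outside [0.8, 1.2] and would be clipped on its own, while the single sequence-level ratio stays inside the range.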
Figure 4 — Token-Level vs Sequence-Level Ratio: Variance Comparison
A 7-token answer "Dropout randomly zeroes units during training." Left: GRPO computes 7 separate token ratios with large variance; some land at ratio 1.6 (too high, gets clipped), some at 0.5 (too low). Right: GSPO collapses them into a single geometric mean ≈ 1.04, which sits comfortably inside the [0.8, 1.2] clip range. One stable update signal per response.
07

GPG — The Bare Minimum

GPG asks a provocative question: after GRPO removed the critic, DAPO removed the KL penalty and reworked the clip, and GSPO replaced the token-level ratio, what actually has to remain? How much of PPO's machinery is truly necessary when rewards are verifiable?

The answer: just the policy gradient with a group mean baseline. No surrogate objective, no clipping, no KL term, no reference model.

Changes from GRPO
Before: min(r_t·Â, clip(r_t,1±ε)·Â), the clipped surrogate objective
Before: KL divergence penalty β·D_KL(πθ||πref)
Before: πref, the reference model (no longer needed)
After: Direct log-probability gradient: Σ_t log πθ(a_t|s_t) · Â_i
After: No clipping, full gradient at every step
Unchanged: Group mean baseline Â_i = r_i − μ_r, from GRPO
\mathcal{J}_{GPG}(\theta)=\mathbb{E}_{x,\{o_i\}}\!\left[\frac{1}{G}\sum_{i=1}^G\sum_{t=1}^{|o_i|}\log\pi_\theta(a_t|s_t)\cdot\hat{A}_i\right]
GPG objective — classical REINFORCE with group mean baseline. No surrogate loss, no clip, no KL, no reference model.
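As a sketch, the objective above reduces to a few lines. This version only evaluates the scalar objective from given log-probabilities; in real training the log-probs would come from the model and an autodiff framework would supply the gradient. The function name and inputs are illustrative.

```python
def gpg_objective(logps, rewards):
    """(1/G) * sum_i [ sum_t log pi(a_t|s_t) ] * (r_i - mean reward).

    logps: one list of token log-probs per response.
    rewards: one scalar reward per response.
    """
    G = len(rewards)
    mu = sum(rewards) / G
    total = 0.0
    for token_logps, r in zip(logps, rewards):
        total += sum(token_logps) * (r - mu)  # no ratio, no clip, no KL
    return total / G

# Two single-token responses: the rewarded one is the likelier one.
value = gpg_objective([[-1.0], [-3.0]], [1.0, 0.0])
print(value)
```

Raising the log-probability of the above-average response or lowering that of the below-average one both increase the objective, which is the whole update rule.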
⚠️ When to reach for GPG

GPG is competitive with GRPO on math and code benchmarks when rewards are verifiable and dense enough. With a smooth scalar reward and a small learning rate, the missing clip may not hurt much in practice. But with binary 0/1 rewards (the answer is either right or wrong), the lack of clipping can produce unstable updates. DAPO is the safer default; treat GPG as a useful ablation baseline.

08

PRIME — Dense Process Rewards

Every method so far inherits PPO's fundamental reward structure: a single scalar at the end of the sequence. For a 500-token model response, 499 tokens receive zero reward signal. PRIME breaks this entirely. It assigns a dense reward to every reasoning step, with no human annotation and no separate process reward model (PRM) trained on step-level labels.

The Key Insight: The Log-Ratio Is Already a Reward

From the PPO deep dive, we defined the implicit reward from the optimal policy derivation:

r(x,y) = \beta\log\frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta\log Z(x)
Recall from Section 2 of the PPO guide — the reward is implicit in the policy ratio. This is how DPO derived its loss.

PRIME extends this to the token level. At each step t, define the Implicit Process Reward (IPR):

IPR_t = \log\frac{\pi_\theta(a_t|s_t)}{\pi_{ref}(a_t|s_t)}
PRIME implicit process reward: already computed in every PPO/GRPO update, since it is the per-token log-ratio that the KL penalty averages. PRIME reuses it as a dense reward signal.
Changes from GRPO
Before: Sparse reward r̃_t = r(x,y) only at final token t=T, zero elsewhere
Before: PRM requires expensive human step-level annotations to train
After: Dense reward r̃_t = γ·IPR_t + (1−γ)·r(x,y)·[t==T] at every token
After: PRM updated online using only policy rollouts + outcome labels, no human annotations
Unchanged: Group sampling and advantage estimation from GRPO
Unchanged: PPO-CLIP objective structure
\tilde{r}_t^{PRIME} = \underbrace{\gamma \cdot \text{IPR}_t}_{\text{dense process reward}} + \underbrace{(1-\gamma)\cdot r(x,y)\cdot\mathbf{1}[t=T]}_{\text{outcome reward at final step}}
PRIME's modified reward at token t. γ controls the balance between the two terms (here it is a mixing weight, not the discount factor γ used in GAE). The IPR term is free: it's the log-ratio we already compute for the KL penalty.
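The reward formula above translates directly into code. This sketch follows the formula as written in this guide, with hypothetical names and an arbitrary γ = 0.5 in the usage line; it is not taken from the PRIME codebase.

```python
def prime_token_rewards(logp_policy, logp_ref, outcome_reward, gamma):
    """Dense per-token reward: gamma * IPR_t, plus (1 - gamma) * r(x, y)
    on the final token only."""
    T = len(logp_policy)
    rewards = []
    for t, (lp, lr) in enumerate(zip(logp_policy, logp_ref)):
        ipr = lp - lr  # IPR_t = log pi_theta - log pi_ref
        outcome = outcome_reward if t == T - 1 else 0.0
        rewards.append(gamma * ipr + (1.0 - gamma) * outcome)
    return rewards

# Token 1: policy favors it more than the reference (IPR > 0).
# Token 2: policy favors it less (IPR < 0), but it carries the outcome reward.
dense = prime_token_rewards([-1.0, -1.0], [-2.0, -0.5], 1.0, gamma=0.5)
print(dense)
```

With γ = 0 this collapses back to the sparse end-of-sequence reward, so γ acts as a dial between the two regimes.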
Figure 5 — PRIME vs GRPO: Dense vs Sparse Rewards Across the Answer
An answer being written step by step: topic sentence → first claim → supporting citation → second claim → another citation → conclusion. GRPO (bottom row) hands out zero reward to every step except the very last, so the model gets no feedback on whether the middle of the answer was reasonable. PRIME (top row) uses the implicit process reward IPR_t at every step, telling the model immediately whether each new sentence moved it closer to a strong final answer.
🔧 Why dense rewards help a language model

Long-form answers are sequential and compositional. Each sentence builds on the previous ones, and the conclusion is only as good as the claims that led to it. With GRPO's sparse reward, the model only learns "the final answer scored 0.83" — it never finds out whether the topic sentence, the first claim, or the citation was the weak link.

PRIME gives a signal at every step. If the next token is one the trained policy now favors more than the reference policy did, IPR > 0 (we are moving in a direction the reward signal endorses). If it is one the policy is now less confident about, IPR < 0. This step-level feedback dramatically improves credit assignment for tokens early in the response.

09

Full Comparison

Method | Removes | Adds / Changes | Models | Reach for it when…
PPO | | Baseline | | You have a learned reward model and plenty of VRAM
GRPO | Critic Vψ | Group mean baseline | | Good default when rewards are verifiable
DAPO | KL term | Asymmetric clip, dynamic sampling, token-norm | | ⭐ Recommended default: fixes entropy collapse on rare words
REINFORCE++ | Critic Vψ | Adds KL + clip back to REINFORCE | | You want conservative training with a KL anchor
RLOO | Critic Vψ | Leave-one-out baseline (unbiased) | | Small group size G≤6, want an unbiased advantage
Dr. GRPO | σ_r normalization | Mean-only centering | | Training on a mix of easy and hard prompts
GSPO | Token-level ratio | Sequence-level geometric mean ratio | | Long responses (T>500 tokens) or MoE architectures
GPG | Clip, KL, πref | Pure REINFORCE + group mean | | Extremely memory-constrained, smooth reward function
PRIME | Sparse reward | Dense IPR at every step | | ⭐ Highly recommended: step-level feedback for long answers

A Practical Recipe

Figure 6 — Algorithm Decision Flow for LLM Post-Training
Decision tree for choosing your algorithm. The key branching points: do you have a verifiable reward? (yes = RLVR family) Can you define a step-level reward? (yes = PRIME) Are you training on long responses or MoE models? (yes = GSPO). The starred recommendation combines DAPO with PRIME's dense reward signal.
REF

References

  1. Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO). arXiv:2402.03300
  2. DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948
  3. Yu, Q. et al. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476
  4. Hu, J. (2025). REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models. arXiv:2501.03262
  5. Ahmadian, A. et al. (2024). Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback (RLOO). arXiv:2402.14740
  6. Liu, Z. et al. (2025). Understanding R1-Zero-Like Training: A Critical Perspective (Dr. GRPO). COLM 2025. arXiv:2503.20783
  7. Zheng, C. et al. (2025). GSPO: Group Sequence Policy Optimization. arXiv:2507.18071
  8. Chu, X. et al. (2025). GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning. arXiv:2504.02546
  9. Cui, G. et al. (2025). PRIME: Process Reinforcement through Implicit Rewards. arXiv:2502.01456