Complete Technical Reference

Proximal Policy Optimization Explained

Every symbol defined. Every intuition built from scratch. A step-by-step animated walkthrough of how PPO trains a language model, from the first token to the final gradient update.

📖 ~45 min read 📐 18 equations 🎬 6 animations 💻 Full training code 📝 Text examples

Why do we need RL at all?

tl;dr A pretrained LLM is good at predicting the next token. That is not the same as being useful, polite, or correct. PPO is the bridge from "good at next token" to "good at answering humans."

Picture the model you have right now. It saw a few trillion tokens during pretraining. It can finish sentences from books, write code, recite Wikipedia, and even hold a conversation if you nudge it the right way. But it has one big limitation: it was only ever taught to predict the next token. It was never taught that this answer is better than that answer.

Supervised fine tuning (SFT) helps a little. You take a dataset of question and answer pairs written by humans, and you fine tune the model to imitate them. After SFT, the model knows the shape of a good answer. But SFT teaches it the average of your demonstrations, not the best of them. It also gives the model no notion of "this response was better than that one." It only sees one target per prompt.

What we actually want is to push the model toward generations that humans prefer. That is a comparison task. Comparison data is cheap to collect: show two answers and click the better one. Imitation data is expensive: write the full perfect answer yourself.

So the question becomes: how do we use comparison signal to update the weights of a pretrained transformer? The answer is reinforcement learning. PPO is the algorithm behind the classic RLHF pipelines, including the original InstructGPT and ChatGPT training runs.

Figure 1 · Where PPO lives in the LLM lifecycle

Four stages of training a modern LLM. PPO sits at the very end, after pretraining (next-token prediction on the web), SFT (imitation learning on demonstrations), and reward model training (learning a quality score from preferences). Each box pulses in sequence; in real training the arrows are months of compute and a lot of human annotation.

The setup: what you walk in with

tl;dr Before PPO begins you need three things: an SFT-tuned model, a frozen copy of it called the reference, and a reward model. PPO does not need labelled answers. It just needs prompts.

By the time you start PPO, you should already have:

An SFT model. Your pretrained LLM, fine tuned on a dataset of high-quality demonstrations. This is the starting point for the policy that PPO will update, and a frozen copy of it serves as the reference model later on. Call it π_SFT.
A reward model. A separate model that takes a (prompt, response) pair and returns a scalar score. We will see exactly how this is trained in the next section. Call it R(x, y).
A pile of prompts. Just prompts. No answers needed. This is the dataset PPO will train on. The model will generate its own responses on the fly.

That last point is worth pausing on. PPO does not train on a (prompt, target answer) dataset. It only needs prompts. The model produces candidate answers itself during training, scores them with the reward model, and updates its weights based on those scores. That is the fundamental shift from supervised learning.

Our running example

We will use one prompt the entire way through this post. The user asks the model:

What are some healthy breakfast options for someone with diabetes?

And the model produces this response (one of several it could sample):

"Good breakfast options for someone with diabetes include steel-cut oatmeal topped with berries and nuts, plain Greek yogurt with chia seeds, or a vegetable omelette with whole-grain toast. These choices release glucose slowly and pair carbs with protein and fibre."

A reward model scores this response R = 0.91. We will trace what PPO does with that signal, all the way down to the parameter update.

Where does the reward come from?

tl;dr The reward model is a transformer with a scalar head, trained on pairs of responses where a human said which one is better. It learns to predict that preference. PPO then uses its outputs as the reward signal.

The reward signal is the heart of the whole pipeline. Without it, PPO has nothing to optimize. So how do you build a function that takes any (prompt, response) and tells you how good the response is?

Step 1: collect preferences

Take your SFT model. For each prompt in your dataset, sample k different responses by setting the temperature high enough that you get variation. A typical setup is k = 4 to 8 responses per prompt. Now you have a bag of (prompt, response₁, response₂, ..., response_k) tuples.

Show each pair of responses to a human (or sometimes to a stronger LLM acting as a judge) and ask: which one is better? The human clicks. You record the preference. The dataset looks like:

preference_data.jsonl

{
  "prompt": "What are some healthy breakfast options for someone with diabetes?",
  "chosen": "Good options include steel-cut oatmeal with berries, plain Greek yogurt with chia seeds, or a vegetable omelette with whole-grain toast...",
  "rejected": "Just eat whatever you want, breakfast doesn't really matter."
}
{
  "prompt": "Explain why the sky is blue.",
  "chosen": "Sunlight is made of many colors. As it passes through the atmosphere, the shorter blue wavelengths scatter more than the others, so the sky looks blue from the ground.",
  "rejected": "Because of the ocean reflection."
}
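
If annotators give a full ranking of the k responses rather than clicking one pair at a time, the ranking can be expanded into pairwise records of the shape shown above. The sketch below assumes a best-first ranking; the function name `ranking_to_pairs` and the placeholder answers are illustrative, not part of any particular pipeline.

```python
import itertools
import json

def ranking_to_pairs(prompt, ranked_responses):
    """ranked_responses is ordered best first; emit one record per (better, worse) pair."""
    records = []
    for better, worse in itertools.combinations(ranked_responses, 2):
        records.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return records

# Hypothetical ranking of k = 4 sampled responses.
ranked = ["answer A (best)", "answer B", "answer C", "answer D (worst)"]
pairs = ranking_to_pairs("Explain why the sky is blue.", ranked)

# JSONL convention: one JSON object per line.
jsonl = "\n".join(json.dumps(rec) for rec in pairs)
```

With k responses this yields k(k − 1)/2 pairs, so four responses give six training records.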

Step 2: train the reward model

Take a fresh copy of your SFT model. Replace its language modeling head with a single scalar head (one linear layer, output dim 1). Now it takes a (prompt, response) pair and returns one number.

Train it with the Bradley-Terry loss. Given a chosen and a rejected response for the same prompt, the loss wants the chosen one to score higher than the rejected one:

\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_c, y_r) \sim \mathcal{D}}\!\left[\,\log \sigma\!\left( R_\phi(x, y_c) - R_\phi(x, y_r) \right) \right]

Sigmoid of the score gap. Penalizes the reward model when the chosen response scores lower than the rejected one.
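
To make the loss concrete, here is a minimal sketch in plain Python that evaluates it on a handful of made-up scores. The score values are assumptions for illustration; in a real run they would come from the reward model's scalar head.

```python
import math

def bradley_terry_loss(chosen_scores, rejected_scores):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over a batch of pairs."""
    losses = []
    for r_c, r_r in zip(chosen_scores, rejected_scores):
        gap = r_c - r_r
        # -log(sigmoid(gap)) == log(1 + exp(-gap)), written in a numerically stable form.
        losses.append(max(-gap, 0.0) + math.log1p(math.exp(-abs(gap))))
    return sum(losses) / len(losses)

# Hypothetical reward-model scores for preference pairs.
well_ordered = bradley_terry_loss([2.0, 1.5], [-1.0, 0.0])   # chosen scores higher
mis_ordered  = bradley_terry_loss([-1.0, 0.0], [2.0, 1.5])   # chosen scores lower
tied         = bradley_terry_loss([0.3], [0.3])              # zero gap gives log(2)
```

The loss depends only on the score gap, so the reward model is free to shift all of its scores by a constant. That is one reason raw reward values are hard to compare across different reward models.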

After a few epochs, the reward model has internalized human taste, at least the slice of taste captured by your annotations. It can now score any new response on any prompt.

Figure 2 · Reward model training pipeline

Reward model training. The SFT model samples several responses per prompt at high temperature. Humans (or a strong LLM judge) rank them. The reward model is then trained to predict these rankings, so its scores agree with human preference. After PPO begins, only this trained reward model is consulted, not humans.

How is this any better than just training on demonstrations?

Demonstrations tell you what a good answer looks like. Preferences tell you what makes one answer better than another. The second signal is finer-grained and much cheaper to collect: ranking two answers takes a few seconds; writing a perfect answer takes minutes. The reward model is the trick that turns thousands of cheap rank labels into a continuous score function you can apply to any new response.

Translating LLMs into RL language

tl;dr An LLM generating tokens is the same shape as an agent taking actions in an environment. Token = action. Prompt + so-far = state. Full response = trajectory. Reward = score at the end. Once you see this mapping, the rest is just notation.

RL papers use a vocabulary that does not appear anywhere in a transformer paper. Here is the cheat sheet:

RL term	What it actually is for an LLM
policy π_θ	Your transformer. Outputs a probability distribution over the vocabulary at every position.
action a_t	The token chosen at position t. Sampled from π_θ(· | s_t).
state s_t	The prompt x plus every token generated so far: (a₁, ..., a_{t-1}).
trajectory τ	One complete (prompt, response) pair. The whole episode.
horizon T	How many tokens the model generated before stopping. Variable per episode.
reward r_t	Almost always zero for t < T. The reward model only fires at the end of the response.
return G_t	Sum of future rewards from step t. For LLMs with γ = 1 and a reward that fires only at the end, this is just R(x, y) for every t.

Two things are worth dwelling on.

The action space is enormous. Most RL papers picture games with maybe a dozen possible actions. For an LLM the action space is the whole vocabulary, 30,000 to 200,000 tokens depending on tokenizer. That changes the math of exploration: random sampling becomes incredibly weak because nearly every random choice is gibberish.

The reward is sparse. You generate hundreds of tokens. The reward model fires exactly once, at the end. So one scalar reward has to be spread back as a learning signal across the entire sequence. PPO's job is to figure out which tokens deserve credit (or blame) for that final score. This is called credit assignment, and it is the central technical difficulty.
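
The mapping in the table can be written down as a few lines of Python. This is only an illustrative sketch: the token strings are stand-ins for tokenizer IDs, and `rollout_to_transitions` is a hypothetical helper name.

```python
def rollout_to_transitions(prompt_tokens, response_tokens, final_reward):
    """Return a list of (state, action, reward) tuples for one episode."""
    transitions = []
    for t, action in enumerate(response_tokens):
        # State = prompt plus every token generated before this one.
        state = tuple(prompt_tokens) + tuple(response_tokens[:t])
        is_last = (t == len(response_tokens) - 1)
        # Sparse reward: zero everywhere except the final token.
        reward = final_reward if is_last else 0.0
        transitions.append((state, action, reward))
    return transitions

prompt = ["healthy", "breakfast", "?"]
response = ["steel-cut", "oatmeal", "with", "berries", "<eos>"]
steps = rollout_to_transitions(prompt, response, final_reward=0.91)
```

Every transition except the last carries a reward of zero, which is exactly the credit assignment problem: five decisions, one number.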

Figure 3 · One PPO rollout, token by token

One PPO rollout. The prompt enters the policy. Tokens stream out one by one, each conditional on everything that came before. Once the response is finished, it goes to the reward model, which returns a single scalar score. That score is the only learning signal PPO will see for this trajectory.

Why not just SFT on high-reward outputs?

tl;dr Filtering for high-reward samples and running SFT on them (rejection sampling) is a real method and it does work. But it cannot decrease the probability of bad outputs. PPO can.

A natural reaction at this point: "Why not just sample a thousand responses, keep only the ones that score high, and supervised-fine-tune on those?" This is called rejection sampling fine tuning or best-of-N SFT. It works. It is also simpler than PPO. So why do people bother with PPO?

Two reasons:

SFT cannot push probability away from bad outputs. When you fine tune on token sequences, the cross-entropy loss only ever increases the probability of the tokens that appear in your training data. It never explicitly punishes the tokens that should not appear. PPO's clipped objective, in contrast, has positive-advantage and negative-advantage cases: it increases good-token probability and decreases bad-token probability in the same step.
SFT throws away most of your data. If you sample 8 responses and keep only the top 1, you discarded 7/8 of the compute spent generating. PPO uses all 8 responses, weighted by how good or bad they were. This is a much denser learning signal per dollar of inference.

That said, hybrids are common in practice. Many production pipelines do rejection sampling SFT first to get a strong starting point, then run PPO from there. The combination is more reliable than either method alone.
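
For comparison, the filtering step of best-of-N SFT is tiny. The sketch below uses invented responses and scores; `best_of_n` is a hypothetical helper, not a library function.

```python
def best_of_n(scored_responses, keep=1):
    """Keep the `keep` highest-scoring (response, score) pairs; the rest are discarded."""
    ranked = sorted(scored_responses, key=lambda pair: pair[1], reverse=True)
    return ranked[:keep]

# Hypothetical reward-model scores for four sampled responses.
samples = [
    ("oatmeal with berries", 0.91),
    ("eat whatever", -0.40),
    ("greek yogurt", 0.75),
    ("skip breakfast", -0.10),
]
sft_targets = best_of_n(samples, keep=1)
discarded_fraction = 1 - len(sft_targets) / len(samples)
```

Note what the filter throws away: the negative scores never reach the loss, so nothing in the update tells the model to avoid "eat whatever".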

Policy gradient: the foundation

tl;dr To maximize expected reward, take the gradient of log probability of each action and weight it by the reward of the whole trajectory. This is REINFORCE. It is unbiased and badly behaved.

The goal is to maximize the expected reward our policy gets across the prompt distribution:

J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_\theta(\cdot \mid x)}\left[ R(x, y) \right]

Expected reward, average over prompts × sampled responses.

We cannot just take ∇J(θ) directly because the expectation is over samples from π_θ itself, and the sampling depends on θ. The policy gradient theorem handles this with the log-derivative trick, ∇π = π ∇ log π, which turns the gradient into an expectation we can estimate from samples:

\nabla_\theta J(\theta) = \mathbb{E}_{x, y \sim \pi_\theta}\!\left[\,R(x, y) \cdot \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]

REINFORCE. Reward times the sum of log-probability gradients along the trajectory.

The intuition is exactly what it looks like: multiply the reward by the gradient of log-probability of the action you took, and that is the direction to push your parameters in. If R is positive, the gradient step makes those actions more likely. If R is negative, less likely.

For the running example, the trajectory τ is the token sequence of the answer about oatmeal (treated here as roughly seven tokens for illustration; the full text tokenizes to many more). R(τ) = 0.91. The gradient step nudges the model toward producing those tokens (in that context) more often. Simple in principle.
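
A toy version makes the formula checkable by hand. The sketch below assumes a one-step "policy" that is just a softmax over three logits, with an invented reward per token. It computes the exact expectation of R · ∇ log π by enumerating all actions, so there is no sampling noise.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    """Gradient of log pi(action) with respect to the logits: onehot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def expected_reinforce_gradient(logits, rewards):
    """Exact expectation of R(a) * grad log pi(a), enumerating every action."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p in enumerate(probs):
        g = grad_log_prob(probs, a)
        for i in range(len(grad)):
            grad[i] += p * rewards[a] * g[i]
    return grad

def expected_reward(logits, rewards):
    return sum(p * r for p, r in zip(softmax(logits), rewards))

logits = [0.0, 0.0, 0.0]          # uniform policy over three toy "tokens"
rewards = [0.91, 0.10, -0.50]     # hypothetical reward for each token
grad = expected_reinforce_gradient(logits, rewards)

# One small ascent step along the gradient should raise the expected reward.
step = 0.1
new_logits = [z + step * g for z, g in zip(logits, grad)]
```

The logit of the best token gets a positive gradient and the logit of the worst token a negative one, which is the "make rewarded actions more likely" intuition in numbers.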

Why REINFORCE breaks in practice

Two reasons. Variance: a single trajectory's reward can swing wildly from one sample to the next, so the gradient estimate is noisy. You need huge batches or many updates to average it out. Sample inefficiency: as soon as you take one gradient step, your old samples are technically off-policy and you should throw them out. Generating a fresh batch every step is wildly expensive.

The next two sections fix variance (with baselines and advantages) and sample inefficiency (with importance sampling and clipping). Those two fixes together are PPO.

Baselines and advantages

tl;dr Subtract a baseline from the reward before weighting the log probability. The same gradient in expectation, much lower variance. The best baseline is a value function, which gives you the advantage.

Here is the trick. The policy gradient is unchanged in expectation if you subtract any quantity that does not depend on the action:

\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{t=1}^T \left(R(\tau) - b(s_t)\right) \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]

Same expectation as REINFORCE, but the variance drops dramatically with a good choice of b(s).

Why is the expectation unchanged? Because the expectation of (∇ log π · b(s)) is zero whenever b(s) doesn't depend on the action — a standard property called the score function identity. So subtracting any state-only term is a free variance reduction.
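
Written out for a discrete action space (which is what a vocabulary is), the identity takes one line. The only ingredients are ∇ log π = ∇π / π and the fact that probabilities sum to one:

\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \right] = b(s) \sum_a \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0

The baseline factors out of the sum because it does not depend on a. That is the one property the argument needs.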

A good baseline makes (R − b) as small as possible while still being state-only. The standard, near-optimal choice is the value function: the expected return from state s under the current policy. We train a network (the critic) to estimate it.

V_\psi(s) \approx \mathbb{E}_{\pi_\theta}\!\left[ R(\tau) \mid s_t = s \right]

Value function. Predicts how much reward we expect from this state, on average.

And then the advantage is the gap:

A_t = R(\tau) - V_\psi(s_t)

Did this trajectory exceed our expected return from state s_t? If yes, A > 0 and we should boost the action that led here.
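
The variance claim can be checked exactly on a toy problem. The sketch below assumes a one-step softmax policy over three actions with invented, all-positive rewards, and enumerates every action so that both the mean and the variance of the estimator are exact rather than sampled.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_stats(logits, rewards, baseline):
    """Exact mean and total variance of (R(a) - b) * grad log pi(a) under a ~ pi."""
    probs = softmax(logits)
    n = len(probs)
    samples = []
    for a in range(n):
        # grad log pi(a) with respect to the logits is onehot(a) - probs.
        g = [((1.0 if i == a else 0.0) - probs[i]) * (rewards[a] - baseline) for i in range(n)]
        samples.append(g)
    mean = [sum(probs[a] * samples[a][i] for a in range(n)) for i in range(n)]
    variance = sum(
        probs[a] * sum((samples[a][i] - mean[i]) ** 2 for i in range(n))
        for a in range(n)
    )
    return mean, variance

logits = [0.2, -0.1, 0.4]
rewards = [0.91, 0.70, 0.55]   # hypothetical rewards, all positive
value = sum(p * r for p, r in zip(softmax(logits), rewards))  # V(s) = expected reward

mean_plain, var_plain = estimator_stats(logits, rewards, baseline=0.0)
mean_base,  var_base  = estimator_stats(logits, rewards, baseline=value)
```

Both estimators have the same mean, but the one that subtracts V(s) has much lower variance here, because the raw rewards are all positive and mostly tell the policy "everything was good".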

Two networks now. The actor (your LLM with its usual LM head) generates tokens. The critic (your LLM with a scalar head bolted on) estimates V. Many implementations share one transformer body between them to save memory, so only the heads differ; others keep the critic as a separate model.

The critic is trained with plain regression against actual returns:

\mathcal{L}_V(\psi) = \mathbb{E}_t\!\left[ \left( V_\psi(s_t) - R_t \right)^2 \right]

Critic regression target. R_t is the return actually observed from step t onward; we want V to predict it as accurately as possible.

Off-policy and importance sampling

tl;dr Generating fresh rollouts after every gradient step is too expensive. We want to update the policy several times on each rollout. To do that safely, we weight each sample by the ratio of new policy to old policy probabilities. That ratio can blow up; PPO's job is to keep it tame.

If you stick to REINFORCE-with-baseline, you must regenerate the entire batch of trajectories after every single gradient step. That is incredibly slow when generating one trajectory costs hundreds of forward passes through a 7B-parameter model.

The workaround is importance sampling. You collect rollouts with a frozen snapshot of the policy, call it π_old. You then run several gradient updates on the current policy π_θ, using the same rollouts. To compensate for the mismatch (the data was sampled from π_old, but you are training π_θ), you weight each token by the ratio:

\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{old}(a_t \mid s_t)}

Probability ratio. If π_θ and π_old give the same probability to a_t, ρ = 1. If π_θ became much more confident in a_t, ρ > 1. Less confident, ρ < 1.

The surrogate objective becomes:

\mathcal{L}^{IS}(\theta) = \mathbb{E}_t\!\left[ \rho_t(\theta) \cdot \hat A_t \right]

Importance-sampled surrogate. At θ = θ_old, where ρ = 1 everywhere, its gradient equals the vanilla policy gradient.

This surrogate is a good approximation only while π_θ stays close to π_old, that is, while ρ stays near 1. The problem is that ρ can blow up. If the new policy starts assigning much higher probability to some token than the old one did, ρ could become 5, 10, 100. The estimator's variance explodes. Worse, a few outlier ratios can dominate the gradient.
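
A small numerical sketch shows both sides of the ratio: how reweighting recovers an expectation under the new policy from the old policy's distribution, and how quickly ρ grows when log-probabilities drift. The distributions and per-action values below are invented for illustration.

```python
import math

def importance_ratio(new_logprob, old_logprob):
    """rho = pi_theta(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(new_logprob - old_logprob)

# Toy distributions over three tokens (hypothetical numbers).
pi_old = [0.5, 0.3, 0.2]
pi_new = [0.4, 0.3, 0.3]
advantage = [1.0, -0.5, 2.0]   # any per-action quantity

ratios = [importance_ratio(math.log(n), math.log(o)) for n, o in zip(pi_new, pi_old)]

# Expectation under pi_new, computed directly ...
direct = sum(p * f for p, f in zip(pi_new, advantage))
# ... and recovered from pi_old's distribution by reweighting with rho.
reweighted = sum(p * r * f for p, r, f in zip(pi_old, ratios, advantage))

# A token whose log-probability rose by 4.6 nats gets a ratio just under 100.
runaway = importance_ratio(-0.4, -5.0)
```

Working in log space, as here, is the usual design choice: the ratio is a difference of log-probabilities followed by one exponential, which avoids dividing two tiny numbers.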

This is the exact problem PPO solves.

The clipped surrogate, finally

tl;dr Take the importance-weighted objective. Clip the ratio to [1 − ε, 1 + ε] inside the objective, so there is nothing to gain from pushing it further out. Take the minimum of clipped and unclipped versions. That is it. That is PPO.

The clipped objective:

\mathcal{L}^{CLIP}(\theta) = \mathbb{E}_t\!\left[\,\min\!\big(\rho_t(\theta) \cdot \hat A_t,\;\; \text{clip}\big(\rho_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big) \cdot \hat A_t \big)\,\right]

The center of PPO. A pessimistic lower bound on the unclipped objective; no incentive to push ρ outside [1 − ε, 1 + ε].

That min looks strange at first. Why is it there? Walk through the four cases:

A > 0 (good action) and ρ < 1 + ε

The new policy hasn't pushed too hard yet. The unclipped term is in effect: gradient pushes ρ up further, making the action more likely.

A > 0 and ρ ≥ 1 + ε

The new policy has already increased this action's probability by more than ε. The clip caps the objective. Pushing ρ higher gives zero extra reward, so the gradient stops.

A < 0 (bad action) and ρ > 1 − ε

The new policy hasn't pushed this action down enough yet. The unclipped term lets the gradient continue pushing ρ down.

A < 0 and ρ ≤ 1 − ε

The new policy already decreased this action's probability by more than ε. The clip caps it. The gradient stops driving it lower.

The min makes the clip one-sided: it only removes the reward for moving ρ further in the direction the advantage already favors. We are still happy to take gradient steps that pull the ratio back toward 1 (corrective steps), but we won't take ones that push it further away.
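
The four cases, plus the two corrective ones, can be verified numerically. The sketch below evaluates the per-token objective in plain Python and measures its slope with respect to ρ by finite differences; ε = 0.2 and the advantage values of ±1 are arbitrary choices for illustration.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-token PPO objective: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def slope_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference slope of the objective with respect to rho."""
    up = clipped_surrogate(ratio + h, advantage, eps)
    down = clipped_surrogate(ratio - h, advantage, eps)
    return (up - down) / (2 * h)

cases = {
    "A>0, rho inside":   slope_wrt_ratio(1.10, +1.0),  # unclipped: slope = A
    "A>0, rho too high": slope_wrt_ratio(1.50, +1.0),  # clipped: slope = 0
    "A<0, rho inside":   slope_wrt_ratio(0.90, -1.0),  # unclipped: slope = A
    "A<0, rho too low":  slope_wrt_ratio(0.50, -1.0),  # clipped: slope = 0
    "A<0, rho too high": slope_wrt_ratio(1.50, -1.0),  # corrective: slope = A
    "A>0, rho too low":  slope_wrt_ratio(0.50, +1.0),  # corrective: slope = A
}
```

The last two entries are the corrective steps: when the ratio has drifted the wrong way relative to the advantage, the min picks the unclipped branch and the gradient still flows.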

Figure 4 · The clipped surrogate, visualized

PPO's clipped surrogate for the A > 0 case. The dashed red line is the unclipped objective ρA. The solid green line is clip(ρ, 1−ε, 1+ε)·A. The actual PPO objective is the minimum of the two, which forms the kinked shape shown. Once ρ rises above 1 + ε, the slope is zero: the policy has no incentive to keep pushing in that direction, even though the unclipped objective would keep rewarding it. Below 1 − ε the minimum follows the unclipped line, so steps that bring ρ back toward 1 are still allowed. The mirror image holds when A < 0.

The KL penalty: anchoring to the SFT model

tl;dr Even with the clip, PPO can drift the policy far from anything sensible by exploiting the reward model. We add a KL penalty against a frozen reference (the SFT model) so the trained policy cannot wander too far.

The reward model is not perfect. It was trained on a finite set of preferences. If PPO is allowed to optimize hard enough, it will find weird outputs that the reward model accidentally rates highly — outputs no human would actually prefer. This is called reward hacking, and it is the single most common failure mode in RLHF.

The fix is to penalize the policy whenever it strays from the original SFT model. We add a per-token KL term:

r_t^{\text{shaped}} = r_t - \beta \cdot \mathrm{KL}\!\big[\pi_\theta(\cdot \mid s_t)\ \|\ \pi_{ref}(\cdot \mid s_t)\big]

Subtract a KL penalty against the frozen reference at every token. β controls how tight the leash is.
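
In practice the full KL over the vocabulary is usually replaced by a cheap single-sample estimate: the log-probability gap on the token that was actually sampled. The sketch below assumes that estimator and uses invented log-probabilities; `shaped_rewards` is a hypothetical helper name.

```python
def shaped_rewards(policy_logprobs, ref_logprobs, final_reward, beta=0.05):
    """Per-token reward: -beta * (log pi_theta - log pi_ref), plus the score on the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += final_reward   # the reward model fires only at the end
    return rewards

# Hypothetical log-probabilities of four sampled tokens under each model.
policy_lp = [-1.0, -0.5, -2.0, -0.3]
ref_lp    = [-1.0, -1.5, -2.0, -0.8]
r = shaped_rewards(policy_lp, ref_lp, final_reward=0.91, beta=0.1)
```

Tokens where the policy agrees with the reference cost nothing. Tokens the policy has made more likely than the reference did pay a small tax, and the tax accumulates over the sequence.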

The full PPO objective is then:

\mathcal{L}(\theta, \psi) = \mathcal{L}^{CLIP}(\theta) - c_v \cdot \mathcal{L}_V(\psi) + c_e \cdot \mathcal{H}[\pi_\theta] - \beta \cdot \mathrm{KL}\big[\pi_\theta\,\|\,\pi_{ref}\big]

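Clip surrogate + value regression + small entropy bonus − KL leash. Maximized over θ, ψ. If the KL penalty is already folded into the per-token reward, as in the training loop below, it is not added again as a separate term.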

Four terms, three loss coefficients to tune in practice:

c_v: weight on the critic loss. Usually 0.5 to 1.0.
c_e: tiny entropy bonus that discourages collapse. Usually 0.001 to 0.01.
β: how hard you pull back toward the SFT model. The most consequential hyperparameter. Too small and the model reward-hacks. Too large and it never learns anything new. Typical range: 0.01 to 0.2.
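
Because β is so sensitive, some RLHF codebases adapt it during training instead of fixing it. The sketch below is one such scheme, a proportional controller in the style described in early RLHF work on fine-tuning language models from human preferences. The class name, the target KL of 6.0 and the horizon are illustrative assumptions, not recommended settings.

```python
class AdaptiveKLController:
    """Proportional controller: raise beta when KL is above target, lower it when below."""

    def __init__(self, init_beta=0.1, target_kl=6.0, horizon=10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon   # larger horizon means slower adjustment

    def update(self, observed_kl, n_samples):
        # Relative error, clipped so one noisy batch cannot swing beta too far.
        error = max(-0.2, min(observed_kl / self.target_kl - 1.0, 0.2))
        self.beta *= 1.0 + error * n_samples / self.horizon
        return self.beta

too_loose = AdaptiveKLController(init_beta=0.1, target_kl=6.0, horizon=1000)
raised = too_loose.update(observed_kl=12.0, n_samples=100)    # KL above target

too_tight = AdaptiveKLController(init_beta=0.1, target_kl=6.0, horizon=1000)
lowered = too_tight.update(observed_kl=1.0, n_samples=100)    # KL below target
```

The clipping of the error term is the important design choice: it bounds how much β can move per update, which keeps the controller from oscillating on noisy KL estimates.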

Intuition The reference policy is the "best version of you" that you trust. PPO is your attempt to do better, but you don't want to disagree with yourself by more than a few percent on any token. The KL term is the inner voice saying "are you sure you're being reasonable here?"

GAE: smoother advantages

tl;dr The simplest advantage estimate is high-variance. The simplest TD-error is high-bias. GAE interpolates between them with one parameter (λ) and turns out to be the only credit-assignment trick you need.

Advantage estimation has a knob: how many future steps do we look at when computing A_t?

Look at just one step: A_t ≈ r_t + γ V(s_{t+1}) − V(s_t). Low variance, biased by V's mistakes.
Look all the way to the end: A_t = R(τ) − V(s_t). Unbiased, but high variance because R(τ) is one noisy number.

GAE is the exponentially-weighted average of all the in-between options. Define the one-step TD error:

\delta_t = r_t + \gamma V_\psi(s_{t+1}) - V_\psi(s_t)

TD error: how much better the realized reward + next-state value was than the predicted value at s_t.

Then GAE is:

\hat A_t^{GAE(\gamma, \lambda)} = \sum_{\ell = 0}^{T - t} (\gamma\,\lambda)^\ell\, \delta_{t + \ell}

GAE: exponentially weighted sum of future TD errors. λ = 0 → one-step TD. λ = 1 → Monte Carlo.

For LLMs the typical values are γ = 1.0 (we don't discount future tokens within one response) and λ = 0.95 (heavily weight near-term TD errors, but not exclusively). The recursive form makes it cheap to compute backward in one pass:

gae.py

import torch


def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """
    rewards: tensor of shape [B, T]    per-token rewards (mostly zero plus KL penalties)
    values:  tensor of shape [B, T+1]  V(s_t) for every state, including terminal
    returns: advantages [B, T], returns [B, T]
    """
    T = rewards.shape[1]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(rewards[:, 0])   # running GAE term, one per sequence
    for t in reversed(range(T)):
        # One-step TD error at position t.
        delta    = rewards[:, t] + gamma * values[:, t + 1] - values[:, t]
        # Recursion: A_t = delta_t + gamma * lambda * A_{t+1}.
        last_gae = delta + gamma * lam * last_gae
        advantages[:, t] = last_gae
    # Critic regression targets: advantage plus the value it was measured against.
    returns = advantages + values[:, :T]
    return advantages, returns

One additional step in practice: normalize the advantages per batch (subtract mean, divide by standard deviation) before plugging them into the clip objective. This decouples the learning rate from the absolute scale of rewards, which is a free stability win.
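
A tiny hand-checkable example helps confirm the two limits of λ and the normalization step. The sketch below is a list-based version of the same backward recursion, written without any tensor library; the critic values are invented, and the terminal value is set to zero.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """rewards has T entries; values has T + 1 entries (the last is the terminal value)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def normalize(xs, eps=1e-8):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / (std + eps) for x in xs]

rewards = [0.0, 0.0, 0.0, 0.91]            # reward only on the last token
values  = [0.50, 0.60, 0.70, 0.80, 0.0]    # hypothetical critic outputs; terminal value is 0

monte_carlo  = gae(rewards, values, lam=1.0)   # equals R - V(s_t) at every step
one_step     = gae(rewards, values, lam=0.0)   # equals the TD error delta_t
blended      = gae(rewards, values, lam=0.95)
standardized = normalize(blended)
```

With λ = 1 the advantages are 0.41, 0.31, 0.21, 0.11, which is 0.91 minus each critic value. With λ = 0 they collapse to the TD errors 0.1, 0.1, 0.1, 0.11. The λ = 0.95 result sits between the two.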

Figure 5 · GAE backward pass on the running example

GAE on the running example. The reward fires almost entirely at the last token (r_7 ≈ 0.91). The recursion sweeps right to left, smearing that signal back across earlier tokens with an exponentially decaying weight. Every token now has its own advantage, even though the reward only arrived at the end.

The full training loop, end to end

tl;dr Sample rollouts with π_old → score them → compute advantages with GAE → run K inner epochs of clipped updates → sync π_old. Repeat for N outer steps. That is PPO in its entirety.

This is the loop that almost every RLHF training script implements. A few hundred lines of bookkeeping around the math we just walked through. Read it once carefully, and you have the full picture.

ppo_loop.py

import copy

import torch
import torch.nn.functional as F
from torch.optim import AdamW

# Placeholders, not real library calls: load_sft, load_reward_model, ValueHead,
# sample_prompts, shuffle_minibatches, the .log_probs / .entropy helpers, and the
# hyperparameters (num_steps, batch_size, beta, eps, c_v, c_e, K_epochs).

# ─── Models ────────────────────────────────────────────────────────────
policy     = load_sft()                            # π_θ, being trained
ref        = load_sft().requires_grad_(False)      # π_ref, frozen leash
value_head = ValueHead(policy)                     # V_ψ, scalar head on policy body
reward_m   = load_reward_model().eval()            # frozen reward model
old        = copy.deepcopy(policy).eval()          # π_old, frozen during rollout

# Train the policy and the value head together; deduplicate shared parameters.
params    = list({id(p): p for p in [*policy.parameters(), *value_head.parameters()]}.values())
optimiser = AdamW(params, lr=1e-6)

# ─── Outer loop ────────────────────────────────────────────────────────
for step in range(num_steps):

    # ─ Phase 1: rollout ─────────────────────────────────────────────
    prompts   = sample_prompts(batch_size)
    with torch.no_grad():
        responses    = old.generate(prompts, max_new_tokens=512, temperature=1.0)
        old_logprobs = old.log_probs(responses)          # [B, T]
        ref_logprobs = ref.log_probs(responses)          # [B, T]
        values       = value_head(responses)             # [B, T+1]
        values[:, -1] = 0.0                              # terminal state: nothing left to earn
        R            = reward_m(prompts, responses)      # [B] task reward (one number per response)

        kl_token     = old_logprobs - ref_logprobs       # [B, T]   per-token KL estimate
        per_tok_r    = -beta * kl_token                  # KL shaping at every step
        # Assumes every response fills all T positions. With padding, add R at each
        # sequence's last real token and mask padded positions in everything below.
        per_tok_r[:, -1] += R                            # task reward fires only at end

    # ─ Phase 2: advantages via GAE ──────────────────────────────────
    advantages, returns = compute_gae(per_tok_r, values, gamma=1.0, lam=0.95)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # ─ Phase 3: K inner epochs of clipped updates ───────────────────
    for epoch in range(K_epochs):                       # K = 1 to 4 typically
        for batch in shuffle_minibatches(...):
            new_logprobs = policy.log_probs(batch.responses)
            ratio        = (new_logprobs - batch.old_logprobs).exp()

            unclipped = ratio * batch.advantages
            clipped   = ratio.clamp(1 - eps, 1 + eps) * batch.advantages
            loss_pg   = -torch.min(unclipped, clipped).mean()

            new_values = value_head(batch.responses)[:, :-1]   # drop terminal state to match returns [B, T]
            loss_v     = F.mse_loss(new_values, batch.returns)
            loss_ent   = -policy.entropy(batch.responses).mean()

            loss = loss_pg + c_v * loss_v + c_e * loss_ent
            optimiser.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(params, 1.0)
            optimiser.step()

    # ─ Phase 4: sync π_old to current π_θ ──────────────────────────
    old.load_state_dict(policy.state_dict())

Figure 6 · The PPO loop, visually

The full PPO outer loop. The yellow bead traces one outer step: rollout, score, advantage, update, sync. The update stage (Phase 3 in the code above) actually runs K times inside one outer step (the inner epochs), making PPO sample-efficient compared to vanilla REINFORCE. Everything else, the gradient norms, the KL controller, the value-loss clipping, lives inside that update block.

What sizes do these things actually have?

A typical RLHF run on a 7B model looks roughly like this:

Batch size: 64 to 256 prompts per outer step.
Max response length: 512 or 1024 tokens.
Inner epochs K: 1 to 4.
Clip ε: 0.1 or 0.2.
KL coefficient β: 0.01 to 0.2, often adapted on the fly.
Learning rate: 1e-6 to 5e-6. Much lower than during SFT.
Number of outer steps: a few thousand, usually less than 100k.

The dominant cost is generation in phase 1, not the gradient updates. People work hard on vLLM-style fast inference engines just for the rollout step.

Practical pitfalls and how to spot them

tl;dr PPO is famous for being finicky. Most failures show up in just two diagnostics: KL against the reference, and the fraction of clipped tokens per minibatch.

Things that will go wrong, eventually, on every PPO run:

Reward keeps climbing, samples get worse

Classic reward hacking. The model has discovered a quirk in the reward model that humans would not endorse. Increase β, retrain the reward model on the new failure cases, or both.

KL spikes and the model collapses

β is too small or your learning rate is too high. Narrow ε (try 0.1 instead of 0.2), drop the learning rate by an order of magnitude, or use an adaptive β controller that tightens when KL is high.

Entropy crashes toward zero

The policy collapsed onto one response. Raise c_e, widen ε so the clip stops biting on every token, or temperature-sample during rollout.

Critic loss diverges

Value targets are scale-unstable. Normalize the rewards or returns the critic regresses on, clip the value loss the same way you clip the policy loss, and double-check that your KL shaping isn't producing huge negative per-token rewards.

Training is stable, reward never moves

Clip is too tight; almost every token already sits at the boundary. Widen ε, or warm up with a few hundred steps of vanilla policy gradient before turning the clip on.

Wallclock is awful

Phase 1 (generation) dominates. Use a fast inference engine for the rollouts, cache the reference and old logprobs once per rollout, and consider whether DPO, which trains directly on preference pairs with no generation step, would do the job.

Diagnostics to log on every step

mean reward, KL(π_θ ‖ π_ref), policy entropy, fraction of clipped tokens, mean |advantage|, value loss, gradient norm, longest response in batch. If any of these surprise you, pause and look at twenty random samples before continuing.
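
Two of these diagnostics are cheap to compute from log-probabilities you already have. The sketch below is a plain-Python illustration with invented numbers; the helper names are hypothetical, and the KL value is the simple single-sample estimate (mean log-probability gap on sampled tokens), not an exact KL over the vocabulary.

```python
import math

def clip_fraction(new_logprobs, old_logprobs, eps=0.2):
    """Share of tokens whose ratio pi_theta / pi_old falls outside [1 - eps, 1 + eps]."""
    ratios = [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]
    outside = [r < 1.0 - eps or r > 1.0 + eps for r in ratios]
    return sum(outside) / len(ratios)

def kl_to_reference(policy_logprobs, ref_logprobs):
    """Mean of (log pi_theta - log pi_ref) on tokens sampled from the policy."""
    gaps = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    return sum(gaps) / len(gaps)

# Hypothetical per-token log-probabilities from one minibatch.
old_lp = [-1.0, -2.0, -0.5, -3.0]
new_lp = [-1.0, -1.5, -0.6, -3.5]
frac = clip_fraction(new_lp, old_lp, eps=0.2)

kl = kl_to_reference([-1.0, -0.5], [-1.5, -0.7])
```

A clip fraction near zero suggests the updates are barely moving the policy, while a fraction that stays high suggests the inner epochs are pushing against the clip on most tokens.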

One last sanity check: read the samples

Every metric can look fine while the actual outputs degrade. Set up an evaluation loop that runs the current policy on a held-out set of prompts every few hundred steps and writes the responses to disk. Read them with your own eyes. PPO has a special talent for finding loopholes the reward model never noticed.


References and further reading

Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347 · the original PPO paper
Schulman, J. et al. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438 · the GAE paper
Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155 · InstructGPT, the first big RLHF result
Stiennon, N. et al. (2020). Learning to summarize from human feedback. arXiv:2009.01325 · the OpenAI summarisation paper that kickstarted modern RLHF
Bai, Y. et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862 · the Anthropic HH paper
Engstrom, L. et al. (2020). Implementation Matters in Deep Policy Gradients. arXiv:2005.12729 · what actually drives PPO's empirical performance
Huang, S. et al. (2023). The N Implementation Details of RLHF with PPO. HuggingFace blog post · the practical bible if you are going to implement this
Ahmadian, A. et al. (2024). Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback. arXiv:2402.14740 · the case against PPO's complexity for LLMs