🎓 Complete Technical Deep-Dive

Post-Training LLMs:
From RL Intuition to GRPO

A no-nonsense, math-heavy, code-heavy guide to every major post-training algorithm — and how to apply them to your LLM.

📖 ~50 min read · 🧮 20+ equations · 💻 5 code snippets · 🎬 5 animations

PPO

Proximal Policy Optimization

DPO

Direct Preference Optimization

GRPO

Group Relative Policy Opt.

RLHF

RL from Human Feedback

Why Post-Training?

Language models are trained in phases. Understanding what each phase does — and what it cannot do — is the foundation for everything that follows.

The Three Phases

Pre-training — "Learn the world"

Trained on trillions of tokens. Learns syntax, semantics, factual knowledge, multimodal representations. Objective: next-token prediction. Output: capable but raw foundation model.

Supervised Fine-Tuning (SFT) — "Learn the task"

Fine-tune on curated (input, output) demonstration pairs. For text: (image, response_text) pairs. Problem: SFT is mode-averaging. It learns the average of all demonstrations, not the best ones. It also suffers from distribution drift and never explores beyond its training data.

Post-Training / RLHF — "Learn to be good"

Use reinforcement learning to shift the model's distribution toward high-quality outputs. Instead of imitating demonstrations, the model explores, gets feedback, and maximizes reward. This is where real alignment happens.

🔧 For Your text Model

SFT teaches your model the task format, but produces the average of your training set — not the most valid or accurate. Post-training with computable rewards (output validity, answer score, operation efficiency) pushes it toward genuinely good outputs. Crucially, your rewards are automated and exact — a massive advantage over domains that need human raters.

Timeline

2017

PPO published

Schulman et al. — stable on-policy RL with clipping. Becomes the workhorse.

2022

InstructGPT

Ouyang et al. scale RLHF+PPO to GPT-3. The 3-phase pipeline (SFT → RM → PPO) is established for production.

2023

DPO

Rafailov et al. — skip the reward model. Optimize preferences directly with a closed-form cross-entropy loss.

2024–25

GRPO + DeepSeek-R1

Shao et al. and DeepSeek-AI — group-relative advantages + rule-based rewards. SOTA reasoning, no learned reward model.

Reinforcement Learning — Core Intuition

RL trains an agent (your LLM) to take actions to maximize cumulative reward. For language models, this maps cleanly onto token generation.

The LLM-as-RL Mapping

Classic RL

Policy π — agent's decision rule
State s — current situation
Action a — choice made
Reward r — feedback signal

For LLMs

Policy π_θ — the language model
State s_t — prompt + tokens so far
Action a_t — next token to generate
Reward r — scalar at end of generation

The Master RLHF Objective

The naive objective (maximize E[r]) causes reward hacking. The solution: add a KL divergence penalty that keeps the trained policy close to the reference (SFT) policy:

\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot|x)}\!\left[r_\phi(x,y)\right]\;-\;\beta\underbrace{\mathbb{E}_{x}\!\left[D_{KL}\!\left(\pi_\theta(\cdot|x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot|x)\right)\right]}_{\text{stay close to reference}}

Master RLHF objective — β ≈ 0.01–0.2 balances reward vs. reference fidelity
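To make the objective concrete, here is a minimal sketch of how it can be estimated from samples. It assumes you already have, for each sampled response, its reward and its sequence log-probability under both the policy and the reference. The function name and the toy numbers are illustrative, not taken from any library.

```python
def kl_regularized_objective(rewards, logp_policy, logp_ref, beta=0.1):
    """Monte Carlo estimate of E[r] - beta * KL(pi_theta || pi_ref).

    Each list holds one entry per response y sampled from pi_theta:
    its reward, log pi_theta(y|x), and log pi_ref(y|x).
    """
    n = len(rewards)
    mean_reward = sum(rewards) / n
    # For y ~ pi_theta, log pi_theta(y|x) - log pi_ref(y|x) is a one-sample KL estimate.
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref)) / n
    return mean_reward - beta * kl_estimate


# Toy numbers: three sampled responses.
value = kl_regularized_objective(
    rewards=[1.0, 0.0, 0.5],
    logp_policy=[-2.0, -3.0, -2.5],
    logp_ref=[-2.5, -3.0, -3.5],
    beta=0.1,
)
```

Raising beta makes the same amount of drift from the reference cost more, which is the reward-versus-fidelity trade-off in the formula.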

💡 Why KL Divergence?

KL(P∥Q) = E_P[log P − log Q] ≥ 0, equals zero only when P = Q. It penalizes distributions that diverge too much from the reference. Without this constraint, the model exploits weaknesses in the reward model and degenerates into gibberish that scores high but is useless.
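A small sketch of the definition for discrete distributions, assuming both are given as probability lists over the same outcomes and that Q is non-zero wherever P is. The helper name is made up for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


reference = [0.5, 0.3, 0.2]
same = kl_divergence(reference, reference)               # identical distributions
drifted = kl_divergence([0.9, 0.05, 0.05], reference)    # policy piled onto one outcome
```

The first value is zero and the second is positive, which is exactly the penalty the objective charges for drifting away from the reference.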

The Closed-Form Optimal Policy

Setting the functional gradient to zero gives the optimal policy in closed form — this equation is the cornerstone of DPO's derivation:

\pi^*(y\mid x)=\frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\left(\frac{r(x,y)}{\beta}\right),\qquad Z(x)=\sum_y\pi_{\mathrm{ref}}(y|x)\exp\!\left(\tfrac{r(x,y)}{\beta}\right)

Optimal policy — reference policy re-weighted by exponentiated reward. Z(x) is the intractable partition function.
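A toy sketch of the closed form, assuming a response set small enough to enumerate (three responses), so Z(x) can actually be summed. The probabilities and rewards are invented for illustration.

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y) / beta) over an enumerable set."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function Z(x); tractable only because the set is tiny
    return [w / z for w in weights], z


ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
tight, _ = optimal_policy(ref_probs, rewards, beta=10.0)  # large beta: stays near the reference
loose, _ = optimal_policy(ref_probs, rewards, beta=0.5)   # small beta: mass moves to high reward
```

With a large beta the result barely moves from the reference. With a small beta most of the mass lands on the highest-reward response.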

Figure 1 — Policy Distribution Shift During RL Training

The solid blue curve is the trained policy's probability distribution over response quality. It shifts right (toward better outputs) and sharpens as training proceeds. The dashed grey curve is the frozen reference policy. The policy learns to put more mass on high-reward outputs while the KL constraint prevents complete drift.

Rearranging for the Reward Signal

Taking the log of the optimal policy expression and rearranging to solve for r(x,y):

r(x,y)=\beta\log\frac{\pi^*(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}+\underbrace{\beta\log Z(x)}_{\text{only depends on }x}

The reward expressed in terms of the policy ratio — Z(x) will cancel in DPO's derivation!
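The rearrangement can be checked numerically. This sketch reuses an invented three-response toy problem: it builds π* from the closed form, inverts it to recover the rewards, and then shows that a reward difference needs no Z(x).

```python
import math

beta = 0.5
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

# Build pi* from the closed form.
weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Invert it: r = beta * log(pi* / pi_ref) + beta * log Z.
recovered = [
    beta * math.log(ps / pr) + beta * math.log(z)
    for ps, pr in zip(pi_star, ref_probs)
]

# A reward difference only needs the two log-ratios; the Z(x) terms cancel.
diff_without_z = (
    beta * math.log(pi_star[2] / ref_probs[2])
    - beta * math.log(pi_star[1] / ref_probs[1])
)
```

Here `recovered` matches `rewards`, and `diff_without_z` equals r(y₃) − r(y₂) without ever touching the partition function.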

🔧 text Reward Design

Your reward can combine: validity (does the response sequence compile and produce a valid answer?), answer accuracy (answer score or F1 vs. target answer), and efficiency (fewer operations). These are all computable without human raters — making GRPO with rule-based rewards the ideal choice for your use case.

Reward Modeling

Most RLHF pipelines need a reward model: a neural network scoring a (prompt, response) pair. This section explains how it's built — and when you can skip it entirely.

Preference Data & Bradley-Terry

Instead of absolute ratings (hard, noisy), we collect pairwise preferences: given two responses y_w (winner) and y_l (loser) to prompt x, which is better? The Bradley-Terry model [1952] converts scores to preference probabilities:

P(y_w\succ y_l\mid x)=\sigma\!\left(r(x,y_w)-r(x,y_l)\right)

Bradley-Terry — only score differences matter; absolute scale is arbitrary
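A minimal sketch of the Bradley-Terry probability, with made-up scores, showing that shifting both scores by the same constant changes nothing.

```python
import math


def preference_probability(score_w, score_l):
    """Bradley-Terry: P(y_w preferred over y_l) = sigmoid(score_w - score_l)."""
    return 1.0 / (1.0 + math.exp(-(score_w - score_l)))


p_equal = preference_probability(1.0, 1.0)     # equal scores: a coin flip
p_a = preference_probability(3.0, 1.0)         # gap of 2
p_b = preference_probability(103.0, 101.0)     # same gap, shifted scale
```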

Training the reward model r_φ(x,y) via maximum likelihood on preference pairs:

\mathcal{L}_{RM}(\phi)=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(r_\phi(x,y_w)-r_\phi(x,y_l)\right)\right]

Reward model loss — binary cross-entropy on pairwise preferences. Monitor preference accuracy (should reach ~70–90%)
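A plain-Python sketch of this loss and the accuracy metric, assuming the reward model has already produced scalar scores for each chosen and rejected response. The function names are illustrative. A real implementation would work on batched tensors and guard against overflow for very negative margins.

```python
import math


def reward_model_loss(scores_chosen, scores_rejected):
    """Mean of -log sigmoid(r_w - r_l) over a batch of preference pairs."""
    losses = []
    for r_w, r_l in zip(scores_chosen, scores_rejected):
        margin = r_w - r_l
        # -log sigmoid(m) == log(1 + exp(-m))
        losses.append(math.log1p(math.exp(-margin)))
    return sum(losses) / len(losses)


def preference_accuracy(scores_chosen, scores_rejected):
    """Fraction of pairs where the chosen response scores strictly higher."""
    correct = sum(r_w > r_l for r_w, r_l in zip(scores_chosen, scores_rejected))
    return correct / len(scores_chosen)


chosen = [2.0, 0.5, 1.0]
rejected = [0.0, 1.0, 1.0]
loss = reward_model_loss(chosen, rejected)
acc = preference_accuracy(chosen, rejected)
```

An untrained model that scores everything equally sits at a loss of log 2 ≈ 0.693, so that is the number to beat.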

RLHF Pipeline

Figure 2 — The Complete RLHF Pipeline

The RLHF loop. PPO requires a learned reward model (purple box). DPO bypasses it. GRPO replaces the scalar reward with a group-normalized advantage, working with either a learned RM or rule-based functions.

PPO — The Classic Workhorse

PPO [Schulman et al., 2017] was the dominant RLHF algorithm until DPO arrived in 2023. Understand it deeply because all subsequent methods are either simplifications of or reactions to PPO's problems.

From REINFORCE to Actor-Critic

The fundamental algorithm is policy gradient. REINFORCE computes:

\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\!\left[R(\tau)\,\nabla_\theta\log\pi_\theta(\tau)\right]

REINFORCE — unbiased but extremely high variance. Unusable without a baseline.
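To see the estimator at work, here is a sketch on a one-step problem with a softmax policy over three actions. It is an assumption-laden toy: it enumerates all actions to compute the expectation exactly, whereas real REINFORCE samples trajectories and is therefore noisy.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, rewards):
    """Exact E_a[ R(a) * grad log pi(a) ] with respect to the logits.

    For a softmax policy, d log pi(a) / d logit_k = 1[a == k] - pi(k).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            score = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += p_a * r_a * score
    return grad


def expected_reward(logits, rewards):
    return sum(p * r for p, r in zip(softmax(logits), rewards))


logits = [0.0, 0.0, 0.0]
rewards = [1.0, 0.0, 0.0]
grad = reinforce_gradient(logits, rewards)
```

The gradient is positive for the rewarded action and negative for the others, and it matches a finite-difference derivative of the expected reward.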

Subtract a learned value function baseline V(s_t) to get the advantage A_t = Q(s_t,a_t) − V(s_t). Estimate advantage using Generalized Advantage Estimation (GAE):

\hat{A}_t^{GAE(\gamma,\lambda)}=\sum_{l=0}^{T-t-1}(\gamma\lambda)^l\,\delta_{t+l},\qquad\delta_t=r_t+\gamma V(s_{t+1})-V(s_t)

GAE [Schulman 2015] — λ=0 gives TD(0) (low var, biased); λ=1 gives Monte Carlo (unbiased, high var)
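GAE is usually computed with a backward recursion over the TD errors. A minimal sketch, assuming a single finished episode and a `values` list that carries one extra bootstrap entry (0.0 at episode end). The numbers are invented.

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation via backward recursion.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T); the last entry is the bootstrap value.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [0.0, 0.0, 1.0]            # sparse reward at the final token
values = [0.2, 0.4, 0.7, 0.0]
td0 = compute_gae(rewards, values, gamma=1.0, lam=0.0)   # one-step TD errors
mc = compute_gae(rewards, values, gamma=1.0, lam=1.0)    # return-to-go minus V(s_t)
```

With λ=0 each advantage is just its own TD error. With λ=1 it becomes the full return minus the value estimate.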

The PPO Clipping Mechanism

Vanilla policy gradient takes large destabilizing steps. PPO clips the probability ratio to a trust region:

r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\mathrm{old}}(a_t\mid s_t)}

L^{CLIP}(\theta)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\;\operatorname{clip}\!\left(r_t(\theta),1-\varepsilon,1+\varepsilon\right)\hat{A}_t\right)\right]

PPO-CLIP objective — ε = 0.2 typical. The min() takes the more conservative (pessimistic) of the two estimates.

Figure 3 — PPO Clipping Mechanism

Left: when advantage A > 0 (good action), we want to increase probability but cap the gain at ratio 1+ε. Right: when A < 0 (bad action), we want to decrease probability but floor the loss at ratio 1−ε. The grey shaded regions are clipped — no gradient flows from there. This prevents overconfident, destabilizing updates.

💡 Intuition for min()

When A > 0: r_t·A says "increase this action's probability." The clip stops us at (1+ε)·A — we can't be too sure. When A < 0: r_t·A says "decrease this action's probability." The clip floors us at (1−ε)·A — we can't be too punishing. The min() is pessimistic: always take the smaller (safer) objective value.
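The two cases can be checked with a few lines. This is a per-sample sketch of the clipped objective with invented ratios and advantages, not a training loop.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-CLIP objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = ppo_clip_term(ratio=1.5, advantage=2.0)     # capped at (1 + eps) * A
loss_floored = ppo_clip_term(ratio=0.5, advantage=-2.0)   # floored at (1 - eps) * A
not_clipped = ppo_clip_term(ratio=0.5, advantage=2.0)     # min() keeps the unclipped value
```

The third case is the asymmetry that matters: when the policy has moved in the wrong direction for a good action, the unclipped term is the smaller one, so the gradient still flows and can correct it.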

PPO for LLMs: The 4-Model Problem

Policy model (actor) π_θ — being trained

Generates responses. Receives gradient updates each step.

Reference model π_ref — frozen

Frozen copy of the initial SFT model. Computes per-token KL penalty. Never receives gradients.

Reward model r_φ — frozen

Scores complete responses with a scalar. Trained separately before RL. Frozen during RL.

Value model (critic) V_ψ — trained simultaneously

Estimates expected future reward from each token position. Must be trained live alongside the policy. This is the most expensive, most unstable, most painful model in the whole pipeline.

The KL penalty is applied as a per-token reward injection:

\tilde{r}_t=\begin{cases}r_\phi(x,y)-\beta\sum_{t'=1}^T\log\frac{\pi_\theta(a_{t'}|s_{t'})}{\pi_{\mathrm{ref}}(a_{t'}|s_{t'})}&t=T\\0&t<T\end{cases}

Reward with KL penalty injected at the final token. Intermediate tokens get 0 reward.
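A sketch of this reward shaping, following the equation above literally: the summed log-ratio penalty is charged at the last token and every earlier token gets zero. It assumes per-token log-probabilities of the sampled tokens are available. The function name is illustrative. Some implementations instead subtract the log-ratio at every token, so treat this as one variant.

```python
def kl_shaped_rewards(final_reward, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards with the whole KL penalty placed on the final token.

    logp_policy / logp_ref: per-token log-probs of the sampled tokens (length T).
    """
    kl_sum = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    shaped = [0.0] * len(logp_policy)
    shaped[-1] = final_reward - beta * kl_sum
    return shaped


shaped = kl_shaped_rewards(
    final_reward=1.0,
    logp_policy=[-1.0, -0.5, -2.0],
    logp_ref=[-1.5, -0.5, -2.5],
    beta=0.2,
)
```

These shaped rewards are what the critic and GAE then turn into per-token advantages.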

⚠️ The Cost of PPO

For a 7B model: 4 × 7B = 28B parameters in VRAM, plus optimizer states, activations, rollout buffers. The critic is notoriously hard to train — if V(s) estimates are wrong, advantages are noisy and the policy diverges. This is the primary motivation for both DPO and GRPO.
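A back-of-the-envelope sketch of the weights-only footprint, assuming bf16 storage (2 bytes per parameter) and four or two separate full-size models. Optimizer states, activations and rollout buffers come on top and are not modeled here.

```python
def weights_memory_gb(num_params_billion, num_models, bytes_per_param=2):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * num_models * bytes_per_param


ppo_weights = weights_memory_gb(7, num_models=4)    # policy, reference, reward, critic
two_model_weights = weights_memory_gb(7, num_models=2)   # policy + reference only
```

Under these assumptions the four-model setup needs 56 GB for weights alone, versus 28 GB for a two-model setup.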

PPO Training Code (TRL)

ppo_training.py

# NOTE: this uses the legacy PPOTrainer API (trl < 0.12, with trainer.generate / trainer.step).
# Later TRL releases replaced it with a different interface, so pin the version before running.
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
import torch

# Policy and critic share one backbone here (value head on top); reference and reward model are separate
model   = AutoModelForCausalLMWithValueHead.from_pretrained("your-sft-model")  # policy + critic
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("your-sft-model")  # reference (frozen)
tokenizer = AutoTokenizer.from_pretrained("your-sft-model")
tokenizer.pad_token = tokenizer.eos_token
reward_pipe = pipeline("text-classification", model="your-reward-model")  # reward model

config = PPOConfig(
    learning_rate=1.4e-5,
    batch_size=16,
    mini_batch_size=4,
    ppo_epochs=4,        # PPO epochs per rollout batch
    lam=0.95,            # GAE lambda
    cliprange=0.2,       # epsilon in PPO clip
    init_kl_coef=0.2,    # beta (KL penalty coefficient)
    early_stopping=True,
    target_kl=0.1,       # with early_stopping: stop updates if KL exceeds this
)

# train_dataset is a placeholder: your tokenized prompts with an "input_ids" column (torch format)
trainer = PPOTrainer(config=config, model=model, ref_model=ref_model,
                     tokenizer=tokenizer, dataset=train_dataset)

for batch in trainer.dataloader:
    query_tensors = [b for b in batch["input_ids"]]
    # Generate: sample from current policy, returning only the newly generated tokens
    response_tensors = trainer.generate(query_tensors, return_prompt=False, do_sample=True,
                                        max_new_tokens=512, temperature=0.7)
    responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
    # Score with reward model
    rewards = [torch.tensor(r["score"]) for r in reward_pipe(responses)]
    # PPO update (computes advantages using value model internally)
    stats = trainer.step(query_tensors, response_tensors, rewards)
    trainer.log_stats(stats, batch, rewards)

DPO — The Elegant Simplification

In 2023, Rafailov et al. published an insight so clean it almost feels like a trick: you don't need a reward model or an RL loop. The reward is already implicit in the ratio between the trained and reference policies.

The Key Insight: Z(x) Cancels

Recall from Section 2: r(x,y) = β log(π*(y|x)/πref(y|x)) + β log Z(x). Now plug this into the Bradley-Terry preference model:

🧮 DPO Derivation — Step by Step

Start with

P(y_w\succ y_l\mid x)=\sigma\!\left(r(x,y_w)-r(x,y_l)\right)

Bradley-Terry model

Substitute r(x,y)

r(x,y_w)-r(x,y_l)=\beta\log\frac{\pi^*(y_w|x)}{\pi_\mathrm{ref}(y_w|x)}-\beta\log\frac{\pi^*(y_l|x)}{\pi_\mathrm{ref}(y_l|x)}+\underbrace{\beta\log Z(x)-\beta\log Z(x)}_{=\,0}

Z(x) cancels!

MLE over dataset

\mathcal{L}_{DPO}(\pi_\theta;\pi_\mathrm{ref})=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_\mathrm{ref}(y_w|x)}-\beta\log\frac{\pi_\theta(y_l|x)}{\pi_\mathrm{ref}(y_l|x)}\right)\right]

Final DPO loss

✨ The Magic: Z(x) Cancels

Z(x) = ∑_y π_ref(y|x) exp(r(x,y)/β) is intractable — you'd need to sum over all possible responses. But in the Bradley-Terry model, only the difference of rewards matters. Since Z(x) is the same for y_w and y_l (same prompt), it cancels exactly. This makes DPO computationally trivial.
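A plain-Python sketch of the final loss for a single preference pair, assuming you already have sequence log-probabilities from the policy and the frozen reference. The numbers are invented. A real trainer does the same thing on batched tensors.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-margin))  # equals -log sigmoid(margin)


at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)         # policy == reference
after_training = dpo_loss(-8.0, -15.0, -10.0, -12.0)   # chosen up, rejected down
```

At initialization both log-ratios are zero, so the loss starts at log 2 ≈ 0.693 and falls as the margin widens.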

What DPO Actually Does

Taking the gradient of the DPO loss shows that training simultaneously:

Increases log π_θ(y_w|x) — chosen response gets more probability mass
Decreases log π_θ(y_l|x) — rejected response gets less probability mass
Scales by error — updates are larger when the model currently assigns higher probability to the rejected response
Uses log-ratios — implicitly regularizes toward the reference policy
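The "scales by error" point comes from a per-example weight in the gradient, σ(r̂(y_l) − r̂(y_w)), where r̂ = β·log(π_θ/π_ref) is the implicit reward. A small sketch with invented log-ratios:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dpo_gradient_weight(chosen_logratio, rejected_logratio, beta=0.1):
    """Weight sigma(r_hat(y_l) - r_hat(y_w)), with r_hat = beta * log(pi_theta / pi_ref)."""
    implicit_reward_chosen = beta * chosen_logratio
    implicit_reward_rejected = beta * rejected_logratio
    return sigmoid(implicit_reward_rejected - implicit_reward_chosen)


# The model currently ranks the pair the wrong way round: large weight.
w_wrong = dpo_gradient_weight(chosen_logratio=-5.0, rejected_logratio=5.0)
# The model already ranks the pair correctly: small weight.
w_right = dpo_gradient_weight(chosen_logratio=5.0, rejected_logratio=-5.0)
```

Pairs the model gets wrong dominate the update, while pairs it already gets right contribute little.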

Figure 4 — DPO Training Dynamics

During DPO training: the log-ratio for the chosen response (green bar) increases above zero (above reference) while the rejected response (red bar) decreases below zero (below reference). The margin between them widens, making the preference more confident. Both start at 0 (identical to reference policy at initialization).

✅ DPO Advantages

Only 2 models (policy + reference). Simple cross-entropy loss. Stable training. No reward model training needed. No RL loop. Much less memory than PPO. Easy to implement with TRL in <20 lines.

⚠️ DPO Limitations

Offline: fixed dataset, no exploration. Requires preference pairs (harder to collect than scalar rewards). Can overfit to data distribution. Distribution shift when reference is stale. Can degrade the quality of preferred responses under certain conditions.

DPO Training Code (TRL)

dpo_text.py

from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import torch

model = AutoModelForCausalLM.from_pretrained(
    "your-sft-llm-model", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("your-sft-llm-model")

# Dataset format: prompt + chosen + rejected
# For text: auto-generate pairs by scoring SFT outputs with your reward function
# chosen  = highest-scoring generated sequence (or ground truth)
# rejected = lowest-scoring generated sequence
N = 8  # placeholder size for this toy dataset; replace with your real preference pairs
dataset = Dataset.from_dict({
    "prompt":   ["Explain the trade-offs between PPO and DPO."] * N,
    "chosen":   ["A clear, well-cited two-paragraph answer..."] * N,  # high score
    "rejected": ["A short, vague answer with no citations..."]   * N,  # low score
})

config = DPOConfig(
    output_dir="./dpo-llm",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,      # conservative LR for DPO
    beta=0.1,                # KL coefficient beta
    loss_type="sigmoid",     # standard DPO; alternatives: "ipo", "hinge"
    max_length=2048,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL auto-creates frozen copy of model
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Key metrics to monitor in wandb/tensorboard:
# rewards/chosen (should increase), rewards/rejected (should decrease)
# rewards/margins = chosen - rejected (should widen)
# logps/chosen, logps/rejected (log probabilities)
# rewards/accuracies (fraction of pairs where chosen outscores rejected; should rise above 0.5)

GRPO — DeepSeek's Modern Workhorse

GRPO (Group Relative Policy Optimization) was introduced in DeepSeekMath [Shao et al., 2024] and became the core of DeepSeek-R1 [DeepSeek-AI, 2025]. It achieves something remarkable: PPO-level online exploration without a critic model.

The Problem GRPO Solves

PPO's critic V_ψ(s_t) estimates expected future reward from each token position. For LLMs this is:

A full LLM-sized model in memory — doubles the compute cost
Notoriously hard to train: text-level value estimates are noisy
The main source of instability and divergence in PPO

GRPO's key insight: If you generate a group of G responses to the same prompt and score them all, the group mean reward is a natural, critic-free baseline estimate. The advantage of each response is simply how much better or worse it is than the group average.

The GRPO Algorithm

Sample G responses from the current policy

For each prompt x, sample: {y₁, y₂, ..., y_G} ~ π_old(·|x). Typical G = 4–16. Higher G = more stable advantage estimates.

Score all G responses

Compute rewards: {r₁, ..., r_G} using your reward function (learned RM or rule-based). For text: validity + accuracy + efficiency.

Normalize rewards within the group

Compute group statistics and normalize. This is the critic replacement: responses above the group mean get positive advantage; below get negative.

PPO-style clipped update

Apply the standard PPO clip objective using the group-normalized advantages. No value model, no GAE, no separate critic training.

\hat{A}_i=\frac{r_i-\mu_r}{\sigma_r},\qquad\mu_r=\frac{1}{G}\sum_{j=1}^G r_j,\quad\sigma_r=\sqrt{\frac{1}{G}\sum_{j=1}^G(r_j-\mu_r)^2}

GRPO advantage — group-normalized reward. Critic-free. Simple. Effective.
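A minimal sketch of the group normalization, using the population standard deviation as in the formula. The small `eps` in the denominator is an added safeguard (an assumption, not part of the formula) for groups where every reward is identical.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Z-score each reward against its own group of G samples."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]


advantages = group_advantages([1.0, 0.0, 0.5, 0.5])
flat = group_advantages([1.0, 1.0, 1.0, 1.0])   # identical rewards: no learning signal
```

A group with identical rewards yields all-zero advantages, so that prompt contributes nothing to the update. This is why reward diversity within the group matters.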

\mathcal{L}_{GRPO}(\theta)=-\frac{1}{G}\sum_{i=1}^G\!\left[\min\!\left(\frac{\pi_\theta(y_i|x)}{\pi_\mathrm{old}(y_i|x)}\hat{A}_i,\;\mathrm{clip}\!\left(\frac{\pi_\theta(y_i|x)}{\pi_\mathrm{old}(y_i|x)},1-\varepsilon,1+\varepsilon\right)\hat{A}_i\right)-\beta\,D_{KL}\!\left(\pi_\theta\|\pi_\mathrm{ref}\right)\right]

GRPO objective — identical to PPO-CLIP but advantages come from group normalization, not a critic

Figure 5 — GRPO Group Sampling & Advantage Normalization

For each prompt, G=8 responses are sampled and scored. Rewards are normalized within the group (z-score). Green bars have positive advantage (above mean → increase their probability). Red bars have negative advantage (below mean → decrease probability). The mean line (amber dashes) is the critic-free baseline.

Rule-Based Rewards: GRPO's Superpower

🔧 Why GRPO + Rule-Based = Perfect for text

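A key lesson from DeepSeek-R1: for domains with verifiable answers, rule-based rewards (math correctness, code execution) are a more reliable training signal than learned reward models, which are prone to reward hacking at scale. Text generation with checkable outputs is exactly this kind of domain: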

Validity (binary): Does the response sequence parse and execute? Does it produce a valid answer body?
Answer accuracy (continuous): reward = exp(−α · answer_score(generated, target))
Topological match: Does the number of snippets/sources/results match expectations?
Efficiency (continuous): reward = max(0, 1 − num_ops / max_ops)

Rule-based rewards are much harder to game than learned reward models. DeepSeek-R1 cites exactly this risk, reward hacking by a neural reward model during large-scale RL, as its reason for relying on rule-based rewards.

GRPO Training Code (TRL)

grpo_llm.py

from trl import GRPOTrainer, GRPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, re, math

# ── text Reward Functions (rule-based, no learned RM needed) ──────────────────

VALID_OPS = {"query","read_doc","revolve","citation","check_answer","boolean_union","boolean_subtract","loft","sweep"}

def parse_ops(text):
    return re.findall(r'(\w+)\([^)]*\)', text)

def validity_reward(completion, **kwargs):
    ops = parse_ops(completion)
    if not ops: return 0.0
    if any(op not in VALID_OPS for op in ops): return 0.0
    return 1.0

def text_reward(completion, reference_text=None, **kwargs):
    ops = parse_ops(completion)
    if not ops or reference_text is None: return 0.5
    try:
        output = run_model(ops)          # your sandbox
        cd   = answer_score(output, reference_text)
        return float(math.exp(-2.0 * cd))   # 1.0 = perfect, 0.0 = far off
    except Exception:
        return 0.0

def efficiency_reward(completion, **kwargs):
    ops = parse_ops(completion)
    return max(0.0, 1.0 - len(ops) / 50.0)  # linear penalty; reaches 0 at 50 operations

def text_reward_fn(completions, prompts=None, reference_text=None, **kwargs):
    """Main GRPO reward function. Returns list[float], one per completion."""
    # TRL passes extra dataset columns (here "reference_text") as lists aligned with completions
    rewards = []
    for i, completion in enumerate(completions):
        ref = reference_text[i] if reference_text is not None else None
        r_valid = validity_reward(completion)
        r_geom  = text_reward(completion, reference_text=ref)
        r_eff   = efficiency_reward(completion)
        rewards.append(0.4 * r_valid + 0.5 * r_geom + 0.1 * r_eff)
    return rewards

# ── Model + Config ────────────────────────────────────────────────────────────

model     = AutoModelForCausalLM.from_pretrained("your-dpo-llm-model", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("your-dpo-llm-model")

config = GRPOConfig(
    output_dir="./grpo-llm",
    num_train_epochs=3,
    # Depending on the TRL version, the generation batch must be divisible by num_generations;
    # raise per_device_train_batch_size if the trainer complains.
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-7,          # lower than DPO; online updates are noisier
    num_generations=8,           # G: group size. Higher = more stable, more compute
    max_completion_length=1024,  # cap on generated tokens per completion
    temperature=0.9,             # need diversity in the group; don't set too low
    beta=0.04,                   # beta for KL penalty
    epsilon=0.2,                 # epsilon for PPO clip (available in recent TRL releases)
    bf16=True,
    gradient_checkpointing=True,
    save_steps=200,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=text_reward_fn,   # accepts list of reward functions too
    args=config,
    train_dataset=train_dataset,  # just needs "prompt" column
    processing_class=tokenizer,
)

# Key training metrics (names are indicative; the exact logged keys vary across TRL versions):
# reward/mean:   group average reward (should increase steadily)
# reward/std:    diversity within group (if -> 0: mode collapse)
# policy/kl:     KL from reference (should stay < 0.5; if spikes: reduce lr)
# policy/entropy: generation diversity (should stay positive)
trainer.train()

The Algorithm Zoo

Beyond PPO, DPO, and GRPO, there's a rapidly expanding ecosystem. Here's the complete map.

Algorithm Family Tree

Comparison Table

Method	Models Needed	Data Type	Online?	Key Formula Idea	Best For
PPO	4× model	Rollouts	Yes	clip(r,1±ε)·A	Complex RM, chat alignment
DPO	2× model	Pref. pairs	No	−log σ(β·Δlog-ratio)	Stable offline training, clean data
GRPO	2× model	Rollouts	Yes	PPO clip + group norm A	Verifiable rewards, reasoning, text ⭐
IPO	2× model	Pref. pairs	No	(Δlog-ratio − 1/(2β))²	Avoids DPO degenerate solutions
KTO	2× model	Unpaired labels	No	Kahneman-Tversky value fn	No paired data; just (good/bad) labels
ORPO	1× model	Pref. pairs	No	SFT + λ·odds-ratio	Minimal memory; SFT + alignment in one pass
SimPO	1× model	Pref. pairs	No	−log σ(β·Δavg-logprob − γ)	No reference model, better length handling

Key Equations for IPO and ORPO

\mathcal{L}_{IPO}=\mathbb{E}\!\left[\left(\log\frac{\pi_\theta(y_w|x)}{\pi_\mathrm{ref}(y_w|x)}-\log\frac{\pi_\theta(y_l|x)}{\pi_\mathrm{ref}(y_l|x)}-\frac{1}{2\beta}\right)^{\!2}\right]

IPO [Azar et al. 2023] — identity loss avoids DPO's degenerate deterministic solution
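A one-pair sketch of the IPO loss as written above, with invented log-ratios. The squared term pulls the log-ratio gap toward a finite target of 1/(2β) instead of pushing it toward infinity.

```python
def ipo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """IPO loss for one pair: (log-ratio gap - 1 / (2 * beta)) squared."""
    gap = chosen_logratio - rejected_logratio
    target = 1.0 / (2.0 * beta)
    return (gap - target) ** 2


on_target = ipo_loss(3.0, -2.0, beta=0.1)     # gap 5.0 matches the target
overshoot = ipo_loss(10.0, -10.0, beta=0.1)   # gap 20.0 is penalized, unlike in DPO
```

Overshooting the target is penalized just like undershooting, which is what stops the policy from collapsing onto a deterministic solution.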

\mathcal{L}_{ORPO}=\mathcal{L}_{SFT}-\lambda\,\mathbb{E}\!\left[\log\sigma\!\left(\log\frac{\mathrm{odds}_\theta(y_w|x)}{\mathrm{odds}_\theta(y_l|x)}\right)\right],\quad\mathrm{odds}_\theta(y|x)=\frac{\pi_\theta(y|x)}{1-\pi_\theta(y|x)}

ORPO [Hong et al. 2024] — no reference model needed; odds ratio as relative preference signal
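A sketch of the odds-ratio term for one pair. It assumes `logp` is a log-probability strictly below zero (in practice typically a length-normalized sequence log-probability, which is an assumption here). The function names are illustrative.

```python
import math


def log_odds(logp):
    """log(p / (1 - p)) computed from log p, for a probability p < 1."""
    return logp - math.log1p(-math.exp(logp))


def orpo_preference_term(logp_chosen, logp_rejected):
    """-log sigmoid(log odds(y_w) - log odds(y_l)); scaled by lambda and added to the SFT loss."""
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    return math.log1p(math.exp(-log_odds_ratio))


good = orpo_preference_term(logp_chosen=-1.0, logp_rejected=-3.0)
bad = orpo_preference_term(logp_chosen=-3.0, logp_rejected=-1.0)
```

Only the policy's own probabilities appear, which is why no reference model has to be kept in memory.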

Practical Guide for Your text Model

Theory done. Here is a concrete action plan for post-training your multimodal LLM.

Which Algorithm Should You Use?

✅ Recommendation: DPO first, then GRPO

Phase 1 — DPO: Collect preference pairs from your SFT model. Generate 2–4 outputs per training image, score with your reward function, take best/worst as chosen/rejected. Train DPO. Fast to set up, very stable, good baseline.

Phase 2 — GRPO: Once the DPO baseline is solid, switch to GRPO with your rule-based text reward. GRPO explores online and will push beyond what offline DPO can achieve. With verifiable text rewards (validity + answer score), this is the strongest approach available.

Training Decision Tree

Do you have automated/verifiable rewards?

Can you programmatically evaluate correctness? (output validity, math, code execution) → YES: Use GRPO. Rule-based rewards outperform learned RMs. You cannot beat them on your own domain.

Only pairwise preference data, no automated reward?

Human raters pick A vs B → DPO or ORPO. Very limited memory (1 GPU)? → ORPO (1 model). Can't collect pairs, only (good/bad) labels? → KTO.

Memory constrained?

Running on 1 GPU? → ORPO or SimPO (1× model). Want online but cheap? → GRPO with LoRA on the policy (ref model stays full precision).

Full Two-Phase Pipeline

full_pipeline.py

"""
Full text Post-Training: Phase 1 (DPO) + Phase 2 (GRPO)
Written against recent TRL and transformers >= 4.45; GRPOTrainer needs trl >= 0.14
"""
from trl import DPOTrainer, DPOConfig, GRPOTrainer, GRPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import torch, re, math

# ═══════════════════════════════════════════════════════
# REWARD FUNCTIONS (used in both phases)
# ═══════════════════════════════════════════════════════
VALID_OPS = {"query","read_doc","revolve","citation","check_answer",
             "boolean_union","boolean_subtract","loft","sweep"}

def text_reward(completions, prompts=None, target=None, **kwargs):
    # `target` is either one reference shared by all completions or a list with one per completion
    targets = target if isinstance(target, list) else [target] * len(completions)
    rewards = []
    for completion, tgt in zip(completions, targets):
        ops = re.findall(r'(\w+)\([^)]*\)', completion)
        if not ops:
            rewards.append(0.0)
            continue
        valid     = all(op in VALID_OPS for op in ops)
        r_valid   = 1.0 if valid else 0.0
        r_quality = quality_score(completion, tgt)   # your impl
        r_eff     = max(0.0, 1.0 - len(ops) / 50.0)
        rewards.append(0.4 * r_valid + 0.5 * r_quality + 0.1 * r_eff)
    return rewards

# ═══════════════════════════════════════════════════════
# PHASE 0: Build DPO Dataset from SFT outputs
# ═══════════════════════════════════════════════════════
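def build_preference_data(sft_model, tokenizer, questions, n_per_prompt=4):
    chosen, rejected, prompts = [], [], []
    for q in questions:
        prompt  = f"{q.text}\nAnswer:"
        inputs  = tokenizer(prompt, return_tensors="pt").to(sft_model.device)
        # Sample n_per_prompt completions in a single generate call
        with torch.no_grad():
            generated = sft_model.generate(**inputs, do_sample=True, temperature=0.9,
                                           max_new_tokens=1024,
                                           num_return_sequences=n_per_prompt)
        # Drop the prompt tokens so only the completion is scored and stored
        prompt_len = inputs["input_ids"].shape[1]
        outputs = tokenizer.batch_decode(generated[:, prompt_len:], skip_special_tokens=True)
        scores  = text_reward(outputs, target=q.reference)
        best  = max(range(n_per_prompt), key=lambda i: scores[i])
        worst = min(range(n_per_prompt), key=lambda i: scores[i])
        if scores[best] > scores[worst] + 0.1:   # only add meaningful gaps
            prompts.append(prompt)
            chosen.append(outputs[best])
            rejected.append(outputs[worst])
    return Dataset.from_dict({"prompt": prompts, "chosen": chosen, "rejected": rejected})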

# ═══════════════════════════════════════════════════════
# PHASE 1: DPO
# ═══════════════════════════════════════════════════════
def phase1_dpo(sft_model_path, dpo_dataset):
    model     = AutoModelForCausalLM.from_pretrained(sft_model_path, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(sft_model_path)
    config    = DPOConfig(
        output_dir="ckpt/dpo", num_train_epochs=2,
        per_device_train_batch_size=2, gradient_accumulation_steps=8,
        learning_rate=5e-7, beta=0.1, max_length=2048,
        bf16=True, gradient_checkpointing=True, warmup_ratio=0.05,
    )
    DPOTrainer(model=model, ref_model=None, args=config,
               train_dataset=dpo_dataset, processing_class=tokenizer).train()
    model.save_pretrained("ckpt/dpo-final")
    tokenizer.save_pretrained("ckpt/dpo-final")   # phase 2 loads the tokenizer from this path
    return "ckpt/dpo-final"

# ═══════════════════════════════════════════════════════
# PHASE 2: GRPO
# ═══════════════════════════════════════════════════════
def phase2_grpo(dpo_ckpt, train_dataset):
    model     = AutoModelForCausalLM.from_pretrained(dpo_ckpt, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(dpo_ckpt)
    config    = GRPOConfig(
        output_dir="ckpt/grpo", num_train_epochs=3,
        per_device_train_batch_size=1, gradient_accumulation_steps=16,
        learning_rate=2e-7, num_generations=8, max_completion_length=1024,
        temperature=0.9, beta=0.04, epsilon=0.2,   # beta: KL coefficient, epsilon: clip range
        bf16=True, gradient_checkpointing=True, save_steps=200,
    )
    GRPOTrainer(model=model, reward_funcs=text_reward, args=config,
                train_dataset=train_dataset, processing_class=tokenizer).train()

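if __name__ == "__main__":
    sft_ckpt   = "your-sft-model"
    questions  = load_text_dataset()               # your data loader; items expose .text and .reference
    dpo_data   = build_preference_data(AutoModelForCausalLM.from_pretrained(sft_ckpt),
                                       AutoTokenizer.from_pretrained(sft_ckpt), questions)
    dpo_ckpt   = phase1_dpo(sft_ckpt, dpo_data)
    # GRPOTrainer needs a "prompt" column; extra columns such as "target" reach the reward function as kwargs
    grpo_data  = Dataset.from_dict({
        "prompt": [f"{q.text}\nAnswer:" for q in questions],
        "target": [q.reference for q in questions],
    })
    phase2_grpo(dpo_ckpt, grpo_data)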

Hyperparameter Starting Points

Setting	DPO	GRPO	Notes
Learning rate	5e-7	2e-7	DPO is offline; GRPO is noisier so go lower
β (KL coeff)	0.1	0.04	Too high = no learning. Too low = KL collapse. Start here.
Batch size	16 effective	16 effective	Use grad accum to hit this on small GPUs
Group size G	N/A	8	Increase to 16 for complex text tasks for more stable advantages
Temperature	N/A	0.9	Need diversity in the group. Don't go below 0.7.
Epochs	2	3	DPO overfits quickly; GRPO benefits from more steps
Max tokens	2048	1024	Keep generated length bounded to control cost

Common Pitfalls & Red Flags

🚨 KL Collapse

KL divergence spikes, generation quality drops, outputs become repetitive. Fix: Reduce LR, increase β, check reward normalization (rewards at wildly different scales cause this).

🚨 Reward Hacking

Reward goes up but actual quality drops. Fix: Combine several complementary reward components so no single one can be exploited, increase β to stay closer to the reference, and inspect sampled outputs by hand.

🚨 Mode Collapse

All G responses in a GRPO group become identical (reward/std → 0). Fix: Increase sampling temperature, reduce the KL coefficient β, add an explicit diversity reward.

🚨 Sparse Reward Signal

All rewards = 0 or 1, no gradient signal in between. Fix: Use smooth rewards like exp(−α·answer_score) instead of binary thresholds. Add partial credit for syntactically valid but semantically wrong outputs.
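A small sketch of why the smooth form helps, assuming the score is a distance where lower is better. The threshold, α and distances are invented for illustration.

```python
import math


def binary_reward(distance, threshold=0.1):
    return 1.0 if distance <= threshold else 0.0


def smooth_reward(distance, alpha=2.0):
    """exp(-alpha * distance): 1.0 at distance 0, decaying smoothly."""
    return math.exp(-alpha * distance)


# A group that is improving but still above the pass threshold.
distances = [0.5, 0.4, 0.3, 0.2]
binary = [binary_reward(d) for d in distances]   # all 0.0: zero variance in the group
smooth = [smooth_reward(d) for d in distances]   # strictly increasing: ranks the group
```

With the binary reward every response in the group scores the same, so group-normalized advantages are all zero. The smooth reward still tells the policy which attempts were closer.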

References

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438
Stiennon, N., Ouyang, L., et al. (2020). Learning to Summarize from Human Feedback. NeurIPS 2020. arXiv:2009.01325
Ouyang, L., Wu, J., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022. arXiv:2203.02155
Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. arXiv:2305.18290
Shao, Z., Wang, P., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300
DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948
Azar, M. G., et al. (2023). A General Theoretical Paradigm to Understand Learning from Human Feedback. arXiv:2310.12036
Ethayarajh, K., et al. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. arXiv:2402.01306
Hong, J., et al. (2024). ORPO: Monolithic Preference Optimization without Reference Model. arXiv:2403.07691
Meng, Y., et al. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward. arXiv:2405.14734
Williams, R.J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (REINFORCE). Machine Learning, 8, 229–256.
Bradley, R.A. & Terry, M.E. (1952). Rank Analysis of Incomplete Block Designs I: The Method of Paired Comparisons. Biometrika, 39(3/4), 324–345.