🎓 Complete Technical Deep-Dive

Post-Training LLMs:
From RL Intuition to GRPO

A no-nonsense, math-heavy, code-heavy guide to every major post-training algorithm — and how to apply them to your LLM.

PPO: Proximal Policy Optimization
DPO: Direct Preference Optimization
GRPO: Group Relative Policy Optimization
RLHF: Reinforcement Learning from Human Feedback
1

Why Post-Training?

Language models are trained in phases. Understanding what each phase does — and what it cannot do — is the foundation for everything that follows.

The Three Phases

1
Pre-training — "Learn the world"
Trained on trillions of tokens. Learns syntax, semantics, factual knowledge, multimodal representations. Objective: next-token prediction. Output: capable but raw foundation model.
2
Supervised Fine-Tuning (SFT) — "Learn the task"
Fine-tune on curated (input, output) demonstration pairs. For a text model: (prompt, response_text) pairs. Problem: SFT is mode-averaging. It learns the average of all demonstrations, not the best ones. It also suffers from distribution drift (at inference the model conditions on its own outputs, which it never saw during training) and does no exploration.
3
Post-Training / RLHF — "Learn to be good"
Use reinforcement learning to shift the model's distribution toward high-quality outputs. Instead of imitating demonstrations, the model explores, gets feedback, and maximizes reward. This is where real alignment happens.
🔧 For Your text Model

SFT teaches your model the task format, but produces the average of your training set — not the most valid or accurate. Post-training with computable rewards (output validity, answer score, operation efficiency) pushes it toward genuinely good outputs. Crucially, your rewards are automated and exact — a massive advantage over domains that need human raters.

Timeline

2017
PPO published
Schulman et al. — stable on-policy RL with clipping. Becomes the workhorse.
2022
InstructGPT
Ouyang et al. scale RLHF+PPO to GPT-3. The 3-phase pipeline (SFT → RM → PPO) is established for production.
2023
DPO
Rafailov et al. — skip the reward model. Optimize preferences directly with a closed-form cross-entropy loss.
2024–25
GRPO + DeepSeek-R1
Shao et al. and DeepSeek-AI — group-relative advantages + rule-based rewards. SOTA reasoning, no learned reward model.
2

Reinforcement Learning — Core Intuition

RL trains an agent (your LLM) to take actions to maximize cumulative reward. For language models, this maps cleanly onto token generation.

The LLM-as-RL Mapping

Classic RL

  • Policy π — agent's decision rule
  • State s — current situation
  • Action a — choice made
  • Reward r — feedback signal

For LLMs

  • Policy πθ — the language model
  • State st — prompt + tokens so far
  • Action at — next token to generate
  • Reward r — scalar at end of generation

The Master RLHF Objective

The naive objective (maximize E[r]) causes reward hacking. The solution: add a KL divergence penalty that keeps the trained policy close to the reference (SFT) policy:

\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot|x)}\!\left[r_\phi(x,y)\right]\;-\;\beta\underbrace{\mathbb{E}_{x}\!\left[D_{KL}\!\left(\pi_\theta(\cdot|x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot|x)\right)\right]}_{\text{stay close to reference}}
Master RLHF objective — β ≈ 0.01–0.2 balances reward vs. reference fidelity
💡 Why KL Divergence?

KL(P∥Q) = E_P[log P − log Q] ≥ 0, and it equals zero only when P = Q. It penalizes policies that diverge too far from the reference. Without this constraint, the model exploits weaknesses in the reward model and degenerates into gibberish that scores high but is useless.
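To make the two properties above concrete (zero when the distributions match, positive otherwise), here is a minimal sketch for small discrete distributions. The three-outcome distributions are made-up illustration values, not anything from a real model.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as probability lists."""
    # Terms with p_i == 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three possible responses
reference = [0.5, 0.3, 0.2]
unchanged = [0.5, 0.3, 0.2]    # policy identical to the reference
drifted   = [0.9, 0.05, 0.05]  # policy that collapsed onto one response

print(kl_divergence(unchanged, reference))  # zero: no penalty
print(kl_divergence(drifted, reference))    # positive: penalized by beta * KL
```

Note that KL is asymmetric: the objective uses KL(πθ∥πref), so the expectation is taken under the trained policy.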

The Closed-Form Optimal Policy

Setting the functional gradient to zero gives the optimal policy in closed form — this equation is the cornerstone of DPO's derivation:

\pi^*(y\mid x)=\frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\left(\frac{r(x,y)}{\beta}\right),\qquad Z(x)=\sum_y\pi_{\mathrm{ref}}(y|x)\exp\!\left(\tfrac{r(x,y)}{\beta}\right)
Optimal policy — reference policy re-weighted by exponentiated reward. Z(x) is the intractable partition function.
Figure 1 — Policy Distribution Shift During RL Training
The blue curve is the trained policy's probability distribution over response quality. It shifts right (toward better outputs) and sharpens as training proceeds. The dashed grey curve is the frozen reference policy. The policy learns to put more mass on high-reward outputs while the KL constraint prevents complete drift.
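The closed-form optimal policy can be evaluated exactly when the response space is tiny. The sketch below assumes a toy setting with four candidate responses, a uniform reference policy, and made-up rewards; in a real LLM the sum defining Z(x) runs over all possible sequences and cannot be computed this way.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y|x) = pi_ref(y|x) * exp(r(x,y) / beta) / Z(x) over a finite response set."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function Z(x); tractable only in this toy case
    return [w / z for w in weights]

# Hypothetical toy problem: four candidate responses to one prompt
ref_probs = [0.25, 0.25, 0.25, 0.25]
rewards   = [0.0, 0.2, 0.5, 1.0]

loose = optimal_policy(ref_probs, rewards, beta=1.0)  # strong KL penalty: stays near reference
tight = optimal_policy(ref_probs, rewards, beta=0.1)  # weak KL penalty: concentrates on best response

print([round(p, 3) for p in loose])
print([round(p, 3) for p in tight])
```

A smaller β lets the reward dominate, so probability mass piles onto the highest-reward response; a larger β keeps the result close to the reference.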

Rearranging for the Reward Signal

Taking the log of the optimal policy expression and rearranging to solve for r(x,y):

r(x,y)=\beta\log\frac{\pi^*(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}+\underbrace{\beta\log Z(x)}_{\text{only depends on }x}
The reward expressed in terms of the policy ratio — Z(x) will cancel in DPO's derivation!
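For readers who want the intermediate algebra, the rearrangement can be written out in three lines, using only the optimal-policy expression and the symbols already defined:

```latex
\begin{aligned}
\log\pi^*(y\mid x) &= \log\pi_{\mathrm{ref}}(y\mid x)+\frac{r(x,y)}{\beta}-\log Z(x)\\
\frac{r(x,y)}{\beta} &= \log\frac{\pi^*(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}+\log Z(x)\\
r(x,y) &= \beta\log\frac{\pi^*(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}+\beta\log Z(x)
\end{aligned}
```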
🔧 text Reward Design

Your reward can combine: validity (does the response sequence compile and produce a valid answer?), answer accuracy (answer score or F1 vs. target answer), and efficiency (fewer operations). These are all computable without human raters — making GRPO with rule-based rewards the ideal choice for your use case.

3

Reward Modeling

Most RLHF pipelines need a reward model: a neural network scoring a (prompt, response) pair. This section explains how it's built — and when you can skip it entirely.

Preference Data & Bradley-Terry

Instead of absolute ratings (hard, noisy), we collect pairwise preferences: given two responses yw (winner) and yl (loser) to prompt x, which is better? The Bradley-Terry model [1952] converts scores to preference probabilities:

P(y_w\succ y_l\mid x)=\sigma\!\left(r(x,y_w)-r(x,y_l)\right)
Bradley-Terry — only score differences matter; absolute scale is arbitrary

Training the reward model rφ(x,y) via maximum likelihood on preference pairs:

\mathcal{L}_{RM}(\phi)=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(r_\phi(x,y_w)-r_\phi(x,y_l)\right)\right]
Reward model loss — binary cross-entropy on pairwise preferences. Monitor preference accuracy (should reach ~70–90%)
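The Bradley-Terry probability and the reward model loss can be sketched in a few lines, operating directly on scalar scores. The score pairs below are made-up numbers standing in for rφ(x, yw) and rφ(x, yl); a real reward model would produce them from a network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_probability(score_w, score_l):
    """Bradley-Terry: P(y_w preferred over y_l) depends only on the score difference."""
    return sigmoid(score_w - score_l)

def reward_model_loss(score_pairs):
    """Mean negative log-likelihood over (winner_score, loser_score) pairs."""
    return -sum(math.log(preference_probability(w, l)) for w, l in score_pairs) / len(score_pairs)

def preference_accuracy(score_pairs):
    """Fraction of pairs where the winner is scored above the loser."""
    return sum(w > l for w, l in score_pairs) / len(score_pairs)

# Hypothetical scores for three labeled preference pairs
pairs = [(2.0, 0.5), (1.0, 1.5), (0.3, -0.7)]
print(reward_model_loss(pairs), preference_accuracy(pairs))
```

Because only differences enter the sigmoid, adding a constant to every score leaves both the loss and the accuracy unchanged, which is the "absolute scale is arbitrary" point above.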

RLHF Pipeline

Figure 2 — The Complete RLHF Pipeline
[Diagram: prompt x ∈ 𝒟 → policy LLM πθ(·|x) (being trained) → response y ~ πθ → reward model rφ(x,y) or rule-based reward → optimizer (PPO / DPO / GRPO / etc.) → gradient update → policy improves → loop repeats]
The RLHF loop. PPO requires a learned reward model (purple box). DPO bypasses it. GRPO replaces the scalar reward with a group-normalized advantage, working with either a learned RM or rule-based functions.
4

PPO — The Classic Workhorse

PPO [Schulman et al., 2017] was the dominant RLHF algorithm until DPO arrived in 2023. Understand it deeply because all subsequent methods are either simplifications of or reactions to PPO's problems.

From REINFORCE to Actor-Critic

The fundamental algorithm is policy gradient. REINFORCE computes:

\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\!\left[R(\tau)\,\nabla_\theta\log\pi_\theta(\tau)\right]
REINFORCE — unbiased but extremely high variance. Unusable without a baseline.

Subtract a learned value function baseline V(st) to get the advantage At = Q(st,at) − V(st). Estimate advantage using Generalized Advantage Estimation (GAE):

\hat{A}_t^{GAE(\gamma,\lambda)}=\sum_{l=0}^{T-t-1}(\gamma\lambda)^l\,\delta_{t+l},\qquad\delta_t=r_t+\gamma V(s_{t+1})-V(s_t)
GAE [Schulman 2015] — λ=0 gives TD(0) (low var, biased); λ=1 gives Monte Carlo (unbiased, high var)
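GAE is usually computed with a backward recursion rather than the explicit sum. The sketch below assumes a single finished episode, with `values` holding one extra bootstrap entry for the state after the last step (0.0 at a terminal state); the reward and value numbers are illustrative only.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation via the backward recursion
    A_t = delta_t + gamma * lam * A_{t+1}.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), where V(s_T) is the bootstrap value
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical 3-token episode with reward only at the end
rewards = [0.0, 0.0, 1.0]
values  = [0.5, 0.6, 0.8, 0.0]

td0 = gae_advantages(rewards, values, gamma=1.0, lam=0.0)  # one-step TD residuals
mc  = gae_advantages(rewards, values, gamma=1.0, lam=1.0)  # Monte Carlo return minus V(s_t)
print(td0, mc)
```

With λ=0 each advantage is just its own TD residual; with λ=1 and γ=1 each advantage equals the remaining return minus the critic's estimate at that step.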

The PPO Clipping Mechanism

Vanilla policy gradient takes large destabilizing steps. PPO clips the probability ratio to a trust region:

r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\mathrm{old}}(a_t\mid s_t)}
L^{CLIP}(\theta)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\;\operatorname{clip}\!\left(r_t(\theta),1-\varepsilon,1+\varepsilon\right)\hat{A}_t\right)\right]
PPO-CLIP objective — ε = 0.2 typical. The min() takes the more conservative (pessimistic) of the two estimates.
Figure 3 — PPO Clipping Mechanism
Left: when advantage A > 0 (good action), we want to increase probability but cap the gain at ratio 1+ε. Right: when A < 0 (bad action), we want to decrease probability but floor the loss at ratio 1−ε. The grey shaded regions are clipped — no gradient flows from there. This prevents overconfident, destabilizing updates.
💡 Intuition for min()

When A > 0: rt·A says "increase this action's probability." The clip stops us at (1+ε)·A — we can't be too sure. When A < 0: rt·A says "decrease this action's probability." The clip floors us at (1−ε)·A — we can't be too punishing. The min() is pessimistic: always take the smaller (safer) objective value.
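The four cases of the clipped objective are easy to check numerically. This is a per-sample sketch of the term inside the expectation, with ε = 0.2 as above; the ratio values are arbitrary illustrations.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Single-sample PPO-CLIP objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, ratio already too high: gain is capped at (1 + eps) * A
print(ppo_clip_term(1.5, 1.0))
# Bad action, ratio already pushed low: objective is floored at (1 - eps) * A
print(ppo_clip_term(0.5, -1.0))
# Good action whose probability dropped: unclipped term is smaller, so it is used
print(ppo_clip_term(0.5, 1.0))
# Bad action whose probability rose: unclipped term is smaller, so it is used
print(ppo_clip_term(1.5, -1.0))
```

In the first two cases the clipped branch is selected, so the value no longer depends on the ratio and no gradient flows. In the last two the policy has moved in the wrong direction, and the unclipped term keeps its full gradient so the update can correct it.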

PPO for LLMs: The 4-Model Problem

1
Policy model (actor) πθ — being trained
Generates responses. Receives gradient updates each step.
2
Reference model πref — frozen
Frozen copy of the initial SFT model. Computes per-token KL penalty. Never receives gradients.
3
Reward model rφ — frozen
Scores complete responses with a scalar. Trained separately before RL. Frozen during RL.
4
Value model (critic) Vψ — trained simultaneously
Estimates expected future reward from each token position. Must be trained live alongside the policy. This is the most expensive, most unstable, most painful model in the whole pipeline.

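The KL penalty is folded into the reward signal. In the simplified form below, the summed penalty is added at the final token (implementations such as TRL typically spread it across tokens instead):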

\tilde{r}_t=\begin{cases}r_\phi(x,y)-\beta\sum_{t'=1}^T\log\frac{\pi_\theta(a_{t'}|s_{t'})}{\pi_{\mathrm{ref}}(a_{t'}|s_{t'})}&t=T\\0&t<T\end{cases}
Reward with KL penalty injected at the final token. Intermediate tokens get 0 reward.
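A minimal sketch of this shaped reward, assuming the per-token log-probabilities of the sampled tokens under the policy and the reference are already available as plain lists. The numbers are made up for illustration.

```python
def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards: zero everywhere except the final token, which receives
    the reward model score minus beta times the summed log-ratio (a sample-based
    KL estimate along the generated sequence)."""
    kl_sum = sum(lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs))
    rewards = [0.0] * len(policy_logprobs)
    rewards[-1] = rm_score - beta * kl_sum
    return rewards

# Hypothetical 3-token response
policy_lp = [-1.0, -0.5, -0.2]
ref_lp    = [-1.2, -0.5, -0.6]
print(shaped_rewards(1.0, policy_lp, ref_lp, beta=0.1))
```

Here the policy assigns higher probability than the reference to two of the three tokens, so the final reward ends up slightly below the raw reward model score.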
⚠️ The Cost of PPO

For a 7B model: 4 × 7B = 28B parameters in VRAM, plus optimizer states, activations, rollout buffers. The critic is notoriously hard to train — if V(s) estimates are wrong, advantages are noisy and the policy diverges. This is the primary motivation for both DPO and GRPO.

PPO Training Code (TRL)

 ppo_training.py
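# NOTE: this snippet follows the legacy PPOTrainer interface (trl < 0.12).
# Newer TRL releases changed PPOTrainer substantially, so pin the version or adapt the calls.
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
from datasets import Dataset
import torch

# PPO needs four roles; TRL packs policy + critic into one model with a value head
model     = AutoModelForCausalLMWithValueHead.from_pretrained("your-sft-model")  # policy + critic
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("your-sft-model")  # reference (frozen)
tokenizer = AutoTokenizer.from_pretrained("your-sft-model")
tokenizer.pad_token = tokenizer.eos_token
reward_pipe = pipeline("text-classification", model="your-reward-model")  # reward model

# Placeholder prompts: replace with your own data (you need at least batch_size prompts)
prompts = ["Explain the trade-offs between PPO and DPO."] * 64
dataset = Dataset.from_dict({"input_ids": [tokenizer.encode(p) for p in prompts]})
dataset.set_format(type="torch")

def collator(data):
    # Keep variable-length prompts as lists of tensors instead of stacking them
    return {key: [d[key] for d in data] for key in data[0]}

config = PPOConfig(
    learning_rate=1.4e-5,
    batch_size=16,
    mini_batch_size=4,
    ppo_epochs=4,         # PPO epochs per rollout batch
    lam=0.95,             # GAE lambda
    cliprange=0.2,        # epsilon in PPO clip
    init_kl_coef=0.2,     # beta (KL penalty coefficient)
    early_stopping=True,  # required for target_kl to take effect
    target_kl=0.1,        # stop updates if KL exceeds this
)

trainer = PPOTrainer(config=config, model=model, ref_model=ref_model, tokenizer=tokenizer,
                     dataset=dataset, data_collator=collator)

for batch in trainer.dataloader:
    query_tensors = batch["input_ids"]
    # Generate: sample from the current policy, returning only the new tokens
    response_tensors = trainer.generate(query_tensors, return_prompt=False,
                                        max_new_tokens=512, do_sample=True, temperature=0.7)
    responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
    # Score with reward model
    rewards = [torch.tensor(r["score"]) for r in reward_pipe(responses)]
    # PPO update (computes advantages using the value head internally)
    stats = trainer.step(query_tensors, response_tensors, rewards)
    trainer.log_stats(stats, batch, rewards)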
5

DPO — The Elegant Simplification

In 2023, Rafailov et al. published an insight so clean it almost feels like a trick: you don't need a reward model or an RL loop. The reward is already implicit in the ratio between the trained and reference policies.

The Key Insight: Z(x) Cancels

Recall from Section 2: r(x,y) = β log(π*(y|x)/πref(y|x)) + β log Z(x). Now plug this into the Bradley-Terry preference model:

🧮 DPO Derivation — Step by Step
Start with
P(y_w\succ y_l\mid x)=\sigma\!\left(r(x,y_w)-r(x,y_l)\right)
Bradley-Terry model
Substitute r(x,y)
r(x,y_w)-r(x,y_l)=\beta\log\frac{\pi^*(y_w|x)}{\pi_\mathrm{ref}(y_w|x)}-\beta\log\frac{\pi^*(y_l|x)}{\pi_\mathrm{ref}(y_l|x)}+\underbrace{\beta\log Z(x)-\beta\log Z(x)}_{=\,0}
Z(x) cancels!
MLE over dataset
\mathcal{L}_{DPO}(\pi_\theta;\pi_\mathrm{ref})=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_\mathrm{ref}(y_w|x)}-\beta\log\frac{\pi_\theta(y_l|x)}{\pi_\mathrm{ref}(y_l|x)}\right)\right]
Final DPO loss
✨ The Magic: Z(x) Cancels

Z(x) = ∑y πref(y|x) exp(r(x,y)/β) is intractable — you'd need to sum over all possible responses. But in the Bradley-Terry model, only the difference of rewards matters. Since Z(x) is the same for yw and yl (same prompt), it cancels exactly. This makes DPO computationally trivial.
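Once Z(x) is gone, the loss for one preference pair needs only four sequence log-probabilities. The sketch below assumes those summed log-probabilities have already been computed; the values are illustrative, not from a real model.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from sequence log-probabilities."""
    chosen_logratio   = policy_logp_w - ref_logp_w   # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = policy_logp_l - ref_logp_l   # log pi_theta(y_l|x) / pi_ref(y_l|x)
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, so the margin is 0 and the loss is log 2
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# After the policy raises the chosen response and lowers the rejected one, the loss drops
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

The quantities `beta * chosen_logratio` and `beta * rejected_logratio` are the implicit rewards that DPO training tools usually report.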

What DPO Actually Does

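Taking the gradient of the DPO loss shows that training simultaneously:

  • increases the likelihood of the chosen response yw
  • decreases the likelihood of the rejected response yl
  • weights each pair by how strongly the implicit reward currently misorders it, so pairs the model already ranks correctly with confidence contribute little gradient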

Figure 4 — DPO Training Dynamics
During DPO training: the log-ratio for the chosen response (green bar) increases above zero (above reference) while the rejected response (red bar) decreases below zero (below reference). The margin between them widens, making the preference more confident. Both start at 0 (identical to reference policy at initialization).
✅ DPO Advantages

Only 2 models (policy + reference). Simple cross-entropy loss. Stable training. No reward model training needed. No RL loop. Much less memory than PPO. Easy to implement with TRL in <20 lines.

⚠️ DPO Limitations

Offline: fixed dataset, no exploration. Requires preference pairs (harder to collect than scalar rewards). Can overfit to data distribution. Distribution shift when reference is stale. Can degrade the quality of preferred responses under certain conditions.

DPO Training Code (TRL)

 dpo_text.py
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import torch

model = AutoModelForCausalLM.from_pretrained(
    "your-sft-llm-model", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("your-sft-llm-model")

# Dataset format: prompt + chosen + rejected
# For text: auto-generate pairs by scoring SFT outputs with your reward function
# chosen  = highest-scoring generated sequence (or ground truth)
# rejected = lowest-scoring generated sequence
N = 64  # placeholder size for this toy dataset; use real, varied prompts in practice
dataset = Dataset.from_dict({
    "prompt":   ["Explain the trade-offs between PPO and DPO."] * N,
    "chosen":   ["A clear, well-cited two-paragraph answer..."] * N,  # high score
    "rejected": ["A short, vague answer with no citations..."]   * N,  # low score
})

config = DPOConfig(
    output_dir="./dpo-llm",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,      # conservative LR for DPO
    beta=0.1,                # KL coefficient beta
    loss_type="sigmoid",     # standard DPO; alternatives: "ipo", "hinge"
    max_length=2048,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL auto-creates frozen copy of model
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Key metrics to monitor in wandb/tensorboard:
# rewards/chosen (should increase), rewards/rejected (should decrease)
# rewards/margins = chosen - rejected (should widen)
# rewards/accuracies (fraction of pairs where chosen outscores rejected; should rise above 0.5)
# logps/chosen, logps/rejected (log probabilities; watch for both collapsing together)
6

GRPO — DeepSeek's Modern Workhorse

GRPO (Group Relative Policy Optimization) was introduced in DeepSeekMath [Shao et al., 2024] and became the core of DeepSeek-R1 [DeepSeek-AI, 2025]. It achieves something remarkable: PPO-level online exploration without a critic model.

The Problem GRPO Solves

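PPO's critic Vψ(st) estimates expected future reward from each token position. For LLMs this is:

  • expensive: a second trainable network, typically as large as the policy
  • hard to learn: the reward arrives only at the end of generation, so per-token value targets are noisy
  • destabilizing: when the value estimates are wrong, the advantages are noisy and the policy can diverge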

GRPO's key insight: If you generate a group of G responses to the same prompt and score them all, the group mean reward is a natural, critic-free baseline estimate. The advantage of each response is simply how much better or worse it is than the group average.

The GRPO Algorithm

1
Sample G responses from the current policy
For each prompt x, sample: {y1, y2, ..., yG} ~ πold(·|x). Typical G = 4–16. Higher G = more stable advantage estimates.
2
Score all G responses
Compute rewards: {r1, ..., rG} using your reward function (learned RM or rule-based). For text: validity + accuracy + efficiency.
3
Normalize rewards within the group
Compute group statistics and normalize. This is the critic replacement: responses above the group mean get positive advantage; below get negative.
4
PPO-style clipped update
Apply the standard PPO clip objective using the group-normalized advantages. No value model, no GAE, no separate critic training.
\hat{A}_i=\frac{r_i-\mu_r}{\sigma_r},\qquad\mu_r=\frac{1}{G}\sum_{j=1}^G r_j,\quad\sigma_r=\sqrt{\frac{1}{G}\sum_{j=1}^G(r_j-\mu_r)^2}
GRPO advantage — group-normalized reward. Critic-free. Simple. Effective.
\mathcal{L}_{GRPO}(\theta)=-\frac{1}{G}\sum_{i=1}^G\!\left[\min\!\left(\frac{\pi_\theta(y_i|x)}{\pi_\mathrm{old}(y_i|x)}\hat{A}_i,\;\mathrm{clip}\!\left(\frac{\pi_\theta(y_i|x)}{\pi_\mathrm{old}(y_i|x)},1-\varepsilon,1+\varepsilon\right)\hat{A}_i\right)-\beta\,D_{KL}\!\left(\pi_\theta\|\pi_\mathrm{ref}\right)\right]
GRPO objective — identical to PPO-CLIP but advantages come from group normalization, not a critic
Figure 5 — GRPO Group Sampling & Advantage Normalization
For each prompt, G=8 responses are sampled and scored. Rewards are normalized within the group (z-score). Green bars have positive advantage (above mean → increase their probability). Red bars have negative advantage (below mean → decrease probability). The mean line (amber dashes) is the critic-free baseline.
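The group normalization step is small enough to write out in full. This sketch follows the advantage formula above (population standard deviation over the group); the small `eps` in the denominator is an added assumption to avoid division by zero when every response gets the same reward, and the reward values are made up.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Z-score the rewards of G responses sampled for the same prompt."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of G=4 responses
print(group_advantages([1.0, 0.0, 1.0, 0.0]))   # above-mean responses get +1, below-mean get -1
# If every response scores the same, all advantages are zero: the group carries no signal
print(group_advantages([0.7, 0.7, 0.7, 0.7]))
```

The second case is why reward diversity within a group matters: a prompt whose samples all succeed or all fail contributes nothing to the update.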

Rule-Based Rewards: GRPO's Superpower

🔧 Why GRPO + Rule-Based = Perfect for text

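DeepSeek-R1 relied on rule-based rewards (math correctness, code execution) rather than a learned reward model for its reasoning RL, citing the risk of reward hacking with neural reward models. Rule-based rewards fit any domain with verifiable answers, and text generation with checkable outputs is exactly such a domain: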

  • Validity (binary): Does the response sequence parse and execute? Does it produce a valid answer body?
  • Answer accuracy (continuous): reward = exp(−α · answer_score(generated, target)), where answer_score is treated as a distance-like error (0 = perfect match)
  • Topological match: Does the number of snippets/sources/results match expectations?
  • Efficiency (continuous): reward = max(0, 1 − num_ops / max_ops)

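Rule-based rewards are much harder to game than learned reward models, although a loosely specified rule can still be exploited (see Reward Hacking under Common Pitfalls). This robustness is a key reason DeepSeek-R1 used rule-based rewards instead of a neural reward model.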

GRPO Training Code (TRL)

 grpo_llm.py
from trl import GRPOTrainer, GRPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, re, math

# ── text Reward Functions (rule-based, no learned RM needed) ──────────────────

VALID_OPS = {"query","read_doc","revolve","citation","check_answer","boolean_union","boolean_subtract","loft","sweep"}

def parse_ops(text):
    return re.findall(r'(\w+)\([^)]*\)', text)

def validity_reward(completion, **kwargs):
    ops = parse_ops(completion)
    if not ops: return 0.0
    if any(op not in VALID_OPS for op in ops): return 0.0
    return 1.0

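def text_reward(completion, reference_text=None, **kwargs):
    # run_model and answer_score are placeholders you must supply:
    # run_model executes the parsed ops in your sandbox, and answer_score
    # returns a distance-like error (0 = perfect match)
    ops = parse_ops(completion)
    if not ops: return 0.0                   # nothing to execute
    if reference_text is None: return 0.5    # no target available: neutral score
    try:
        output = run_model(ops)          # your sandbox
        cd   = answer_score(output, reference_text)
        return float(math.exp(-2.0 * cd))   # 1.0 = perfect, 0.0 = far off
    except Exception:
        return 0.0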

def efficiency_reward(completion, **kwargs):
    ops = parse_ops(completion)
    return max(0.0, 1.0 - len(ops) / 50.0)  # penalize > 50 operations

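def text_reward_fn(completions, prompts=None, reference_text=None, **kwargs):
    """Main GRPO reward function. Returns list[float], one per completion."""
    # TRL passes extra dataset columns (assumed here: "reference_text") as lists
    # aligned with completions, so index them per sample
    rewards = []
    for i, completion in enumerate(completions):
        ref     = reference_text[i] if reference_text is not None else None
        r_valid = validity_reward(completion)
        r_geom  = text_reward(completion, reference_text=ref)
        r_eff   = efficiency_reward(completion)
        rewards.append(0.4 * r_valid + 0.5 * r_geom + 0.1 * r_eff)
    return rewards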

# ── Model + Config ────────────────────────────────────────────────────────────

model     = AutoModelForCausalLM.from_pretrained("your-dpo-llm-model", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("your-dpo-llm-model")

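# NOTE: GRPOConfig argument names have changed between TRL releases; check your installed version.
# TRL also requires the effective batch size to be divisible by num_generations.
config = GRPOConfig(
    output_dir="./grpo-llm",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-7,          # lower than DPO: online updates are noisier
    num_generations=8,           # G: group size. Higher = more stable, more compute
    max_completion_length=1024,  # cap on generated tokens per completion
    temperature=0.9,             # need diversity in the group; don't set too low
    beta=0.04,                   # KL penalty coefficient
    epsilon=0.2,                 # PPO clip range (recent TRL releases; remove if your version rejects it)
    bf16=True,
    gradient_checkpointing=True,
    save_steps=200,
    logging_steps=10,
)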

# train_dataset is not defined in this snippet: supply a datasets.Dataset with a "prompt"
# column (plus a "reference_text" column if your reward function uses it)
trainer = GRPOTrainer(
    model=model,
    reward_funcs=text_reward_fn,   # accepts list of reward functions too
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)

# Key training metrics (exact logged names differ between TRL versions):
# reward/mean:   group average reward (should increase steadily)
# reward/std:    diversity within group (if -> 0: mode collapse)
# policy/kl:     KL from reference (should stay < 0.5; if spikes: reduce lr)
# policy/entropy: generation diversity (should stay positive)
trainer.train()
7

The Algorithm Zoo

Beyond PPO, DPO, and GRPO, there's a rapidly expanding ecosystem. Here's the complete map.

Algorithm Family Tree

RLHF master objective: max E[r] − β KL

  • On-policy + critic: PPO (2017)
  • On-policy, no critic: GRPO (2024)
  • Simplified: REINFORCE++ (2024)
  • Offline preferences: DPO (2023)
  • Regularized loss: IPO (2023)
  • No ref model: ORPO (2024)
  • Length-norm: SimPO (2024)
  • Independent lineage: KTO (2024)

Comparison Table

| Method | Models Needed | Data Type | Online? | Key Formula Idea | Best For |
|---|---|---|---|---|---|
| PPO | 4× model | Rollouts | Yes | clip(r, 1±ε)·A | Complex RM, chat alignment |
| DPO | 2× model | Pref. pairs | No | −log σ(β·Δlog-ratio) | Stable offline training, clean data |
| GRPO | 2× model | Rollouts | Yes | PPO clip + group norm A | Verifiable rewards, reasoning, text ⭐ |
| IPO | 2× model | Pref. pairs | No | (Δlog-ratio − 1/(2β))² | Avoids DPO degenerate solutions |
| KTO | 2× model | Unpaired labels | No | Kahneman-Tversky value fn | No paired data; just (good/bad) labels |
| ORPO | 1× model | Pref. pairs | No | SFT + λ·odds-ratio | Minimal memory; SFT + alignment in one pass |
| SimPO | 1× model | Pref. pairs | No | −log σ(β·Δavg-logprob − γ) | No reference model, better length handling |

Key Equations for IPO and ORPO

\mathcal{L}_{IPO}=\mathbb{E}\!\left[\left(\log\frac{\pi_\theta(y_w|x)}{\pi_\mathrm{ref}(y_w|x)}-\log\frac{\pi_\theta(y_l|x)}{\pi_\mathrm{ref}(y_l|x)}-\frac{1}{2\beta}\right)^{\!2}\right]
IPO [Azar et al. 2023] — identity loss avoids DPO's degenerate deterministic solution
\mathcal{L}_{ORPO}=\mathcal{L}_{SFT}-\lambda\,\mathbb{E}\!\left[\log\sigma\!\left(\log\frac{\mathrm{odds}_\theta(y_w|x)}{\mathrm{odds}_\theta(y_l|x)}\right)\right],\quad\mathrm{odds}_\theta(y|x)=\frac{\pi_\theta(y|x)}{1-\pi_\theta(y|x)}
ORPO [Hong et al. 2024] — no reference model needed; odds ratio as relative preference signal
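Both losses can be sketched per pair from log-probabilities. The IPO function takes the two policy-to-reference log-ratios directly. The ORPO function takes sequence log-probabilities under the policy only and returns just the odds-ratio term, without the SFT loss or the λ weight. One assumption to flag: ORPO implementations typically use length-averaged log-probabilities so that p stays well below 1, and the inputs here are treated that way. All numbers are illustrative.

```python
import math

def ipo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """IPO: squared distance between the log-ratio gap and the target 1 / (2 * beta)."""
    return (chosen_logratio - rejected_logratio - 1.0 / (2.0 * beta)) ** 2

def log_odds(logp):
    """log(p / (1 - p)) for p = exp(logp), with logp < 0."""
    return logp - math.log1p(-math.exp(logp))

def orpo_odds_ratio_term(logp_w, logp_l):
    """-log sigmoid(log odds(y_w) - log odds(y_l)); needs no reference model."""
    z = log_odds(logp_w) - log_odds(logp_l)
    return math.log1p(math.exp(-z))

# IPO pushes the gap toward 1/(2*beta) and penalizes overshooting as well as undershooting
print(ipo_loss(5.0, 0.0, beta=0.1), ipo_loss(9.0, 0.0, beta=0.1))
# ORPO term shrinks as the chosen response becomes more likely than the rejected one
print(orpo_odds_ratio_term(-0.5, -0.5), orpo_odds_ratio_term(-0.3, -1.2))
```

The IPO target is what prevents the unbounded margin growth that DPO's sigmoid loss permits: once the gap reaches 1/(2β), widening it further increases the loss.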
8

Practical Guide for Your text Model

Theory done. Here is a concrete action plan for post-training your multimodal LLM.

Which Algorithm Should You Use?

✅ Recommendation: DPO first, then GRPO

Phase 1 (DPO): Collect preference pairs from your SFT model. Generate 2–4 outputs per training prompt, score them with your reward function, and take the best and worst as chosen and rejected. Train DPO. Fast to set up, very stable, good baseline.

Phase 2 (GRPO): Once the DPO baseline is in place, switch to GRPO with your rule-based text reward. GRPO explores online and will push beyond what offline DPO can achieve. With verifiable text rewards (validity + answer score), this is the strongest approach available.

Training Decision Tree

?
Do you have automated/verifiable rewards?
Can you programmatically evaluate correctness? (output validity, math, code execution) → YES: Use GRPO. When correctness can be checked exactly, a rule-based reward is usually more reliable than a learned RM on the same domain.
?
Only pairwise preference data, no automated reward?
Human raters pick A vs B → DPO or ORPO. Very limited memory (1 GPU)? → ORPO (1 model). Can't collect pairs, only (good/bad) labels? → KTO.
?
Memory constrained?
Running on 1 GPU? → ORPO or SimPO (1× model). Want online but cheap? → GRPO with LoRA on the policy (ref model stays full precision).
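The decision tree above can be condensed into a small helper. This is only a restatement of the three questions as code: the function name, argument names, and the fallback message are hypothetical, and real choices will depend on details the tree leaves out.

```python
def choose_algorithm(verifiable_reward, paired_preferences=False,
                     binary_labels_only=False, memory_constrained=False):
    """Map the three decision-tree questions to a suggested post-training method."""
    if verifiable_reward:
        # Programmatic correctness checks available: go online with GRPO
        return "GRPO + LoRA" if memory_constrained else "GRPO"
    if paired_preferences:
        # Human A-vs-B comparisons: offline preference optimization
        return "ORPO" if memory_constrained else "DPO"
    if binary_labels_only:
        # Only good/bad labels on single responses
        return "KTO"
    return "collect preference data or define a reward first"

print(choose_algorithm(verifiable_reward=True))
print(choose_algorithm(verifiable_reward=False, paired_preferences=True, memory_constrained=True))
print(choose_algorithm(verifiable_reward=False, binary_labels_only=True))
```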

Full Two-Phase Pipeline

 full_pipeline.py
"""
Full text Post-Training: Phase 1 (DPO) + Phase 2 (GRPO)
Requires a TRL release that includes GRPOTrainer (trl >= 0.14); config argument names vary between releases
"""
from trl import DPOTrainer, DPOConfig, GRPOTrainer, GRPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import torch, re, math

# ═══════════════════════════════════════════════════════
# REWARD FUNCTIONS (used in both phases)
# ═══════════════════════════════════════════════════════
VALID_OPS = {"query","read_doc","revolve","citation","check_answer",
             "boolean_union","boolean_subtract","loft","sweep"}

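def text_reward(completions, prompts=None, target=None, **kwargs):
    # quality_score(completion, target) -> float in [0, 1] is your own implementation
    rewards = []
    for i, completion in enumerate(completions):
        ops = re.findall(r'(\w+)\([^)]*\)', completion)
        if not ops:
            rewards.append(0.0); continue
        # target is one shared reference (phase 0) or a per-sample list (GRPO dataset column)
        tgt       = target[i] if isinstance(target, (list, tuple)) else target
        valid     = all(op in VALID_OPS for op in ops)
        r_valid   = 1.0 if valid else 0.0
        r_quality = quality_score(completion, tgt)   # your impl
        r_eff     = max(0.0, 1.0 - len(ops) / 50.0)
        rewards.append(0.4 * r_valid + 0.5 * r_quality + 0.1 * r_eff)
    return rewards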

# ═══════════════════════════════════════════════════════
# PHASE 0: Build DPO Dataset from SFT outputs
# ═══════════════════════════════════════════════════════
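def build_preference_data(sft_model, tokenizer, questions, n_per_prompt=4):
    chosen, rejected, prompts = [], [], []
    for q in questions:
        prompt  = f"{q.text}\nAnswer:"
        inputs  = tokenizer(prompt, return_tensors="pt").to(sft_model.device)
        # Sample n_per_prompt completions for this prompt in one generate() call
        out_ids = sft_model.generate(**inputs, do_sample=True, temperature=0.9,
                                     max_new_tokens=1024, num_return_sequences=n_per_prompt)
        # Drop the prompt tokens so only the completion is scored and stored
        outputs = tokenizer.batch_decode(out_ids[:, inputs["input_ids"].shape[1]:],
                                         skip_special_tokens=True)
        scores  = text_reward(outputs, target=q.reference)
        best    = max(range(n_per_prompt), key=lambda i: scores[i])
        worst   = min(range(n_per_prompt), key=lambda i: scores[i])
        if scores[best] > scores[worst] + 0.1:   # only add meaningful gaps
            prompts.append(prompt)
            chosen.append(outputs[best])
            rejected.append(outputs[worst])
    return Dataset.from_dict({"prompt": prompts, "chosen": chosen, "rejected": rejected})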

# ═══════════════════════════════════════════════════════
# PHASE 1: DPO
# ═══════════════════════════════════════════════════════
def phase1_dpo(sft_model_path, dpo_dataset):
    model     = AutoModelForCausalLM.from_pretrained(sft_model_path, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(sft_model_path)
    config    = DPOConfig(
        output_dir="ckpt/dpo", num_train_epochs=2,
        per_device_train_batch_size=2, gradient_accumulation_steps=8,
        learning_rate=5e-7, beta=0.1, max_length=2048,
        bf16=True, gradient_checkpointing=True, warmup_ratio=0.05,
    )
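    trainer = DPOTrainer(model=model, ref_model=None, args=config,
                         train_dataset=dpo_dataset, processing_class=tokenizer)
    trainer.train()
    model.save_pretrained("ckpt/dpo-final")
    tokenizer.save_pretrained("ckpt/dpo-final")   # phase 2 reloads the tokenizer from this path
    return "ckpt/dpo-final"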

# ═══════════════════════════════════════════════════════
# PHASE 2: GRPO
# ═══════════════════════════════════════════════════════
def phase2_grpo(dpo_ckpt, train_dataset):
    model     = AutoModelForCausalLM.from_pretrained(dpo_ckpt, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(dpo_ckpt)
    # Argument names follow recent TRL releases; older versions may differ
    config    = GRPOConfig(
        output_dir="ckpt/grpo", num_train_epochs=3,
        per_device_train_batch_size=1, gradient_accumulation_steps=16,
        learning_rate=2e-7, num_generations=8, max_completion_length=1024,
        temperature=0.9, beta=0.04, epsilon=0.2,
        bf16=True, gradient_checkpointing=True, save_steps=200,
    )
    GRPOTrainer(model=model, reward_funcs=text_reward, args=config,
                train_dataset=train_dataset, processing_class=tokenizer).train()

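if __name__ == "__main__":
    sft_ckpt   = "your-sft-model"
    questions  = load_text_dataset()               # your data loader: items with .text and .reference
    dpo_data   = build_preference_data(AutoModelForCausalLM.from_pretrained(sft_ckpt),
                                       AutoTokenizer.from_pretrained(sft_ckpt), questions)
    dpo_ckpt   = phase1_dpo(sft_ckpt, dpo_data)
    # GRPOTrainer expects a Dataset with a "prompt" column; extra columns reach the reward function
    grpo_data  = Dataset.from_dict({"prompt": [f"{q.text}\nAnswer:" for q in questions],
                                    "target": [q.reference for q in questions]})
    phase2_grpo(dpo_ckpt, grpo_data)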

Hyperparameter Starting Points

| Setting | DPO | GRPO | Notes |
|---|---|---|---|
| Learning rate | 5e-7 | 2e-7 | DPO is offline; GRPO is noisier so go lower |
| β (KL coeff) | 0.1 | 0.04 | Too high = no learning. Too low = KL collapse. Start here. |
| Batch size | 16 effective | 16 effective | Use grad accum to hit this on small GPUs |
| Group size G | N/A | 8 | Increase to 16 for complex text tasks for more stable advantages |
| Temperature | N/A | 0.9 | Need diversity in the group. Don't go below 0.7. |
| Epochs | 2 | 3 | DPO overfits quickly; GRPO benefits from more steps |
| Max tokens | 2048 | 1024 | Keep generated length bounded to control cost |

Common Pitfalls & Red Flags

🚨 KL Collapse

KL divergence spikes, generation quality drops, outputs become repetitive. Fix: Reduce LR, increase β, check reward normalization (rewards at wildly different scales cause this).

🚨 Reward Hacking

Reward goes up but actual quality drops. Fix: Add more reward components, use multiple complementary rewards, reduce group size in GRPO.

🚨 Mode Collapse

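All G responses in a GRPO group become identical (reward/std → 0), so every advantage is zero and learning stalls. Fix: Increase the sampling temperature, increase the KL coefficient β so the policy stays closer to the more diverse reference, or add an explicit diversity reward.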

🚨 Sparse Reward Signal

All rewards = 0 or 1, no gradient signal in between. Fix: Use smooth rewards like exp(−α·cd) instead of binary thresholds. Add partial credit for syntactically valid but semantically wrong outputs.
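A small comparison shows why the smooth form helps. The sketch assumes a group of four responses that all miss a hypothetical pass/fail threshold but differ in how far off they are; the error values, threshold, and α are made up.

```python
import math

def binary_reward(error, threshold=0.1):
    """1.0 only if the error is within the threshold, else 0.0."""
    return 1.0 if error <= threshold else 0.0

def smooth_reward(error, alpha=2.0):
    """exp(-alpha * error): 1.0 at zero error, decaying smoothly as error grows."""
    return math.exp(-alpha * error)

# Hypothetical group: every response fails the threshold, but some are closer than others
errors = [0.5, 0.4, 0.3, 0.2]

binary = [binary_reward(e) for e in errors]
smooth = [smooth_reward(e) for e in errors]
print(binary)                          # all equal: zero group variance, no learning signal
print([round(r, 3) for r in smooth])   # ordered by closeness: usable advantages
```

With the binary reward, group normalization sees four identical values and produces zero advantage for every response. The smooth reward still ranks the near misses, so the policy gets a gradient toward the better attempts.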

REF

References

  1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
  2. Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438
  3. Stiennon, N., Ouyang, L., et al. (2020). Learning to Summarize from Human Feedback. NeurIPS 2020. arXiv:2009.01325
  4. Ouyang, L., Wu, J., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022. arXiv:2203.02155
  5. Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. arXiv:2305.18290
  6. Shao, Z., Wang, P., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300
  7. DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948
  8. Azar, M. G., et al. (2023). A General Theoretical Paradigm to Understand Learning from Human Feedback. arXiv:2310.12036
  9. Ethayarajh, K., et al. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. arXiv:2402.01306
  10. Hong, J., et al. (2024). ORPO: Monolithic Preference Optimization without Reference Model. arXiv:2403.07691
  11. Meng, Y., et al. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward. arXiv:2405.14734
  12. Williams, R.J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (REINFORCE). Machine Learning, 8, 229–256.
  13. Bradley, R.A. & Terry, M.E. (1952). Rank Analysis of Incomplete Block Designs I: The Method of Paired Comparisons. Biometrika, 39(3/4), 324–345.