Part 3 of the Post-Training Series

Agentic RL

When the model stops generating text and starts acting on the world. Multi-step. Tool-using. Self-correcting. And much harder to train — here is why, and how.

🔁 Continues from PPO + Beyond PPO 🛠️ Tool-use focus 🎬 5 animations 🔧 research agent end-to-end
00

What Changes When the Model Acts

In the PPO and GRPO guides, the model generated a sequence of tokens and received a single scalar reward at the end. Everything happened inside the model. There was no outside world.

In agentic RL, the model reaches out and touches external systems: it executes code, calls APIs, queries databases, renders output, browses the web, writes files. Each action changes the state of the world. The model then observes what happened and decides what to do next. This loop repeats until the task is done.

| Aspect | LLM Post-Training (PPO/GRPO) | Agentic RL |
|---|---|---|
| What an action is | One token from a 30k+ vocabulary | A structured tool call with parameters |
| State | Prompt + generated tokens so far | Conversation history + external world state |
| Observation | Previous tokens (generated by the model itself) | Return value from executing a tool |
| Episode length | 50–500 tokens, one pass | 5–50+ tool-call steps, multi-turn |
| Reward timing | Single score at sequence end | Score at task completion (many steps later) |
| Side effects | None: text only, no world change | Real: files written, code run, output built |
| Self-correction | Cannot: generation is one-shot | Yes: can observe result and adjust |
| Credit assignment | Hard (7–500 tokens) | Much harder (many steps, external state) |
| Rollout cost | Cheap: just forward passes | Expensive: real tool executions |
🔧 The Tool-Use Shift

One-shot model: Reads the user's question. Generates one long answer in a single pass. Gets scored at the end. Done. Cannot check whether the partial answer was actually correct along the way.

Research agent: Reads the question. Calls search(q="topic") and sees a list of source links. Calls read_doc(doc_id) and gets the body. Calls check_answer(draft) and gets a quality score like "0.45, weak coverage". Revises the draft. Searches again with a tighter query, cites the new source, runs the check again. Loops until the score crosses the threshold. This is how a person actually researches an answer.

01

Notation — The Agentic MDP

All PPO and GRPO notation carries over unchanged. New symbols added for the agentic setting:

New symbols (all prior notation π_θ, π_ref, r_φ, ε, β, λ, γ, δ_t, Â_t is unchanged):

| Symbol | Name | Meaning |
|---|---|---|
| o_t | Observation at step t | What the agent receives from the environment after taking action a_{t-1}. For a research agent: current draft, validity string, answer score, sources list, etc. |
| a_t | Agent action at step t | Now a structured tool call, not a single token. E.g. read_doc(doc_id="doc_017", n=20). Contains a tool name and parameters. |
| h_t | History at step t | Everything the agent has seen: h_t = (o_0, a_0, o_1, a_1, …, o_t). This is the agent's full context. The policy conditions on h_t, not just s_t. |
| τ | Trajectory | A complete episode: τ = (o_0, a_0, o_1, a_1, …, o_T, R). T steps, each with one action and one observation, terminated by a reward. |
| T | Horizon | Maximum number of agent steps per episode. For research agents: 10–25. For software engineering agents: up to 50+. Counts tool-call steps, not tokens, so it is not directly comparable to the "T" in PPO (token count); each agent step itself spans many tokens. |
| E | Environment | The external system that accepts actions and returns observations. For a research agent: the sandbox that executes the tools. Stateful: changes persist across steps within an episode. |
| R(τ) | Trajectory reward | Scalar reward for the complete episode. May be a single outcome reward at the end, or a sum of per-step rewards: R(τ) = Σ_t r_t. |
| r_t | Step reward | Reward at step t. Often 0 for all steps except the last. Can be non-zero for intermediate steps (progress reward, error penalty). |
| 𝒜 | Action space | The set of all possible tool calls. Unlike the token vocabulary (discrete, 30k items), 𝒜 is structured: finite tool names but continuous or discrete parameters. |

The POMDP: Partial Observability

In classic RL, the agent sees the full environment state st. In agentic settings, the agent only sees what tools return. This is a Partially Observable MDP (POMDP). The agent must maintain its own internal state estimate from accumulated observations.

For a research agent: the true state includes the full reasoning chain, every source ID, and the properties of each source. But the agent only receives what it explicitly queries: a current draft, an answer score, a validity flag. It must infer the rest from what it has seen so far.

\pi_\theta(a_t \mid h_t),\quad h_t = (o_0, a_0, o_1, a_1, \ldots, o_t)
The agent policy conditions on the full history h_t, not just the current state. The transformer's context window IS the history.
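
A minimal sketch of what conditioning on h_t can look like in practice, assuming the history is simply serialized into the prompt as alternating observation and action lines. The `History` class and its method names are hypothetical and chosen for illustration only; the guide does not prescribe a data structure.

```python
from dataclasses import dataclass, field


@dataclass
class History:
    """h_t = (o_0, a_0, o_1, a_1, ..., o_t), kept as an ordered list of turns."""

    turns: list[tuple[str, str]] = field(default_factory=list)  # (kind, text)

    def add_observation(self, obs: str) -> None:
        self.turns.append(("Observation", obs))

    def add_action(self, action: str) -> None:
        self.turns.append(("Action", action))

    def as_context(self) -> str:
        """Serialize h_t into the text the policy is conditioned on."""
        return "\n".join(f"{kind}: {text}" for kind, text in self.turns)


h = History()
h.add_observation("Question: what is topic X?")
h.add_action('search(q="topic X", n=3)')
h.add_observation("3 results returned. Top result id=doc_017.")

context = h.as_context()  # this string is what pi_theta(a_t | h_t) sees
print(context)
```

Because the context grows with every turn, the horizon T is bounded in practice by the context window as well as by the step budget.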
02

The Agent Loop

Every agentic system is built around one cycle: Observe → Reason → Act → Observe. The model is the reasoning and acting component. The environment handles state and consequences.

Figure 1 — The Agent Loop: One Answer Generation Episode
One episode of a research agent writing an answer to a user question. Each horizontal band is one step: the agent observes a result, reasons about its next action, executes the tool, and receives a new observation. Reward arrives only at the very end, after all steps complete, which makes credit assignment the central challenge.

The ReAct Pattern: Think Before You Act

The dominant agent architecture wraps each action in a chain-of-thought reasoning step. Before outputting the tool call, the model generates a thought — a scratchpad that makes implicit reasoning explicit and helps the model plan better actions.

 react_trace.txt — one agent step
Observation: [Current draft answers the main question but is missing specifics on follow-ups.]
              Answer score against rubric: 0.34  (66% match)

Thought: The base answer covers the high-level definition. The rubric lists four
         follow-up topics I have not addressed yet. I should search for one of them
         and pull a paragraph that I can cite.

Action: search(q="follow-up topic 1", n=3)

Observation: 3 results returned. Top result id=doc_017, snippet 28 tokens.

Thought: doc_017 looks directly relevant. Read it and add a sentence to the draft
         that cites this source.

Action: read_doc(doc_id="doc_017", n=20)

Observation: Document body returned. Draft updated with one new sentence + citation.
             10 paragraphs, 24 cited sources so far. Still well formed.

The thought is generated autoregressively by the LLM and is part of the action token sequence. It is not a separate model — it is the same policy πθ generating reasoning text before the structured tool call.

03

Credit Assignment Across Long Horizons

We already saw in PPO that per-token credit assignment is hard when the reward only arrives at the final token. Agentic RL makes this dramatically worse.

Figure 2 — Credit Assignment: Token-Level PPO vs Agent-Level RL
Top: standard PPO on a 7-token response sequence. The reward arrives at position 6 and propagates backward through the value function — hard but manageable. Bottom: a 14-step research agent episode. The reward arrives at step 13. The value function must learn to predict the task outcome from early incomplete observations like "query created, 4 results" — a much harder prediction problem over a much longer horizon.

Why It Gets Exponentially Harder

For PPO with γ=1, the return at position t was Rt ≈ r(x,y) for all t — nearly flat, which is manageable. For an agentic trajectory:

R_t = \sum_{l=t}^{T} \gamma^{l-t}\, r_l \approx \gamma^{T-t} \cdot R(\tau)
With T=15 steps and γ=0.95: R_0 = 0.95^15 × R(τ) ≈ 0.46 × R(τ). The approximation holds when the outcome reward at step T dominates the sum. The return at the first step is less than half the actual reward.
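
The dilution is easy to check numerically. The sketch below assumes every intermediate reward is zero, so the return at step t is just the terminal reward scaled by γ^(T−t); the variable names are illustrative.

```python
gamma, T = 0.95, 15

# Fraction of the terminal reward R(tau) that reaches step t when every
# intermediate reward is zero: gamma ** (T - t).
share = [gamma ** (T - t) for t in range(T + 1)]

for t in (0, 5, 10, 15):
    print(f"step {t:2d}: {share[t]:.2f} x R(tau)")
```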

Two compounding problems:

❌ Longer horizon = more dilution

With 15 steps at γ=0.95, step 0 only sees 46% of the final reward. The value function at step 0 must make predictions based on an early partial state (just the target question), which says almost nothing about whether the final answer will be accurate.

❌ External state = larger value space

The value function Vψ(ht) must predict returns from the full history ht — not just token embeddings but current drafts, answer scores, tool outputs. The input space is orders of magnitude larger than token-level PPO.

💡 The silver lining for research agents

The observation at each step is informative. Unlike token-level PPO where intermediate tokens give no task signal, a research agent's check_answer(target) call returns a direct quality estimate at every step. This makes per-step reward design possible — and is exactly what makes agentic RL more tractable than it might seem from the credit assignment analysis alone.

04

The Action Space

In token-level PPO, every action is a draw from a 30,000-token vocabulary — a flat discrete distribution. In agentic RL, actions are structured tool calls with semantic meaning and typed parameters.

Figure 3 — Token Vocabulary vs Tool Action Space
Left: the token vocabulary — 30k+ items, mostly meaningless subword fragments. The model samples one token at a time. Right: the research agent's tool action space — 12 semantically meaningful tools, each with typed parameters. Actions are parsed from the model's generated text into structured calls. The smaller space means much more targeted exploration.

Handling the Action as Text

The agent still generates text with the same LLM; there is no separate "action decoder". The model generates a string like read_doc(doc_id="doc_017", n=20) autoregressively, token by token. This string is then parsed into a structured call and executed by the environment. Invalid strings (malformed syntax, unknown tools, out-of-range parameters) result in an error observation and optionally a small penalty reward.
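
One way to do the parsing step is sketched below. It assumes the action text is written in Python call syntax with literal arguments only, and uses the standard-library `ast` module. The function name `parse_tool_call` and the `ParseError` class are hypothetical, not part of any library.

```python
import ast


class ParseError(ValueError):
    """Raised when the generated text is not a valid tool call."""


def parse_tool_call(text: str, known_tools: set[str]) -> tuple[str, list, dict]:
    """Turn 'tool(arg, key=value)' text into (tool_name, args, kwargs)."""
    try:
        node = ast.parse(text.strip(), mode="eval").body
    except SyntaxError as exc:
        raise ParseError(f"malformed call: {text!r}") from exc

    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ParseError(f"not a tool call: {text!r}")
    if node.func.id not in known_tools:
        raise ParseError(f"unknown tool: {node.func.id}")

    try:
        # literal_eval accepts only literals, so nothing is ever executed.
        args = [ast.literal_eval(a) for a in node.args]
        kwargs = {k.arg: ast.literal_eval(k.value) for k in node.keywords}
    except (ValueError, TypeError) as exc:
        raise ParseError(f"non-literal argument in: {text!r}") from exc

    return node.func.id, args, kwargs


call = parse_tool_call('read_doc(doc_id="doc_017", n=20)', {"search", "read_doc"})
print(call)
```

Restricting arguments to literals is a deliberate choice in this sketch: the model's output is untrusted text, so it is parsed but never evaluated.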

 tool_env.py — action execution
class ParseError(Exception):
    """Raised when the action string cannot be parsed into a known tool call."""


class ExecutionError(Exception):
    """Raised when a well-formed tool call fails inside the sandbox."""


class ToolEnv:
    """The environment that accepts agent tool calls and returns observations."""

    # Tool names and parameters follow the research agent's tool table, plus reset.
    TOOLS = {
        "search":         {"params": ["q", "n"]},
        "read_doc":       {"params": ["doc_id", "n_chunks"]},
        "cite":           {"params": ["doc_id", "span"]},
        "add_paragraph":  {"params": ["text", "position"]},
        "edit_paragraph": {"params": ["paragraph_id", "new_text"]},
        "show_answer":    {"params": []},
        "check_answer":   {"params": ["draft"]},
        "check_validity": {"params": []},
        "list_sources":   {"params": []},
        "word_count":     {"params": ["paragraph_id"]},
        "undo":           {"params": []},
        "reset":          {"params": []},
    }

    def __init__(self):
        self.last_check_answer = None   # most recent check_answer distance, if any

    # _parse, _execute, _format_observation and _is_terminal are
    # sandbox-specific and are not shown here.

    def step(self, action_str: str) -> tuple[str, float, bool]:
        """
        Execute one agent action.
        Returns: (observation_str, step_reward, is_terminal)
        """
        try:
            tool, args = self._parse(action_str)         # parse text → structured call
            result = self._execute(tool, args)           # run in sandbox
            obs = self._format_observation(tool, result) # format result as text
            r_step = self._step_reward(tool, result)     # small immediate reward
            done = self._is_terminal(result)
        except ParseError:
            obs = f"ERROR: Could not parse '{action_str}'. Valid tools: {list(self.TOOLS)}"
            r_step = -0.05                               # small penalty for invalid action
            done = False
        except ExecutionError as e:
            obs = f"EXECUTION ERROR: {e}. Operation was not applied."
            r_step = -0.02                               # smaller penalty for a failed call
            done = False

        return obs, r_step, done

    def _step_reward(self, tool: str, result) -> float:
        """Immediate per-step reward signal (small, informative)."""
        if tool == "check_answer":
            # Progress reward: how much did the distance to the rubric shrink?
            # The first check is compared against a worst-case distance of 1.0.
            prev = 1.0 if self.last_check_answer is None else self.last_check_answer
            curr = result["distance"]
            self.last_check_answer = curr
            return max(0.0, (prev - curr) * 2.0)   # reward for improvement only
        # Which tools count as "finishing ops" is an assumption in this listing.
        if tool in ("cite", "edit_paragraph") and result.get("valid"):
            return 0.02                            # small reward for valid finishing ops
        return 0.0
05

Reward Design for Agents

Reward design is the hardest part of agentic RL. Unlike one-shot generation where you score a complete response, the agent produces an evolving intermediate state. What do you reward and when?

The Four Reward Components

✅ Outcome Reward (essential)

Did the task succeed? For a research agent: is the final answer semantically accurate? This is the primary signal. It is sparse (it arrives only at episode end) but the most reliable. R_outcome = 0.5·accuracy + 0.4·validity + 0.1·efficiency.

📈 Progress Reward (helpful)

Are we getting closer? For a research agent: reward proportional to the answer-score improvement at each step. Dense signal that helps with credit assignment. Risk: the agent learns to maximize check_answer improvement, not final quality.

⚡ Efficiency Reward (optional)

Fewer steps is better. Penalty of −0.01 per step beyond a minimum. Prevents the agent from calling check_answer() in a loop. Important for deployment cost.

❌ Error Penalty (guardrail)

Penalty for unparseable tool calls (−0.05) and for calls that fail during execution (−0.02), such as using non-existent document IDs. Helps the model learn the action grammar quickly without wasting training on clearly wrong actions.

R(\tau) = \underbrace{R_{\text{outcome}}}_{\text{task quality at end}} + \sum_{t=1}^{T}\underbrace{r_t^{\text{progress}}}_{\text{check\_answer improvement}} + \sum_{t=1}^{T}\underbrace{r_t^{\text{efficiency}}}_{\text{step penalty}} + \sum_{t=1}^{T}\underbrace{r_t^{\text{error}}}_{\text{invalid action penalty}}
Full trajectory reward for a research agent episode. R_outcome dominates; the other terms shape behavior and accelerate learning.
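
The four terms can be combined as in the sketch below. The −0.01 step penalty and −0.05 invalid-call penalty come from the text above; the `Step` record, the `min_steps=5` threshold, and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    progress: float = 0.0   # r_t^progress, e.g. check_answer improvement
    invalid: bool = False   # True if the tool call could not be parsed


def trajectory_reward(
    outcome: float,
    steps: list[Step],
    min_steps: int = 5,          # assumed: steps up to this count are free
    step_penalty: float = 0.01,
    error_penalty: float = 0.05,
) -> float:
    """R(tau) = R_outcome + sum of progress, efficiency and error terms."""
    r_progress = sum(s.progress for s in steps)
    r_efficiency = -step_penalty * max(0, len(steps) - min_steps)
    r_error = -error_penalty * sum(s.invalid for s in steps)
    return outcome + r_progress + r_efficiency + r_error


episode = [Step(progress=0.1), Step(), Step(invalid=True),
           Step(progress=0.1), Step(), Step(), Step()]
total = trajectory_reward(outcome=0.8, steps=episode)
print(f"R(tau) = {total:.2f}")
```

Keeping the shaping terms an order of magnitude smaller than the outcome term is one way to keep R_outcome dominant, as the formula intends.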
🔧 Research-agent advantage: progress reward is free

For open-ended language tasks, measuring "are we halfway there?" is subjective and expensive. For a rubric-scored research agent, check_answer(target) gives an exact float every step, and the agent can call it at any point. This makes training such an agent significantly easier than agentic coding or free-form writing, where intermediate progress is much harder to quantify.

06

Key Algorithms

ReAct — Reasoning + Acting (Yao et al., 2022)

The foundational pattern. No RL required — just prompting. The key insight is that interleaving chain-of-thought reasoning with tool calls makes agents dramatically more accurate than calling tools blindly.

\text{Thought}_t \to a_t \to o_t \to \text{Thought}_{t+1} \to a_{t+1} \to \cdots
ReAct trace structure. Thoughts are part of the policy output: same model, same generation pass, just text before the action.

ReAct is the zero-shot baseline for any agent. Before training with RL, test your agent in ReAct mode — it tells you whether the environment and tools are well-designed. If ReAct can't solve even easy cases, the reward signal or action space needs fixing first.

Reflexion — Verbal Reinforcement (Shinn et al., 2023)

After a failed episode, the model generates a verbal self-critique: "What did I do wrong? What should I try differently?" This reflection is prepended to the context for the next episode attempt. No gradient updates — it's in-context learning from failure.

 reflexion_example.txt
--- Episode 1 failed (check_answer = 0.52, target < 0.1) ---

Reflection: I only pulled 2 of the 4 follow-up topics before timing out, so the rubric
            kept docking the answer for missing coverage. I also stopped citing after
            the third source even though doc_017 had more relevant content I never used.
            Next attempt: pull all 4 follow-up topics first, then cite from each, and
            do not start drafting until I have a source per bullet point.

--- Episode 2 (with reflection in context) ---
Action: list_rubric_items(question)
Observation: Rubric has 4 required follow-up bullets.
Action: search(q="follow-up bullet 1", n=5)
...
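
The retry structure behind this trace can be sketched as a plain loop. Everything here is illustrative: `reflexion_loop`, `toy_episode` and `toy_reflect` are hypothetical names, and the two toy functions stand in for a real agent rollout and a real LLM self-critique call.

```python
from typing import Callable, Optional


def reflexion_loop(
    task: str,
    run_episode: Callable[[str], tuple[str, bool]],
    reflect: Callable[[str], str],
    max_attempts: int = 3,
) -> tuple[Optional[int], list[str]]:
    """Retry a task, carrying verbal reflections forward in the context."""
    reflections: list[str] = []
    for attempt in range(1, max_attempts + 1):
        # Earlier reflections are prepended to the task; no weights change.
        context = "\n".join(reflections + [task])
        trace, success = run_episode(context)
        if success:
            return attempt, reflections
        reflections.append("Reflection: " + reflect(trace))
    return None, reflections


def toy_episode(context: str) -> tuple[str, bool]:
    # Stand-in agent: succeeds only once a reflection is in its context.
    return f"trace for: {context}", "Reflection:" in context


def toy_reflect(trace: str) -> str:
    # Stand-in for an LLM call that critiques the failed trace.
    return "cover all rubric items before drafting"


attempt, notes = reflexion_loop("Answer the question.", toy_episode, toy_reflect)
print(attempt, notes)
```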

RLVR for Agents — The Modern Standard

Apply GRPO/PPO directly to agentic trajectories with verifiable outcome rewards. Same algorithm as the LLM post-training guides, but the "episode" is now a full multi-step trajectory instead of a single generation pass.

\mathcal{J}_{agent}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right] - \beta\,D_{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right)
Same master objective as PPO/GRPO — but τ is now a multi-step trajectory, not a single-step generation. R(τ) is the full episode reward.

Using GRPO for agents: sample G trajectories for the same task (same target question), compute trajectory rewards R(τ1), …, R(τG), normalize to get advantages, update policy with PPO clip. The group mean of trajectory rewards is the critic-free baseline.
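
A small sketch of the group baseline, using only the standard library. It shows the mean/std normalization described above and, for comparison, a leave-one-out (RLOO) variant. The population standard deviation and the `eps` guard are choices made here; implementations differ on both.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: (R_i - group mean) / group std, one per trajectory."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out: each baseline is the mean of the other G-1 rewards."""
    g, total = len(rewards), sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]


# Rewards of G=4 trajectories sampled for the same question.
rewards = [0.2, 0.4, 0.6, 0.8]
adv = group_advantages(rewards)
rloo = rloo_advantages(rewards)
print([round(a, 2) for a in adv])
print([round(a, 2) for a in rloo])
```

If every trajectory in a group earns the same reward, all advantages are zero and the group contributes no gradient, which is one reason diverse rollouts matter.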

Tree Search — Planning Ahead

Instead of greedily executing actions one at a time, tree search simulates multiple possible futures before committing to an action. Monte Carlo Tree Search (MCTS) is the most common variant.

Figure 4 — MCTS for Agent Planning vs Greedy Execution
Left: greedy agent executes each action immediately and cannot recover from a bad early choice. Right: MCTS agent simulates several branches before committing. At each decision point it "imagines" 3–4 continuations, evaluates them with the value function, and picks the branch with the highest expected return. This requires many more model calls but avoids getting stuck in dead ends.
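
The sketch below shows two of the MCTS building blocks, UCB1-based child selection and value backup, on a hand-built tree. It is not a full search: expansion and simulation are omitted, the candidate actions and visit counts are made up, and the exploration constant c=1.4 is an assumption.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    action: str
    visits: int = 0
    value_sum: float = 0.0
    children: list["Node"] = field(default_factory=list)

    def mean_value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def ucb1(parent: Node, child: Node, c: float = 1.4) -> float:
    """Mean value plus an exploration bonus that shrinks with visits."""
    if child.visits == 0:
        return math.inf  # always try an unvisited action once
    return child.mean_value() + c * math.sqrt(math.log(parent.visits) / child.visits)


def select_child(parent: Node) -> Node:
    return max(parent.children, key=lambda ch: ucb1(parent, ch))


def backup(path: list[Node], value: float) -> None:
    """Propagate an evaluated return to every node on the visited path."""
    for node in path:
        node.visits += 1
        node.value_sum += value


root = Node("root", visits=10)
root.children = [
    Node("re-read doc_017", visits=5, value_sum=2.5),
    Node("search new query", visits=5, value_sum=4.0),
    Node("edit paragraph", visits=0),
]
first = select_child(root)        # the unvisited child is tried first
backup([root, first], value=0.1)  # pretend the value function scored it 0.1
print(first.action, root.visits)
```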
07

Training the Agent

The training loop is PPO/GRPO at the trajectory level, but rollout collection is much more expensive and requires parallel environment management.

1
Rollout collection (expensive)
Run N parallel environments. For each, the agent generates a complete multi-step trajectory: up to T tool calls, receiving observations each time. Each trajectory produces a sequence of (h_t, a_t, r_t) tuples. Unlike token generation, this requires real tool execution at every step: searches, document reads, validity checks.
2
Compute trajectory returns R(τ)
For each trajectory, compute returns backwards: R_T = r_T (outcome reward), then R_t = r_t + γR_{t+1} for earlier steps. The outcome reward dominates; per-step rewards contribute smaller corrections.
3
Advantage estimation (GRPO-style)
Group G trajectories for the same task. Normalize: Â_i = (R(τ_i) − μ_R) / σ_R. Or use leave-one-out (RLOO) for unbiased estimates. Same advantage for every action in trajectory i — the agentic equivalent of GRPO's per-response advantage.
4
PPO-style policy update
For each (h_t, a_t) pair in each trajectory, compute the log-probability ratio πθ(a_t|h_t) / πold(a_t|h_t). Apply PPO clip with advantage Â_i. Update policy weights. The key difference from token-level PPO: a_t is now a full tool-call string, not a single token — its log-probability is the sum of log-probs of all tokens in the tool-call string.
5
Environment reset & repeat
Reset all environments (discard all draft changes, start fresh). Collect new rollouts with the updated policy. Repeat. Each iteration = one round of trajectory collection + one policy update.
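
Steps 2 and 4 reduce to a few lines of arithmetic. The sketch below uses plain Python rather than a tensor library, and follows the per-action formulation in the text: the log-probability of a tool call is the sum over its tokens, and the clip is applied to the ratio of the whole call. Function names and the numbers in the example are illustrative.

```python
import math


def discounted_returns(rewards: list[float], gamma: float = 0.95) -> list[float]:
    """R_T = r_T, then R_t = r_t + gamma * R_{t+1}, computed backwards."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def action_logprob(token_logprobs: list[float]) -> float:
    """Log-prob of a whole tool-call string = sum of its token log-probs."""
    return sum(token_logprobs)


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, eps: float = 0.2) -> float:
    """PPO clipped objective for one (h_t, a_t) pair; maximize this value."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
logp_old = action_logprob([-0.9, -0.7, -0.4])   # tokens of one tool call
logp_new = action_logprob([-0.5, -0.3, -0.2])
objective = clipped_surrogate(logp_new, logp_old, advantage=1.0)
print(returns, round(objective, 3))
```

Because a tool call spans many tokens, the summed log-probability ratio can move much further per update than a single-token ratio, so the clip becomes active sooner.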
 agentic_grpo_train.py
from trl import GRPOTrainer, GRPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# ── Multi-Turn Reward Function ────────────────────────────────────────────────
def text_agent_reward(completions, prompts, **kwargs):
    """
    completions: list of full conversation strings (all turns joined)
    Each completion is a multi-turn trace:
      "Thought: ... Action: search(...) Obs: ... Thought: ... Action: ..."
    We parse the final answer state and compute the outcome reward.

    Extra dataset columns arrive in kwargs as lists aligned with completions,
    so "target_text" is read per example. The helpers used below
    (extract_final_answer, check_validity, text_reward, efficiency_reward,
    error_penalty, count_steps, count_invalid_calls) are placeholders that
    you supply for your own environment.
    """
    targets = kwargs.get("target_text") or [None] * len(completions)
    rewards = []
    for conv, target in zip(completions, targets):
        # Parse the final state from the trajectory
        final_answer = extract_final_answer(conv)    # your text parser

        if final_answer is None:
            rewards.append(0.0)                      # no parseable answer
            continue

        r_valid    = check_validity(final_answer)
        r_accuracy = text_reward(final_answer, target)
        r_steps    = efficiency_reward(count_steps(conv))
        r_errors   = error_penalty(count_invalid_calls(conv))

        rewards.append(0.5*r_accuracy + 0.4*r_valid + 0.1*r_steps + r_errors)

    return rewards

# ── Multi-Turn Dataset ────────────────────────────────────────────────────────
# Each example contains the initial prompt (system + target question description)
# The model generates the full multi-turn conversation: thoughts + tool calls
# The environment responses are injected between turns during generation

# ── Config ────────────────────────────────────────────────────────────────────
config = GRPOConfig(
    output_dir="./llm-agent-grpo",
    num_generations=6,            # G=6 trajectories per task (expensive!)
    max_completion_length=2048,   # full multi-turn trace budget
    temperature=0.9,              # need diverse trajectories
    beta=0.02,                    # KL coefficient; lighter KL for longer generations
    learning_rate=1e-7,           # very conservative for agentic tasks
    # The effective batch must be divisible by num_generations; the exact
    # rule depends on the TRL version and the number of devices.
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=3,
    bf16=True,
)

model = AutoModelForCausalLM.from_pretrained("your-llm-sft-model", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("your-llm-sft-model")

trainer = GRPOTrainer(
    model=model,
    reward_funcs=text_agent_reward,
    args=config,
    train_dataset=text_task_dataset,   # each example: target question + task description
    processing_class=tokenizer,
)
trainer.train()
⚠️ The multi-turn generation challenge

Standard GRPO generates a complete response in one pass. Agentic training requires interleaving model generation with environment execution: model generates a thought+action, environment executes and returns observation, model continues. This requires custom generation loops that inject observations between model outputs. TRL's vLLM integration and multi-turn generation tools handle this, but it's more complex than one-shot GRPO.
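
The interleaving itself is a short loop. The sketch below is framework-free: `toy_policy` stands in for a call to the LLM, `ToyEnv` stands in for a real sandbox, and the `reset()` method is an assumption (the `step()` return shape matches the environment listing earlier in this guide).

```python
from typing import Callable


class ToyEnv:
    """Stand-in environment: ends the episode after three tool calls."""

    def __init__(self) -> None:
        self.calls = 0

    def reset(self) -> str:
        self.calls = 0
        return "Question: what is topic X?"

    def step(self, action_str: str) -> tuple[str, float, bool]:
        self.calls += 1
        done = self.calls >= 3
        return f"result of {action_str}", (1.0 if done else 0.0), done


def toy_policy(context: str) -> str:
    # A real policy would generate this text from the context with the LLM.
    return 'Thought: keep searching.\nAction: search(q="topic X", n=3)'


def rollout(policy: Callable[[str], str], env: ToyEnv,
            max_steps: int = 15) -> tuple[str, list[float]]:
    """Alternate model generation and environment execution."""
    context = f"Observation: {env.reset()}\n"
    rewards: list[float] = []
    for _ in range(max_steps):
        output = policy(context)                          # thought + action text
        action = output.rsplit("Action:", 1)[-1].strip()  # text after last "Action:"
        obs, reward, done = env.step(action)
        context += f"{output}\nObservation: {obs}\n"      # inject the observation
        rewards.append(reward)
        if done:
            break
    return context, rewards


trace, rewards = rollout(toy_policy, ToyEnv())
print(rewards)
```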

08

The Research Agent — Complete Example

A full 12-step episode: the agent writing a complete answer to a user question. Every action, observation, and reward shown.

Figure 5 — Research Agent: 12-Step Answer Trajectory
Complete agent trajectory from an empty draft to a finished answer. Top lane: the agent's actions (tool calls). Bottom lane: cumulative answer score, the agent's measurable progress toward the rubric. The agent calls check_answer() at key checkpoints, creating a dense reward signal from what would otherwise be a sparse end-of-episode reward.

Tool Definitions for the Research Agent

| Tool | Parameters | Returns | When to use |
|---|---|---|---|
| search() | q, n | list of doc_id + snippets | Finding candidate sources for a topic |
| read_doc() | doc_id, n_chunks | document body, length | Pulling the full text of a candidate source |
| cite() | doc_id, span | citation_id, formatted ref | Adding an in-text citation to the draft |
| add_paragraph() | text, position | paragraph_id, length | Inserting a new section into the draft |
| edit_paragraph() | paragraph_id, new_text | success, diff | Rewriting a paragraph after new evidence |
| show_answer() | (none) | current full draft | Reviewing the answer before scoring |
| check_answer() | draft | score (float), rubric breakdown | Quantitative rubric check |
| check_validity() | (none) | bool, list of issues | Catches malformed JSON, broken citations, etc. |
| list_sources() | (none) | list of cited doc_ids | Auditing which sources are already used |
| word_count() | paragraph_id | int | Checking length against a rubric limit |
| undo() | (none) | success, previous state | Reverting a bad edit |
09

Open Challenges

🐢 Slow Rollouts

Each training episode requires 10–20 real tool executions. A batch of 64 tasks at G=6 means 384 parallel environment runs. Training throughput is dominated by environment speed, not model speed. Mitigation: parallelize environments aggressively, cache common intermediate states, use faster approximate tools during training.
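
As a rough illustration of the first mitigation, the sketch below runs episodes concurrently with a thread pool, which suits rollouts that spend most of their time waiting on tools. `run_episode` is a hypothetical stand-in that sleeps instead of calling a real sandbox.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_episode(task_id: int) -> float:
    """Stand-in for one full agent episode; returns its trajectory reward."""
    time.sleep(0.01)            # placeholder for tool latency
    return float(task_id % 2)   # placeholder reward


def collect_rollouts(task_ids: list[int], max_workers: int = 8) -> list[float]:
    """Run episodes in parallel; results come back in task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_episode, task_ids))


results = collect_rollouts(list(range(6)), max_workers=3)
print(results)
```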

🌀 Long-Horizon Credit Assignment

Even with progress rewards, connecting a good early action (a well-chosen query at step 2) to the final reward at step 15 is hard. The value function must track the quality of a partial draft, a much harder prediction problem than token-level value estimation.

🧭 Exploration in Structured Space

Sampling random tokens works for LLM post-training. For agents, you need to explore meaningfully different strategies: try a different query before re-reading the same doc, try citing a new source before re-editing existing paragraphs, try restructuring the draft instead of patching it. Temperature-based token sampling alone does not generate diverse enough tool sequences.

🔒 Safety and Irreversibility

In deployed agents (not just sandboxed training), wrong tool calls can be irreversible: emails sent, files deleted, APIs called. The KL penalty helps but is not a safety guarantee. Constitutional constraints and action filtering are active research areas for production agents.

What Works Well Today

✅ The state of the art in 2025

SWE-bench (software engineering): Agents using RLVR with verifiable tests (does the code pass the unit tests?) achieve 50%+ on hard benchmarks. The verifiable reward signal makes this tractable.

Math/code with tool use: Agents that can call a Python interpreter to verify intermediate calculations have dramatically improved on complex multi-step reasoning benchmarks.

Research agents specifically: The combination of deterministic tool feedback and computable output rewards makes a rubric-scored research agent a good testbed for agentic RL. Progress rewards from check_answer() provide the dense signal that makes long-horizon credit assignment manageable.

The key pattern: Agentic RL works best when the environment provides verifiable, graded, dense feedback at every step — which is exactly what a well-designed sandbox provides.

REF

References

  1. Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629
  2. Shinn, N. et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366
  3. Yang, J. et al. (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv:2405.15793
  4. Zhou, S. et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR 2024. arXiv:2307.13854
  5. Liu, X. et al. (2023). AgentBench: Evaluating LLMs as Agents. ICLR 2024. arXiv:2308.03688
  6. Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. arXiv:2302.04761
  7. Wang, X. et al. (2024). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. arXiv:2407.16741
  8. Xie, T. et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS 2024. arXiv:2404.07972
  9. Silver, D. et al. (2016). Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature. (Foundation for MCTS+RL)
  10. Sutton, R.S. & Barto, A.G. (2018). Reinforcement Learning: An Introduction, 2nd Edition. MIT Press. (MDP/POMDP foundations)