Agentic RL
When the model stops generating text and starts acting on the world. Multi-step. Tool-using. Self-correcting. And much harder to train — here is why, and how.
What Changes When the Model Acts
In the PPO and GRPO guides, the model generated a sequence of tokens and received a single scalar reward at the end. Everything happened inside the model. There was no outside world.
In agentic RL, the model reaches out and touches external systems: it executes code, calls APIs, queries databases, renders output, browses the web, writes files. Each action changes the state of the world. The model then observes what happened and decides what to do next. This loop repeats until the task is done.
| Aspect | LLM Post-Training (PPO/GRPO) | Agentic RL |
|---|---|---|
| What an action is | One token from a 30k+ vocabulary | A structured tool call with parameters |
| State | Prompt + generated tokens so far | Conversation history + external world state |
| Observation | Previous tokens (the model itself generated) | Return value from executing a tool |
| Episode length | 50–500 tokens, one pass | 5–50+ tool-call steps, multi-turn |
| Reward timing | Single score at sequence end | Score at task completion (many steps later) |
| Side effects | None — text only, no world change | Real: files written, code run, output built |
| Self-correction | Cannot — generation is one-shot | Yes — can observe result and adjust |
| Credit assignment | Hard (50–500 tokens) | Much harder (many steps, external state) |
| Rollout cost | Cheap — just forward passes | Expensive — real tool executions |
One-shot model: Reads the user's question. Generates one long answer in a single pass. Gets scored at the end. Done. Cannot check whether the partial answer was actually correct along the way.
Research agent: Reads the question. Calls search(q="topic") and sees a list of source links. Calls read_doc(doc_id) and gets the body. Calls check_answer(draft) and gets a quality score like "0.45, weak coverage". Revises the draft. Searches again with a tighter query, cites the new source, runs the check again. Loops until the score crosses the threshold. This is how a person actually researches an answer.
Notation — The Agentic MDP
All PPO and GRPO notation carries over unchanged. New symbols added for the agentic setting:
| Symbol | Name | Meaning |
|---|---|---|
| ot | Observation at step t | What the agent receives from the environment after taking the previous action at−1. For text: current draft, validity string, answer score, sources list, etc. |
| at | Agent action at step t | Now a structured tool call, not a single token. E.g. read_doc(query_0, h=20). Contains a tool name and parameters. |
| ht | History at step t | Everything the agent has seen: ht = (o0, a0, o1, a1, …, ot). This is the agent's full context. The policy conditions on ht, not just st. |
| τ | Trajectory | A complete episode: τ = (o0, a0, o1, a1, …, oT, R). T steps, each with one action and one observation, terminated by a reward. |
| T | Horizon | Maximum number of agent steps per episode. For research agents: 10–25. For software engineering agents: up to 50+. Counts tool-call steps, not tokens as the "T" in PPO did; each step spans many tokens. |
| E | Environment | The external system that accepts actions and returns observations. For text: the sandbox + renderer. Stateful: changes persist across steps within an episode. |
| R(τ) | Trajectory reward | Scalar reward for the complete episode. May be a single outcome reward at the end, or a sum of per-step rewards: R(τ) = Σt rt. |
| rt | Step reward | Reward at step t. Often 0 for all steps except the last. Can be non-zero for intermediate steps (progress reward, error penalty). |
| 𝒜 | Action space | The set of all possible tool calls. Unlike the token vocabulary (discrete, 30k items), 𝒜 is structured: finite tool names but continuous or discrete parameters. |
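The notation above maps onto a small data structure. The following is a minimal sketch, assuming a trajectory is stored as an initial observation plus a list of steps; the class and field names (`Step`, `Trajectory`, `initial_observation`) are illustrative and not taken from any library.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of an episode: a_t, the observation it produced, and r_t."""
    action: str           # a_t, the tool call as text
    observation: str      # o_{t+1}, what the environment returned
    reward: float = 0.0   # r_t, often 0 until the final step


@dataclass
class Trajectory:
    """tau = (o_0, a_0, o_1, a_1, ..., o_T) plus its rewards."""
    initial_observation: str                      # o_0, the task prompt
    steps: list[Step] = field(default_factory=list)

    def history(self) -> list[str]:
        """h_t flattened as [o_0, a_0, o_1, a_1, ..., o_t]."""
        h = [self.initial_observation]
        for s in self.steps:
            h += [s.action, s.observation]
        return h

    def total_reward(self) -> float:
        """R(tau) = sum over t of r_t."""
        return sum(s.reward for s in self.steps)


traj = Trajectory("Question: what is X?")
traj.steps.append(Step('search(q="X")', "3 results returned", 0.0))
traj.steps.append(Step('read_doc(doc_id="doc_017")', "body returned", 0.1))
traj.steps.append(Step("check_answer(draft)", "score 0.8", 1.0))
print(len(traj.history()), traj.total_reward())
```

Keeping the history as a flat list mirrors how the policy actually consumes it: as one growing text context.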
The POMDP: Partial Observability
In classic RL, the agent sees the full environment state st. In agentic settings, the agent only sees what tools return. This is a Partially Observable MDP (POMDP). The agent must maintain its own internal state estimate from accumulated observations.
For a research agent: the true state includes the full reasoning chain, every source ID, and the properties of the underlying material. But the agent only receives what it explicitly queries: a current draft, an answer score, a validity flag. It must infer the rest from what it has seen so far.
The Agent Loop
Every agentic system is built around one cycle: Observe → Reason → Act → Observe. The model is the reasoning and acting component. The environment handles state and consequences.
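The cycle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production loop: the `Env` interface (`reset`, `step`), the `run_episode` helper, and the `ToyEnv` stand-in are all hypothetical names, and the policy is a plain function from history to an action string.

```python
from typing import Callable, Protocol


class Env(Protocol):
    """Minimal environment interface assumed for this sketch."""

    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, float, bool]: ...


def run_episode(policy: Callable[[list[str]], str], env: Env, max_steps: int = 25):
    """Observe, reason and act, observe again, until done or the horizon is hit."""
    history = [env.reset()]                    # o_0
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(history)               # model reasons and emits a tool call
        obs, reward, done = env.step(action)   # environment applies it
        history += [action, obs]               # history grows by (a_t, o_{t+1})
        total_reward += reward
        if done:
            break
    return history, total_reward


class ToyEnv:
    """Stand-in environment: each improve() call raises a score by 0.5."""

    def reset(self) -> str:
        self.score = 0.0
        return "Task: raise the score to 1.0"

    def step(self, action: str) -> tuple[str, float, bool]:
        if action == "improve()":
            self.score += 0.5
        done = self.score >= 1.0
        return f"score={self.score}", (1.0 if done else 0.0), done


history, total = run_episode(lambda h: "improve()", ToyEnv())
print(len(history), total)
```

In a real system the lambda would be replaced by a call to the LLM, with the history rendered into its context window.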
The ReAct Pattern: Think Before You Act
The dominant agent architecture wraps each action in a chain-of-thought reasoning step. Before outputting the tool call, the model generates a thought — a scratchpad that makes implicit reasoning explicit and helps the model plan better actions.
Observation: [Current draft answers the main question but is missing specifics on follow-ups.]
Answer score against rubric: 0.34 (66% match)
Thought: The base answer covers the high-level definition. The rubric lists four
follow-up topics I have not addressed yet. I should search for one of them
and pull a paragraph that I can cite.
Action: search(q="follow-up topic 1", n=3)
Observation: 3 results returned. Top result id=doc_017, snippet 28 tokens.
Thought: doc_017 looks directly relevant. Read it and add a sentence to the draft
that cites this source.
Action: read_doc(doc_id="doc_017", n=20)
Observation: Document body returned. Draft updated with one new sentence + citation.
10 paragraphs, 24 cited sources so far. Still well formed.
The thought is generated autoregressively by the LLM and is part of the action token sequence. It is not a separate model — it is the same policy πθ generating reasoning text before the structured tool call.
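Because thought and action arrive as one generated string, the environment side has to separate them before executing anything. A minimal sketch of that split, assuming the literal `Thought:` and `Action:` markers used in the trace above and a single-line tool call (`split_react_step` is an illustrative name):

```python
import re

# Lazily capture the thought up to the first "Action:" marker.
STEP_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.+)", re.DOTALL
)


def split_react_step(text: str) -> tuple[str, str]:
    """Return (thought, action) from one 'Thought: ... Action: ...' block."""
    m = STEP_RE.search(text)
    if m is None:
        raise ValueError("expected 'Thought: ... Action: ...'")
    thought = " ".join(m.group("thought").split())        # collapse line wraps
    action = m.group("action").strip().splitlines()[0]    # first line only
    return thought, action


generated = (
    "Thought: doc_017 looks directly relevant. Read it and add a sentence to the draft\n"
    "that cites this source.\n"
    'Action: read_doc(doc_id="doc_017", n=20)'
)
thought, action = split_react_step(generated)
print(action)
```

Only the action string is sent to the environment; the thought stays in the context so later steps can condition on it.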
Credit Assignment Across Long Horizons
We already saw in PPO that per-token credit assignment is hard when the reward only arrives at the final token. Agentic RL makes this dramatically worse.
Why It Gets Exponentially Harder
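For PPO with γ=1, the return at position t was Rt ≈ r(x,y) for all t: nearly flat, which is manageable. For an agentic trajectory with discount γ<1 and per-step rewards rk, the return at step t is Rt = Σk≥t γ^(k−t) rk, so a reward that arrives only at the final step T reaches step t scaled by γ^(T−t).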
Two compounding problems, and one mitigating factor:
With 15 steps at γ=0.95, step 0 only sees 46% of the final reward (0.95^15 ≈ 0.46). The value function at step 0 must make predictions from an early partial state (just the target question), which carries almost no information about whether the final answer will be accurate.
The value function Vψ(ht) must predict returns from the full history ht — not just token embeddings but current drafts, answer scores, tool outputs. The input space is orders of magnitude larger than token-level PPO.
The observation at each step is informative. Unlike token-level PPO where intermediate tokens give no task signal, a research agent's check_answer(target) call returns a direct quality estimate at every step. This makes per-step reward design possible — and is exactly what makes agentic RL more tractable than it might seem from the credit assignment analysis alone.
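The 46% figure is plain discounting arithmetic. A short sketch, assuming a single terminal reward and a 15-step episode (`terminal_reward_share` is an illustrative helper name):

```python
def terminal_reward_share(gamma: float, steps_to_end: int) -> float:
    """Fraction of a terminal reward that reaches a step this far from the end."""
    return gamma ** steps_to_end


HORIZON = 15
for t in (0, 5, 10, 14):
    share = terminal_reward_share(0.95, HORIZON - t)
    print(f"step {t:2d}: {share:.2f} of the final reward")
# step 0 prints 0.46, matching the figure quoted above
```

The later a step sits in the episode, the larger its share, which is one reason early actions are the hardest to credit correctly.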
The Action Space
In token-level PPO, every action is a draw from a 30,000-token vocabulary — a flat discrete distribution. In agentic RL, actions are structured tool calls with semantic meaning and typed parameters.
Handling the Action as Text
The agent still generates text with the same LLM — there is no separate "action decoder". The model generates a string like read_doc(query_0, h=20, direction="up") autoregressively, token by token. This string is then parsed into a structured call and executed by the environment. Invalid strings (malformed syntax, unknown tools, out-of-range parameters) result in an error observation and optionally a small penalty reward.
class ParseError(Exception):
    """Raised when an action string cannot be parsed into a tool call."""


class ExecutionError(Exception):
    """Raised when a parsed tool call fails inside the sandbox."""


class ToolEnv:
    """The environment that accepts agent tool calls and returns observations."""

    TOOLS = {
        "query": {"params": ["shape", "w", "h", "cx", "cy", "r"]},
        "read_doc": {"params": ["query_id", "height", "direction", "operation"]},
        "citation": {"params": ["source_id", "radius"]},
        "boolean_subtract": {"params": ["body_a", "body_b"]},
        "render_current": {"params": []},
        "check_answer": {"params": ["target_path"]},
        "check_validity": {"params": []},
        "get_edge_list": {"params": []},
        "measure": {"params": ["source_id", "axis"]},
        "undo": {"params": []},
        "reset": {"params": []},
    }

    def __init__(self):
        # Most recent check_answer distance; None until the first check.
        self.last_check_answer = None

    # _parse, _execute, _format_observation and _is_terminal are
    # environment-specific and are not shown here.

    def step(self, action_str: str) -> tuple[str, float, bool]:
        """
        Execute one agent action.
        Returns: (observation_str, step_reward, is_terminal)
        """
        try:
            tool, args = self._parse(action_str)          # parse text → structured call
            result = self._execute(tool, args)            # run in sandbox
            obs = self._format_observation(tool, result)  # format result as text
            r_step = self._step_reward(tool, result)      # small immediate reward
            done = self._is_terminal(result)
        except ParseError:
            obs = f"ERROR: Could not parse '{action_str}'. Valid tools: {list(self.TOOLS)}"
            r_step = -0.05  # small penalty for invalid action
            done = False
        except ExecutionError as e:
            obs = f"EXECUTION ERROR: {e}. Operation was not applied."
            r_step = -0.02
            done = False
        return obs, r_step, done

    def _step_reward(self, tool: str, result) -> float:
        """Immediate per-step reward signal (small, informative)."""
        if tool == "check_answer":
            # Progress reward: how much did the distance shrink since the last check?
            # Compare against None explicitly so a previous distance of 0.0 is kept.
            prev = 1.0 if self.last_check_answer is None else self.last_check_answer
            curr = result["distance"]
            self.last_check_answer = curr
            return max(0.0, (prev - curr) * 2.0)  # reward for improvement
        if tool == "citation" and result["valid"]:
            return 0.02  # small reward for a valid finishing op
        return 0.0
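The `_parse` step is left undefined above. One way to sketch it is with Python's `ast` module, which parses call syntax without evaluating anything. This is an assumption about how parsing could work, not the environment's actual parser; `parse_action` and `_value` are illustrative names, and bare identifiers such as `query_0` are kept as strings.

```python
import ast


class ParseError(Exception):
    """Raised when an action string is not a valid tool call."""


def _value(node: ast.AST):
    """Convert an argument node to a Python value without executing code."""
    if isinstance(node, ast.Name):       # bare identifier, e.g. query_0
        return node.id
    return ast.literal_eval(node)        # numbers, strings, lists, ...


def parse_action(action_str: str, known_tools: set[str]) -> tuple[str, list, dict]:
    """Parse 'tool(arg, key=value)' into (tool, positional args, keyword args)."""
    try:
        node = ast.parse(action_str.strip(), mode="eval").body
    except (SyntaxError, ValueError) as exc:
        raise ParseError(f"malformed call: {action_str!r}") from exc
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ParseError("expected tool_name(arg, key=value, ...)")
    if node.func.id not in known_tools:
        raise ParseError(f"unknown tool: {node.func.id}")
    try:
        args = [_value(a) for a in node.args]
        kwargs = {}
        for kw in node.keywords:
            if kw.arg is None:           # reject **kwargs unpacking
                raise ParseError("keyword unpacking is not allowed")
            kwargs[kw.arg] = _value(kw.value)
    except ValueError as exc:            # literal_eval rejects non-literals
        raise ParseError(f"unsupported argument in {action_str!r}") from exc
    return node.func.id, args, kwargs


print(parse_action('read_doc(query_0, h=20, direction="up")', {"read_doc"}))
```

Because nothing is evaluated, a model that emits something like `read_doc(__import__("os"))` gets a parse error rather than code execution.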
Reward Design for Agents
Reward design is the hardest part of agentic RL. Unlike one-shot generation where you score a complete response, the agent produces an evolving intermediate state. What do you reward and when?
The Four Reward Components
Outcome reward: Did the task succeed? For text: is the final answer semantically accurate? This is the primary signal. It is sparse (it arrives only at episode end) but the most reliable. R_outcome = 0.5·accuracy + 0.4·validity + 0.1·efficiency.
Progress reward: Are we getting closer? For text: reward proportional to answer score improvement each step. Dense signal that helps with credit assignment. Risk: the agent learns to maximize check_answer improvement, not final quality.
Efficiency penalty: Fewer steps is better. Penalty of −0.01 per step beyond a minimum. Prevents the agent from calling check_answer() in a loop. Important for deployment cost.
Error penalty: Penalty for unparseable tool calls (−0.05), failed executions (−0.02), and references to non-existent query IDs. Helps the model learn the action grammar quickly without wasting training on clearly wrong actions.
For open-ended language tasks, measuring "are we halfway there?" is subjective and expensive. When a rubric checker is available, check_answer(target) returns an exact float at any step the agent calls it. This makes rubric-scored text tasks significantly easier to train on than agentic coding or free-form writing, where intermediate progress is much harder to quantify.
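The four components can be combined into one trajectory score. The sketch below uses the weights and penalties quoted in this section; how the pieces are summed, the `min_steps` value, and the function names are assumptions made for illustration.

```python
def outcome_reward(accuracy: float, validity: float, efficiency: float) -> float:
    """R_outcome = 0.5*accuracy + 0.4*validity + 0.1*efficiency."""
    return 0.5 * accuracy + 0.4 * validity + 0.1 * efficiency


def trajectory_reward(
    accuracy: float,
    validity: float,
    efficiency: float,
    progress: list[float],     # per-step progress rewards
    n_steps: int,
    n_parse_errors: int,
    n_exec_errors: int,
    min_steps: int = 5,        # assumed free-step budget
) -> float:
    """Outcome + progress + step penalty + error penalty (simple sum, assumed)."""
    step_penalty = -0.01 * max(0, n_steps - min_steps)
    error_penalty = -0.05 * n_parse_errors - 0.02 * n_exec_errors
    return (
        outcome_reward(accuracy, validity, efficiency)
        + sum(progress)
        + step_penalty
        + error_penalty
    )


score = trajectory_reward(
    accuracy=0.8, validity=1.0, efficiency=0.5,
    progress=[0.1, 0.2], n_steps=8, n_parse_errors=1, n_exec_errors=2,
)
print(round(score, 2))
```

In practice the progress term is often scaled down or capped so that it cannot outweigh the outcome term, which limits the reward-hacking risk noted above.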
Key Algorithms
ReAct — Reasoning + Acting (Yao et al., 2022)
The foundational pattern. No RL required — just prompting. The key insight is that interleaving chain-of-thought reasoning with tool calls makes agents dramatically more accurate than calling tools blindly.
ReAct is the zero-shot baseline for any agent. Before training with RL, test your agent in ReAct mode — it tells you whether the environment and tools are well-designed. If ReAct can't solve even easy cases, the reward signal or action space needs fixing first.
Reflexion — Verbal Reinforcement (Shinn et al., 2023)
After a failed episode, the model generates a verbal self-critique: "What did I do wrong? What should I try differently?" This reflection is prepended to the context for the next episode attempt. No gradient updates — it's in-context learning from failure.
--- Episode 1 failed (check_answer = 0.52, target < 0.1) ---
Reflection: I only pulled 2 of the 4 follow-up topics before timing out, so the rubric
kept docking the answer for missing coverage. I also stopped citing after
the third source even though doc_017 had more relevant content I never used.
Next attempt: pull all 4 follow-up topics first, then cite from each, and
do not start drafting until I have a source per bullet point.
--- Episode 2 (with reflection in context) ---
Action: list_rubric_items(question)
Observation: Rubric has 4 required follow-up bullets.
Action: search(q="follow-up bullet 1", n=5)
...
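The control flow behind a trace like this is a retry loop with a growing list of reflections. A minimal sketch, assuming the attempt returns a trace plus a distance where lower is better; `reflexion_loop` and both stub functions are hypothetical stand-ins for LLM calls.

```python
from typing import Callable


def reflexion_loop(
    attempt: Callable[[str], tuple[str, float]],
    reflect: Callable[[str, float], str],
    task: str,
    target: float = 0.1,
    max_attempts: int = 3,
) -> tuple[bool, int, list[str]]:
    """Retry a task, prepending one verbal reflection per failed attempt."""
    reflections: list[str] = []
    for n in range(1, max_attempts + 1):
        context = "\n".join(reflections + [task])     # reflections come first
        trace, distance = attempt(context)
        if distance < target:                          # lower is better here
            return True, n, reflections
        reflections.append(reflect(trace, distance))   # no gradient update
    return False, max_attempts, reflections


# Stubs standing in for the LLM: the attempt only succeeds once a
# reflection is present in its context.
def fake_attempt(context: str) -> tuple[str, float]:
    return ("trace", 0.05) if "Reflection:" in context else ("trace", 0.52)


def fake_reflect(trace: str, distance: float) -> str:
    return "Reflection: pull all 4 follow-up topics before drafting."


ok, attempts, notes = reflexion_loop(fake_attempt, fake_reflect, "Task: answer the question")
print(ok, attempts, len(notes))
```

The model weights never change inside this loop; everything learned lives in the `reflections` list.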
RLVR for Agents — The Modern Standard
Apply GRPO/PPO directly to agentic trajectories with verifiable outcome rewards. Same algorithm as the LLM post-training guides, but the "episode" is now a full multi-step trajectory instead of a single generation pass.
Using GRPO for agents: sample G trajectories for the same task (same target question), compute trajectory rewards R(τ1), …, R(τG), normalize to get advantages, update policy with PPO clip. The group mean of trajectory rewards is the critic-free baseline.
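The normalization step can be sketched in a few lines. This assumes the common form of subtracting the group mean and dividing by the group standard deviation; the population standard deviation and the `eps` term are choices made for this sketch, and implementations differ on both.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each trajectory relative to its own group of G samples."""
    mean = statistics.fmean(rewards)      # critic-free baseline
    std = statistics.pstdev(rewards)      # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# G = 4 trajectories for the same task, two of which succeeded.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Every token in a trajectory then shares that trajectory's advantage. If all G trajectories receive the same reward, every advantage is zero and the task contributes no gradient.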
Tree Search — Planning Ahead
Instead of greedily executing actions one at a time, tree search simulates multiple possible futures before committing to an action. Monte Carlo Tree Search (MCTS) is the most common variant.
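The selection step of MCTS is commonly implemented with the UCT rule, which balances a child's average value against how rarely it has been tried. The sketch below covers only that step, not expansion, simulation, or backup; the `Node` fields and the exploration constant `c=1.4` are assumptions.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    action: str                    # tool call that leads to this node
    visits: int = 0
    total_value: float = 0.0       # sum of returns backed up through this node
    children: list["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Mean value plus an exploration bonus that shrinks with visits."""
    if child.visits == 0:
        return math.inf            # always try unvisited actions first
    exploit = child.total_value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select_child(parent: Node) -> Node:
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits))


root = Node("root", visits=10, children=[
    Node('search(q="topic")', visits=8, total_value=4.0),
    Node('read_doc(doc_id="doc_017")', visits=2, total_value=1.2),
])
print(select_child(root).action)
```

For agents, each simulated future costs real tool executions unless the environment can be snapshotted and restored, which is what usually limits search depth.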
Training the Agent
The training loop is PPO/GRPO at the trajectory level, but rollout collection is much more expensive and requires parallel environment management.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

# ── Multi-Turn Reward Function ────────────────────────────────────────────────
# extract_final_answer, check_validity, text_reward, efficiency_reward,
# count_steps, error_penalty and count_invalid_calls are task-specific helpers
# that you must supply; they are not part of TRL.
def text_agent_reward(completions, prompts, target_text=None, **kwargs):
    """
    completions: list of full conversation strings (all turns joined)
    target_text: dataset column forwarded by the trainer, one entry per completion
    Each completion is a multi-turn trace:
    "Thought: ... Action: search(...) Obs: ... Thought: ... Action: ..."
    We parse the final answer state and compute the outcome reward.
    """
    # Extra dataset columns arrive as lists aligned with completions.
    targets = target_text if target_text is not None else [None] * len(completions)
    rewards = []
    for conv, target in zip(completions, targets):
        # Parse the final state from the trajectory
        final_answer = extract_final_answer(conv)  # your text parser
        if final_answer is None:
            rewards.append(0.0)
            continue
        r_valid = check_validity(final_answer)
        r_acc = text_reward(final_answer, target)
        r_steps = efficiency_reward(count_steps(conv))
        r_errors = error_penalty(count_invalid_calls(conv))
        rewards.append(0.5 * r_acc + 0.4 * r_valid + 0.1 * r_steps + r_errors)
    return rewards

# ── Multi-Turn Dataset ────────────────────────────────────────────────────────
# Each example contains the initial prompt (system + target question description)
# in a "prompt" column, plus a "target_text" column used by the reward function.
# The model generates the full multi-turn conversation: thoughts + tool calls
# The environment responses are injected between turns during generation

# ── Config ────────────────────────────────────────────────────────────────────
config = GRPOConfig(
    output_dir="./llm-agent-grpo",
    num_generations=6,            # G=6 trajectories per task (expensive!)
    max_completion_length=2048,   # full multi-turn trace budget
    temperature=0.9,              # need diverse trajectories
    beta=0.02,                    # KL coefficient; lighter KL for longer generations
    learning_rate=1e-7,           # very conservative for agentic tasks
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=3,
    bf16=True,
)

model = AutoModelForCausalLM.from_pretrained("your-llm-sft-model", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("your-llm-sft-model")

trainer = GRPOTrainer(
    model=model,
    reward_funcs=text_agent_reward,
    args=config,
    train_dataset=text_task_dataset,  # each example: target question + task description
    processing_class=tokenizer,
)
trainer.train()
Standard GRPO generates a complete response in one pass. Agentic training requires interleaving model generation with environment execution: the model generates a thought and an action, the environment executes it and returns an observation, and the model continues. This requires a custom generation loop that injects observations between model outputs. TRL's vLLM integration can speed up the generation side, but multi-turn support varies by version, and the overall setup is more complex than one-shot GRPO.
The Research Agent — Complete Example
A full 12-step episode: the agent writing a complete answer to a user question. Every action, observation, and reward shown.
Tool Definitions for the Research Agent
| Tool | Parameters | Returns | When to use |
|---|---|---|---|
| search() | q, n | list of doc_id + snippets | Finding candidate sources for a topic |
| read_doc() | doc_id, n_chunks | document body, length | Pulling the full text of a candidate source |
| cite() | doc_id, span | citation_id, formatted ref | Adding an in-text citation to the draft |
| add_paragraph() | text, position | paragraph_id, length | Inserting a new section into the draft |
| edit_paragraph() | paragraph_id, new_text | success, diff | Rewriting a paragraph after new evidence |
| show_answer() | — | current full draft | Reviewing the answer before scoring |
| check_answer() | draft | score (float), rubric breakdown | Quantitative rubric check |
| check_validity() | — | bool, list of issues | Catches malformed JSON, broken citations, etc. |
| list_sources() | — | list of cited doc_ids | Auditing which sources are already used |
| word_count() | paragraph_id | int | Checking length against a rubric limit |
| undo() | — | success, previous state | Reverting a bad edit |
Open Challenges
Rollout cost: Each training episode requires 10–20 real tool executions. A batch of 64 tasks at G=6 means 384 trajectories, each running its own tool calls. Training throughput is dominated by environment speed, not model speed. Mitigation: parallelize environments aggressively, cache common intermediate states, and use faster approximate tools during training.
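Parallelizing environments can be sketched with a thread pool, on the assumption that rollouts spend most of their time waiting on tool calls (I/O-bound work). `collect_rollouts` and `fake_rollout` are hypothetical names; a real rollout function would run a full episode and return its reward.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def collect_rollouts(
    rollout_fn: Callable[[str, int], float],
    tasks: list[str],
    group_size: int,
    max_workers: int = 32,
) -> dict[str, list[float]]:
    """Run group_size rollouts per task concurrently and group rewards by task."""
    jobs = [(task, g) for task in tasks for g in range(group_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda job: rollout_fn(*job), jobs))  # keeps order
    grouped: dict[str, list[float]] = {task: [] for task in tasks}
    for (task, _), reward in zip(jobs, results):
        grouped[task].append(reward)
    return grouped


def fake_rollout(task: str, sample_idx: int) -> float:
    """Stand-in for a full episode; returns a dummy reward."""
    return float(sample_idx % 2)


tasks = [f"task_{i}" for i in range(64)]
grouped = collect_rollouts(fake_rollout, tasks, group_size=6)
print(sum(len(v) for v in grouped.values()))  # 64 * 6 = 384
```

If the tools are CPU-bound rather than I/O-bound, separate processes or remote sandbox workers are the more likely fit.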
Credit assignment: Even with progress rewards, connecting a good early action (a well-chosen query at step 2) to the final reward at step 15 is hard. The value function must track the quality of a partial draft, a much harder prediction problem than token-level value estimation.
Exploration: Sampling random tokens works for LLM post-training. For agents, you need to explore meaningfully different strategies: try a different query before re-reading the same doc, try citing a new source before re-editing existing paragraphs, try restructuring the draft instead of patching it. Temperature-based token sampling alone does not generate diverse enough tool sequences.
Safety: In deployed agents (not just training sandboxes), wrong tool calls can be irreversible: emails sent, files deleted, APIs called. The KL penalty helps but is not a safety guarantee. Constitutional constraints and action filtering are active research areas for production agents.
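The simplest form of action filtering is a gate that sits between the parser and the executor. The sketch below is illustrative only and not a safety guarantee; the tool names in `IRREVERSIBLE_TOOLS` and the `gate_action` helper are hypothetical.

```python
# Hypothetical names for tools whose effects cannot be rolled back.
IRREVERSIBLE_TOOLS = {"send_email", "delete_file"}


def gate_action(tool: str, approved: bool = False) -> tuple[bool, str]:
    """Decide whether a parsed tool call may be executed.

    Returns (allowed, observation); a blocked call yields an observation
    the agent can read and react to instead of a side effect.
    """
    if tool in IRREVERSIBLE_TOOLS and not approved:
        return False, f"BLOCKED: {tool} is irreversible and needs approval."
    return True, "OK"


print(gate_action("read_doc"))
print(gate_action("delete_file"))
print(gate_action("delete_file", approved=True))
```

Returning the block as an observation keeps the episode going, so the policy can learn to avoid or escalate such calls.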
What Works Well Today
SWE-bench (software engineering): Agents using RLVR with verifiable tests (does the code pass the unit tests?) achieve 50%+ on hard benchmarks. The verifiable reward signal makes this tractable.
Math/code with tool use: Agents that can call a Python interpreter to verify intermediate calculations have dramatically improved on complex multi-step reasoning benchmarks.
Text specifically: The combination of deterministic tool feedback + computable output rewards makes text an ideal testbed for agentic RL. Progress rewards from check_answer() provide the dense signal that makes long-horizon credit assignment manageable.
The key pattern: Agentic RL works best when the environment provides verifiable, graded, dense feedback at every step — which is exactly what a well-designed sandbox provides.
References
- Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629
- Shinn, N. et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366
- Yang, J. et al. (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv:2405.15793
- Zhou, S. et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR 2024. arXiv:2307.13854
- Liu, X. et al. (2023). AgentBench: Evaluating LLMs as Agents. ICLR 2024. arXiv:2308.03688
- Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. arXiv:2302.04761
- Wang, X. et al. (2024). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. arXiv:2407.16741
- Xie, T. et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS 2024. arXiv:2404.07972
- Silver, D. et al. (2016). Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature. (Foundation for MCTS+RL)
- Sutton, R.S. & Barto, A.G. (2018). Reinforcement Learning: An Introduction, 2nd Edition. MIT Press. (MDP/POMDP foundations)