A Series · The Notebook

Reinforcement Learning

Policy gradients, value-based methods, and the full post-training pipeline. PPO, DPO, GRPO, Q-learning, and agentic RL: the algorithms that turn a pretrained base model into the assistant you actually use.

7 Stories

~159m Total Read

2026 Last Updated

Jun 17, 2026 25 min read

Process Reward Models

Outcome reward models tell you whether the final answer is right. Process reward models tell you which step went wrong. Eleven sections …

Jun 13, 2026 25 min read

Markov, Monte Carlo, TD

Markov property, Monte Carlo, temporal difference — three definitions that evaporate every time you read them. Here they are built as one …

Jun 12, 2026 30 min read

Q-Learning

Ten sections on the value side of reinforcement learning. Starts with the Q-function and why it is strictly more useful than the value …

May 29, 2026 18 min read

Agentic RL

What changes for RL when each action is a tool call (search, code execution, calculators, browsers) and the reward only arrives at the end …

May 22, 2026 18 min read

The Post-Training Guide

Everything that happens between a pretrained base model and the chatbot people end up shipping. Supervised fine-tuning, reward modeling, …

May 15, 2026 18 min read

Beyond PPO

PPO is not the only option anymore. DPO, GRPO, ORPO, KTO, and a few REINFORCE variants have all picked up traction for LLM alignment, …

May 8, 2026 25 min read

PPO Deep Dive

A walkthrough of Proximal Policy Optimization from the ground up. What the clipped ratio is doing, why importance sampling lets you reuse …

End of series.
