Policy gradients, value-based methods, and the full post-training pipeline. PPO, DPO, GRPO, Q-learning, and agentic RL — the algorithms that turn a pretrained base model into the assistant you actually use.
Outcome reward models tell you whether the final answer is right. Process reward models tell you which step went wrong. Eleven sections …
Markov property, Monte Carlo, temporal difference — three definitions that evaporate every time you read them. Here they are built as one …
Ten sections on the value side of reinforcement learning. Starts with the Q-function and why it is strictly more useful than the value …
What changes for RL when each action is a tool call (search, code execution, calculators, browsers) and the reward only arrives at the end …
Everything that happens between a pretrained base model and the chatbot people end up shipping. Supervised fine-tuning, reward modeling, …
PPO is not the only option anymore. DPO, GRPO, ORPO, KTO, and a few REINFORCE variants have all picked up traction for LLM alignment, …
A walkthrough of Proximal Policy Optimization from the ground up. What the clipped ratio is doing, why importance sampling lets you reuse …
End of series.