AI Notebook
A Series · The Notebook

Reinforcement Learning

Policy gradients, value-based methods, and the full post-training pipeline. PPO, DPO, GRPO, Q-learning, and agentic RL — the algorithms that turn a pretrained base model into the assistant you actually use.

7 stories · ~159 min total read · Last updated 2026

