RL for Language Models
What You'll Learn
- Understand the two motivations for RL in LLMs: alignment and reasoning
- Build a reward model from human preferences using the Bradley-Terry model
- Compare PPO, DPO, and GRPO — know when to use each
- Explain how GRPO with verifiable rewards produces reasoning models
- Walk through a real GRPO implementation (Karpathy’s nanochat)
- Identify open challenges: reward hacking, mode collapse, scalable oversight
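The reward-model step above rests on the Bradley-Terry model: the probability that response A beats response B is a sigmoid of their reward difference, and the reward model is trained to maximize the likelihood of the observed human preferences. A minimal sketch (the function name and reward values are illustrative, not from any real implementation):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under Bradley-Terry, P(chosen beats rejected) = sigmoid(r_chosen - r_rejected),
    so the per-pair loss is -log sigmoid(r_chosen - r_rejected).
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# If the reward model already ranks the pair correctly, the loss is small:
low = bradley_terry_loss(2.0, -1.0)
# If it ranks the pair backwards, the loss is large:
high = bradley_terry_loss(-1.0, 2.0)
```

Minimizing this loss over many preference pairs pushes the scalar reward of chosen responses above that of rejected ones, which is all the downstream RL step needs.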
Every modern LLM goes through multiple training stages. Each stage transforms how the model responds. Click through the pipeline to see the difference:
[Interactive widget: LLM Training Pipeline. Click each stage to see how responses evolve.]
What you just saw: The same model, transformed step by step. Pretraining teaches it language. Supervised fine-tuning teaches it to follow instructions. RL teaches it what humans actually want — and, more recently, how to think through hard problems.
Chapter Overview
Everything in this book leads here:
- Policy gradients taught us how to optimize policies directly; LLMs are policies over tokens.
- REINFORCE gave us the foundation; GRPO is essentially REINFORCE with a group baseline.
- Actor-critic methods introduced the advantage function; GRPO replaces the learned critic with group statistics.
- PPO provided the clipped objective that makes RLHF stable.
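The "group statistics" idea is simple enough to sketch directly: sample several completions for the same prompt, score each one, and standardize the rewards within the group to get per-completion advantages. A minimal illustration (the function name is ours, and real GRPO additionally applies a PPO-style clipped policy update using these advantages):

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Advantages from group statistics alone, with no learned critic.

    Each completion's advantage is its reward standardized against the
    mean and standard deviation of its own sampling group.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    eps = 1e-8  # guard against all rewards in the group being identical
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled answers to one prompt, verifiable reward 1.0 if correct.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions land above the group mean and get positive advantages; incorrect ones get negative advantages. The group itself plays the role the critic network plays in PPO.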
RL for language models is where these techniques become a transformative technology.
Begin with Why RL for Language Models? to understand both motivations — alignment and reasoning — and how the pieces fit together.
Key Takeaways
- RL for LLMs has two motivations: alignment (steering models toward what humans want) and reasoning (eliciting new problem-solving capabilities)
- RLHF uses human preferences to train a reward model, then optimizes with PPO
- GRPO eliminates the critic network and can train on verifiable rewards instead of a learned reward model
- The algorithm evolution — PPO to DPO to GRPO — trends toward simplicity and scalability
- DeepSeek-R1 showed that RL can produce emergent reasoning capabilities
- Open challenges include reward hacking, mode collapse, and aligning systems smarter than us