Chapter 304
📝Draft

RL for Language Models

From RLHF to reasoning: how RL transforms language models


What You'll Learn

  • Understand the two motivations for RL in LLMs: alignment and reasoning
  • Build a reward model from human preferences using the Bradley-Terry model
  • Compare PPO, DPO, and GRPO — know when to use each
  • Explain how GRPO with verifiable rewards produces reasoning models
  • Walk through a real GRPO implementation (Karpathy’s nanochat)
  • Identify open challenges: reward hacking, mode collapse, scalable oversight
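Before diving in, here is a minimal, self-contained sketch of the Bradley-Terry preference objective the chapter builds toward. The function name and the scalar scores are illustrative; in practice `r_chosen` and `r_rejected` would be produced by a reward model scoring two responses to the same prompt:

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) =
    sigmoid(r_chosen - r_rejected), where the r's are scalar reward scores.
    """
    p_chosen = 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))
    return -math.log(p_chosen)

# The loss shrinks as the reward model ranks the chosen response higher:
print(bradley_terry_loss(2.0, 0.5))  # model agrees with the human label
print(bradley_terry_loss(0.5, 2.0))  # model disagrees: much larger loss
```

Minimizing this loss over a dataset of human preference pairs is what turns raw comparisons into a trainable reward signal.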

Every modern LLM goes through multiple training stages. Each stage transforms how the model responds. Click through the pipeline to see the difference:

LLM Training Pipeline

Click each stage to see how responses evolve

PROMPT
What is the capital of France?
RESPONSE (after Pretraining)
The capital of France is Paris, which is also the largest city in the country. Paris is known for its iconic landmarks such as the Eiffel Tower. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is
The model has absorbed vast knowledge from the internet, but it just predicts the next token. It rambles, continues listing facts, and doesn't know when to stop.
Model Capabilities: Helpfulness 25% · Safety 20% · Reasoning 15%

What you just saw: The same model, transformed step by step. Pretraining teaches it language. Supervised fine-tuning teaches it to follow instructions. RL teaches it what humans actually want — and, more recently, how to think through hard problems.

Chapter Overview

ℹ️The Journey Comes Full Circle

Everything in this book leads here. Policy gradients taught us how to optimize policies directly — LLMs are policies over tokens. REINFORCE gave us the foundation — GRPO is essentially REINFORCE with group baselines. Actor-Critic methods introduced the advantage function — GRPO replaces the learned critic with group statistics. PPO provided the clipped objective that makes RLHF stable.
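The "group baseline" idea can be shown concretely. The sketch below computes GRPO-style advantages by normalizing each sampled completion's reward against its group's mean and standard deviation, with no learned value network; the full GRPO objective also adds PPO-style clipping and a KL penalty, which are omitted here:

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """GRPO-style advantages for one prompt's group of sampled completions.

    Instead of querying a learned critic, normalize each reward by the
    group's mean and standard deviation (population std, as in GRPO).
    """
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard: all-equal rewards
    return [(r - mean) / std for r in group_rewards]

# Four completions sampled for one prompt; a verifier marked one correct:
rewards = [0.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))
```

The advantages sum to zero within each group: the correct completion is pushed up, the others are pushed down, which is exactly the role the critic's baseline played in Actor-Critic methods.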

RL for language models is where these techniques become a transformative technology.

💡Start Here

Begin with Why RL for Language Models? to understand both motivations — alignment and reasoning — and how the pieces fit together.


Key Takeaways

  • RL for LLMs has two motivations: alignment (steering models toward what humans want) and reasoning (developing new capabilities)
  • RLHF uses human preferences to train a reward model, then optimizes with PPO
  • GRPO eliminates the critic network and trains with verifiable rewards instead
  • The algorithm evolution — PPO to DPO to GRPO — trends toward simplicity and scalability
  • DeepSeek R1 showed that RL can produce emergent reasoning capabilities
  • Open challenges include reward hacking, mode collapse, and aligning systems smarter than us