Chapter 304
📝Draft

RL for Language Models

From RLHF to reasoning: how RL transforms language models


What You'll Learn

  • Understand the two motivations for RL in LLMs: alignment and reasoning
  • Build a reward model from human preferences using the Bradley-Terry model
  • Compare PPO, DPO, and GRPO — know when to use each
  • Explain how GRPO with verifiable rewards produces reasoning models
  • Walk through a real GRPO implementation (Karpathy’s nanochat)
  • Identify open challenges: reward hacking, mode collapse, scalable oversight
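Before diving in, here is a minimal, self-contained sketch of the Bradley-Terry preference objective the chapter builds toward. The function name and the scalar scores are illustrative; in practice `r_chosen` and `r_rejected` would be produced by a reward model scoring two responses to the same prompt:

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) =
    sigmoid(r_chosen - r_rejected), where the r's are scalar reward scores.
    """
    p_chosen = 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))
    return -math.log(p_chosen)

# The loss shrinks as the reward model ranks the chosen response higher:
print(bradley_terry_loss(2.0, 0.5))  # model agrees with the human label
print(bradley_terry_loss(0.5, 2.0))  # model disagrees: much larger loss
```

Minimizing this loss over a dataset of human preference pairs is what turns raw comparisons into a trainable reward signal.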

Every modern LLM goes through multiple training stages. Each stage transforms how the model responds. Click through the pipeline to see the difference:

LLM Training Pipeline

Click each stage to see how responses evolve

PROMPT
What is the capital of France?
RESPONSE (after Pretraining)
The capital of France is Paris, which is also the largest city in the country. Paris is known for its iconic landmarks such as the Eiffel Tower. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is
The model has absorbed vast knowledge from the internet, but it just predicts the next token. It rambles, continues listing facts, and doesn't know when to stop.
Model Capabilities: Helpfulness 25% · Safety 20% · Reasoning 15%

What you just saw: The same model, transformed step by step. Pretraining teaches it language. Supervised fine-tuning teaches it to follow instructions. RL teaches it what humans actually want — and, more recently, how to think through hard problems.

Chapter Overview

ℹ️The Journey Comes Full Circle

Everything in this book leads here. Policy gradients taught us how to optimize policies directly — LLMs are policies over tokens. REINFORCE gave us the foundation — GRPO is essentially REINFORCE with group baselines. Actor-Critic methods introduced the advantage function — GRPO replaces the learned critic with group statistics. PPO provided the clipped objective that makes RLHF stable.
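The "group baseline" idea can be shown concretely. The sketch below computes GRPO-style advantages by normalizing each sampled completion's reward against its group's mean and standard deviation, with no learned value network; the full GRPO objective also adds PPO-style clipping and a KL penalty, which are omitted here:

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """GRPO-style advantages for one prompt's group of sampled completions.

    Instead of querying a learned critic, normalize each reward by the
    group's mean and standard deviation (population std, as in GRPO).
    """
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard: all-equal rewards
    return [(r - mean) / std for r in group_rewards]

# Four completions sampled for one prompt; a verifier marked one correct:
rewards = [0.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))
```

The advantages sum to zero within each group: the correct completion is pushed up, the others are pushed down, which is exactly the role the critic's baseline played in Actor-Critic methods.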

RL for language models is where these techniques become a transformative technology.

💡Start Here

Begin with Why RL for Language Models? to understand both motivations — alignment and reasoning — and how the pieces fit together.


Key Takeaways

  • RL for LLMs has two motivations: alignment (steering models toward what humans want) and reasoning (developing new capabilities)
  • RLHF uses human preferences to train a reward model, then optimizes with PPO
  • GRPO eliminates the critic network and trains with verifiable rewards instead
  • The algorithm evolution — PPO to DPO to GRPO — trends toward simplicity and scalability
  • DeepSeek R1 showed that RL can produce emergent reasoning capabilities
  • Open challenges include reward hacking, mode collapse, and aligning systems smarter than us