Advanced Topics • Part 1 of 6

Why RL for Language Models?

The alignment problem and the reasoning revolution

In 2022, reinforcement learning turned autocomplete engines into conversational partners. In 2025, it taught them to reason. These are two different problems solved by the same tool: policy gradients.

This subsection explains why RL became essential for language models, what it adds that supervised learning cannot, and how the field evolved from proxy rewards to verifiable ones.

Two Motivations, One Tool

Alignment (2022)

Teaching models what humans want. We can’t write a loss function for “helpful” or “harmless,” so we learn one from human preferences.

Key technique: RLHF (PPO + learned reward model)

“Make responses that humans prefer.”

Reasoning (2024-2025)

Developing new capabilities. Instead of human preferences, models train against math problems, code tests, and other tasks with checkable answers.

Key technique: GRPO + verifiable rewards (RLVR)

“Get the right answer, however you get there.”

These two revolutions share a common backbone: policy gradient methods. Everything you learned about REINFORCE and PPO applies directly. The model is a policy. Tokens are actions. Prompts are states. The only difference is the reward signal:

  • Alignment uses a learned reward (a neural network trained on human preferences)
  • Reasoning uses a verifiable reward (did the model get the math problem right?)

The same optimization machinery drives both. But as we will see, these two reward types behave very differently in practice.

The Alignment Problem

📖The Alignment Problem

The challenge of ensuring AI systems pursue goals aligned with human values and intentions. An aligned AI does what humans actually want, not just what maximizes some proxy metric.

Why can’t we just write a loss function for “be helpful”? Consider the challenge:

1. We can’t specify what we want: “Be helpful” is too vague. “Always answer accurately” fails on dangerous questions. Every rule creates new edge cases.
2. Supervised data is limited: Human-written ideal responses are expensive. Writers disagree. The model memorizes patterns rather than learning principles.
3. Human values are contextual: The right answer depends on who is asking, why, and in what context. No fixed set of rules can capture this.

But here is the insight that makes RLHF possible:

📌The Discriminator-Generator Gap

You don’t need to write the perfect response to know which of two responses is better.

Given the prompt “Explain quantum entanglement to a 10-year-old”:

Response A: “Quantum entanglement is a phenomenon where two particles become correlated such that the quantum state of one cannot be described independently…”

Response B: “Imagine you have two magic coins. Whenever you flip one and get heads, the other one always lands on tails, no matter how far apart they are…”

Most people immediately prefer B for this audience. But could they write B from scratch? Maybe not. It is easier to judge quality than to produce it.

This is the discriminator-generator gap, and it is the foundation of RLHF. Humans provide comparative judgments. A reward model learns to predict those judgments. RL optimizes the model to produce responses the reward model scores highly.
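These comparative judgments become a training signal through a pairwise loss. Below is a minimal sketch of the idea, assuming the reward model outputs a scalar score per response; the function name and toy values are illustrative, and the underlying Bradley-Terry formulation is introduced in the next subsection.

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Models P(chosen > rejected) = sigmoid(score_chosen - score_rejected),
    so the loss shrinks as the reward model ranks the pair correctly.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider margin in favor of the chosen response gives a smaller loss.
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.5, 0.0))  # True
```

Training the reward model means minimizing this loss over a dataset of human comparisons: the annotators never need to write a perfect response, only say which of two is better.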

ℹ️Karpathy's Framing

Andrej Karpathy famously described RLHF as “just barely RL” because the reward signal is a learned proxy, not ground truth. The reward model is a neural network trained on a few hundred thousand comparisons — it captures some of what humans want, but imperfectly. This contrasts sharply with games like Go, where the reward (win/loss) is unambiguous.

We will return to this distinction — proxy vs. verifiable rewards — later in this subsection.

The LLM as an RL Agent

A language model generating a response is an RL agent taking actions in an environment. The mapping is direct:

The RL-LLM Mapping

  • State s_t: the prompt plus all tokens generated so far
  • Action a_t: the next token from the vocabulary (~32K-128K choices)
  • Policy \pi_\theta(a_t \mid s_t): the language model itself
  • Episode: generating one complete response
  • Reward r: human preference score or correctness check
Mathematical Details

The language model defines a policy over tokens:

\pi_\theta(a_t \mid s_t) = \pi_\theta(\text{token}_t \mid \text{prompt}, \text{token}_1, \ldots, \text{token}_{t-1})

A complete response y = (a_1, a_2, \ldots, a_T) is an episode. The probability of generating response y given prompt x is:

\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(a_t \mid x, a_1, \ldots, a_{t-1})

The RL objective is to maximize expected reward:

J(\theta) = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)} \left[ R(x, y) \right]

where R(x, y) is a scalar reward for the complete response. This is exactly the policy gradient setup from REINFORCE, with tokens as actions.

</>Implementation
import torch
import torch.nn.functional as F

def compute_response_log_prob(model, prompt_ids, response_ids):
    """
    Compute log π_θ(response | prompt).

    In RL terms: the log probability of this episode
    under the current policy.
    """
    # Concatenate prompt and response
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)

    # Forward pass through the language model
    outputs = model(input_ids)
    logits = outputs.logits

    # Get log probabilities for the response tokens only
    # Shift: logits[t] predicts token[t+1]
    prompt_len = prompt_ids.shape[-1]
    response_logits = logits[:, prompt_len - 1:-1, :]  # Predictions for response tokens
    log_probs = F.log_softmax(response_logits, dim=-1)

    # Gather the log prob of each actual response token
    token_log_probs = log_probs.gather(
        dim=-1,
        index=response_ids.unsqueeze(-1)
    ).squeeze(-1)

    # Sum over tokens: log π(y|x) = Σ log π(a_t|s_t)
    return token_log_probs.sum(dim=-1)
💡Why Token-Level Actions?

You might wonder: why not treat the entire response as a single action? Because the action space would be astronomically large — every possible sequence of tokens. By treating each token as a separate action, we can compute gradients through the autoregressive structure, just like any sequential decision problem.
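The summed log-probability returned by compute_response_log_prob is exactly what a REINFORCE-style update weights by reward. A minimal sketch with toy numbers standing in for real model outputs and rewards (the values and batch size are illustrative):

```python
import torch

# Toy stand-ins: log π(y|x) for three sampled responses, and their rewards
# (e.g. 1.0 if the response was preferred or verified correct, else 0.0).
log_probs = torch.tensor([-12.3, -9.8, -15.1], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0])

# Mean reward as a simple variance-reducing baseline.
baseline = rewards.mean()

# REINFORCE loss: -(R - b) * log π(y|x), averaged over the batch.
# Gradient descent on this raises the probability of above-baseline responses.
loss = -((rewards - baseline) * log_probs).mean()
loss.backward()
```

In a real pipeline the log-probabilities would come from compute_response_log_prob on sampled responses, and the gradient would flow into the model's parameters rather than into a raw tensor.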

The Three Training Stages

Every modern LLM goes through a pipeline of three stages. Each stage adds a different capability, and you cannot skip stages.

The LLM Training Pipeline

  • Stage 1 (Pretraining): Predict the next token on trillions of words from the internet. Learns language, facts, and patterns. Adds: Knowledge
  • Stage 2 (SFT): Fine-tune on human-written examples of ideal assistant behavior. Learns format and instruction-following. Adds: Instruction-following
  • Stage 3 (RL): Optimize against a reward signal, either human preferences (RLHF) or verifiable answers (RLVR). Adds: Alignment / Reasoning

Each stage builds on the previous one. Skipping stages produces poor results.

Why can’t you skip stages?

  • Skip pretraining? The model doesn’t know language. You can’t align something that can’t form sentences.
  • Skip SFT? The model knows language but not how to be an assistant. It might continue your prompt as a novel.
  • Skip RL? The model follows instructions but inconsistently. It may be helpful one moment and harmful the next.

Think of it like training a doctor: medical school (pretraining) teaches knowledge, residency (SFT) teaches patient interaction, and peer review (RL) teaches judgment.

Mathematical Details

Each stage has a different training objective:

Stage 1 — Pretraining (next-token prediction):

L_{\text{PT}}(\theta) = -\mathbb{E}_{x \sim D_{\text{text}}} \left[ \sum_{t=1}^{T} \log \pi_\theta(x_t \mid x_1, \ldots, x_{t-1}) \right]

Stage 2 — Supervised Fine-Tuning (imitate demonstrations):

L_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y^*) \sim D_{\text{demo}}} \left[ \log \pi_\theta(y^* \mid x) \right]

Stage 3 — RL (maximize reward):

J_{\text{RL}}(\theta) = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta} \left[ R(x, y) - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right]

The KL divergence term in Stage 3 prevents the model from drifting too far from the SFT model. Without it, the model can exploit weaknesses in the reward signal.
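The Stage 3 objective can be made concrete for a single response. A minimal sketch, assuming per-token log-probabilities from the policy and the frozen reference model, and using the simple sample-based KL estimate (the sum of log pi - log pi_ref over the generated tokens); the function name and values are illustrative:

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Stage 3 objective for one response: R(x, y) - beta * KL(pi || pi_ref).

    Uses the sample-based KL estimate: the sum of (log pi - log pi_ref)
    over the tokens the policy actually generated.
    """
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl

# Toy per-token log-probs: the policy has drifted above the reference,
# so the KL term claws back part of the reward.
policy_lp = [-1.0, -0.5, -0.8]
ref_lp = [-1.2, -0.9, -1.0]
print(kl_penalized_reward(1.0, policy_lp, ref_lp, beta=0.1))
```

Raising beta keeps the policy closer to the SFT reference; lowering it lets optimization push harder against the reward signal.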

Timeline of Key Developments

2017
Deep RL from Human Preferences
Christiano et al. show that agents can learn complex behaviors from human comparative feedback, applied to Atari games and simulated robotics.
2020
GPT-3 and Emergent Capabilities
175 billion parameters. Few-shot learning emerges. But the model is unreliable, sometimes toxic, and hard to steer.
2022
InstructGPT and ChatGPT
OpenAI applies RLHF at scale. InstructGPT paper shows the three-stage pipeline. ChatGPT launches and reaches 100 million users in two months.
2023
DPO Simplifies the Pipeline
Rafailov et al. show that the reward model can be eliminated entirely, optimizing preferences directly. Simpler, more stable, widely adopted.
2024
DeepSeekMath Introduces GRPO
Group Relative Policy Optimization eliminates the critic network, using group statistics as baselines instead. Efficient and effective for math reasoning.
2025
DeepSeek R1 and the Reasoning Revolution
Pure RL produces emergent chain-of-thought reasoning. RLVR (RL from Verifiable Rewards) emerges as a leading approach for capability improvement. Models learn to think step by step without being told how.
ℹ️The Arc of Simplification

Notice the trend: each major advance simplifies the pipeline. PPO-based RLHF requires four models in memory (policy, reference, critic, reward model). DPO eliminates the reward model. GRPO eliminates the critic. The algorithms are getting simpler, not more complex — and producing better results.

Proxy vs. Verifiable Rewards

This is the most important conceptual distinction in the field. It determines how far you can push optimization, how much you can trust the results, and which failure modes to worry about.

Proxy Rewards (RLHF)

A neural network predicts what humans would prefer. Trained on ~100K comparisons.

The reward model is an approximation. Push too hard and the model finds ways to score high without actually being helpful.

  + Works for subjective tasks (writing, chat)
  - Can be gamed (reward hacking)
  - Requires KL penalty as safety net
Verifiable Rewards (RLVR)

The answer is checked against ground truth. Did the math come out right? Does the code pass the tests?

The reward is exact. You can optimize aggressively because there is no proxy to exploit.

  + No reward hacking: the reward is ground truth
  + Can optimize aggressively
  - Only works for tasks with checkable answers

Karpathy captures this distinction perfectly: “RLHF is just barely RL” because the reward is a “crappy proxy” — a neural network that approximates human preferences but can be fooled. The model might learn to produce verbose, confident-sounding responses that score well without being genuinely helpful.

Verifiable rewards are “real” RL — closer to AlphaGo. When the reward is “did you get the right answer to this math problem?”, there is no shortcut. The model must actually develop reasoning capabilities to improve.

This is why the shift from RLHF to RLVR was so important. It turned language model training from a fuzzy optimization problem into something closer to the classical RL successes we studied earlier in this book.

Mathematical Details

The distinction shows up in the reward function:

Proxy reward (RLHF):

R_{\text{proxy}}(x, y) = r_\phi(x, y)

where r_\phi is a learned reward model. The objective must include a KL constraint to prevent exploitation:

J(\theta) = \mathbb{E}\left[ r_\phi(x, y) - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right]

Verifiable reward (RLVR):

R_{\text{verify}}(x, y) = \mathbb{1}[\text{extract\_answer}(y) = \text{ground\_truth}(x)]

This is a binary signal: 1 if correct, 0 if wrong. No learned model, no approximation, no hacking. The KL penalty is still used for stability, but can be set much lower because the reward is trustworthy.
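A verifier of this shape is only a few lines of code. Below is a minimal sketch, assuming final answers are wrapped in \boxed{...}; that extraction convention, and the function name, are illustrative, and real pipelines use task-specific parsers:

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary RLVR-style reward: 1.0 iff the extracted answer matches ground truth."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable answer counts as wrong
    return 1.0 if match.group(1).strip() == ground_truth else 0.0

print(verifiable_reward(r"... so the total is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... which gives \boxed{41}", "42"))      # 0.0
```

Because the check is exact, there is no learned model for the policy to exploit; the only ways to score higher are to answer correctly more often.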

Why Not Just Use Supervised Learning?

If we have examples of good responses, why not just train on those? Supervised Fine-Tuning (SFT) does exactly this — and it helps. But it has fundamental limitations that RL overcomes.

SFT: Learning by Imitation

Trains on human-written examples. The model learns to copy the distribution of responses it sees.

Limited by the quality and diversity of the training data. Cannot exceed the performance of its teachers.

“Repeat what the expert said.”

RL: Learning by Exploration

Generates its own responses and learns from feedback. The model discovers strategies that no human demonstrator showed it.

Can surpass human performance because it explores beyond the training distribution.

“Find something even better.”

📌The Ceiling Problem

Suppose you hire 100 writers to produce ideal responses for math problems. The best writer solves 80% of problems correctly. SFT trains the model to match these writers — so the model’s ceiling is roughly 80%.

With RL and verifiable rewards, the model generates many attempts, keeps what works, and discards what doesn’t. It might discover solution strategies no individual writer used. DeepSeek R1 achieved math performance that exceeds most human demonstrators, precisely because RL is not bounded by the quality of demonstrations.

This is the same dynamic that made AlphaGo superhuman: RL lets the system discover strategies beyond what any human teacher could provide.
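The ceiling argument can be made concrete with a toy simulation: one imitation attempt per problem at the teacher's rate, versus sampling several attempts and keeping any that verifies. The 80% rate mirrors the example above; the 8 attempts per problem are an arbitrary illustrative choice.

```python
import random

random.seed(0)

def attempt(solve_prob: float) -> bool:
    """One solution attempt that succeeds with the given probability."""
    return random.random() < solve_prob

n_problems = 10_000

# Imitation ceiling: one attempt per problem at the best writer's 80% rate.
sft_rate = sum(attempt(0.80) for _ in range(n_problems)) / n_problems

# RL-style exploration: sample 8 attempts and count the problem as solved
# if any attempt passes the verifier.
rl_rate = sum(any(attempt(0.80) for _ in range(8)) for _ in range(n_problems)) / n_problems

print(f"one attempt: {sft_rate:.3f}, best of 8: {rl_rate:.4f}")
```

Sampling alone does not train anything, of course; the point is that sampling plus a verifier surfaces correct trajectories that the policy can then be reinforced on, lifting it past the single-demonstrator ceiling.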

Key Takeaways

  • RL for LLMs serves two purposes: alignment (teaching values from human preferences) and reasoning (developing capabilities from verifiable rewards)
  • The discriminator-generator gap makes RLHF possible: it is easier to judge quality than to produce it
  • A language model is an RL policy: state = context, action = next token, episode = generating a response
  • The three training stages (pretraining, SFT, RL) each add distinct capabilities and cannot be skipped
  • Proxy rewards (RLHF) enable subjective tasks but can be gamed; verifiable rewards (RLVR) enable aggressive optimization on tasks with checkable answers
  • RL can surpass supervised learning because it explores beyond the demonstration distribution
💡Next: Reward Modeling

If you are familiar with REINFORCE and PPO, you have all the prerequisites. The next subsection on Reward Modeling introduces the Bradley-Terry model and preference data collection — the foundation for everything that follows.