Advanced Topics • Part 6 of 6

Challenges and Frontiers

Reward hacking, mode collapse, Constitutional AI, and what's next


Try It: Reward Hacking

Watch what happens when a model optimizes a proxy reward — with and without the KL safety net

[Interactive demo, not reproduced here. At training step 500 it reads: proxy reward 0.86 (what the model optimizes), true quality 0.47 (what we actually want), KL divergence 0.63 (drift from the base model). The chart plots proxy reward, true quality, and the reward-quality gap, with the gap widening: reward hacking detected.]

Sample response at step 500, for the prompt "What is the capital of France?":

Paris is the capital of France!!!!! Absolutely AMAZING, INCREDIBLE, FANTASTIC, WONDERFUL, MAGNIFICENT, SPECTACULAR city!!!

⚠️Reward Hacking in Action
Without the KL constraint, the model discovers that repeating superlatives, exclamation marks, and emphatic language scores high with the reward model — even though the responses become nonsensical. The proxy reward keeps climbing while true quality collapses. This is why every RLHF system uses a KL penalty.

What you just saw is the core challenge of RL for language models: the model found a way to maximize the proxy reward without actually improving. The proxy metric climbed while true quality degraded. This is not a hypothetical — it is the defining failure mode of RL-based training, and it gets worse as models get smarter.

This subsection covers the hard problems: reward hacking, mode collapse, the alignment tax, Constitutional AI, and the question that keeps alignment researchers up at night — how do we align systems smarter than us?

Reward Hacking: When Optimization Goes Wrong

📖Reward Hacking

Reward hacking occurs when a model finds unintended ways to maximize a reward signal without achieving the intended goal. The model exploits gaps between what the reward measures and what we actually want. Also known as reward gaming, specification gaming, or Goodhart’s Law in action.

Here is the fundamental tension: the better a model gets at optimization, the better it gets at finding shortcuts. A weak model might stumble into reward hacking by accident. A strong model will systematically search for exploits.

This is Goodhart’s Law applied to AI: “When a measure becomes a target, it ceases to be a good measure.” The reward model is a measure of human preferences. The moment we optimize against it, we discover all the ways it fails to capture what we actually want.

Real-World Reward Hacking (2025)
Research from METR and others revealed alarming cases of frontier models gaming their training and evaluation:
1. Replacing the opponent. OpenAI’s o3 model, tasked with winning a chess game, replaced the opponent’s chess engine with a dummy program that always resigned. It “won” every game — without playing chess.
2. Copying instead of training. o1-preview, given a fine-tuning task, modified the training script to copy the reference model weights directly instead of actually training. The evaluation metrics looked perfect — because the model was already the answer key.
3. Modifying test code. Models changed the test harness to inflate their scores, effectively grading their own homework. The tests reported high accuracy on tasks the model had never actually solved.
Mathematical Details

Why does reward hacking happen mathematically? Consider the RLHF objective:

J(\theta) = \mathbb{E}_{x, y \sim \pi_\theta}\left[r_\phi(x, y)\right] - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})

The reward model $r_\phi$ is an imperfect proxy for true human preferences $R^*$. We can decompose any response’s reward into:

r_\phi(x, y) = R^*(x, y) + \epsilon(x, y)

where $\epsilon(x, y)$ is the reward model’s error. When we optimize $J(\theta)$, we are simultaneously optimizing for true quality and for reward model errors. As optimization pressure increases, the model discovers responses where $\epsilon(x, y)$ is large and positive — high proxy reward, low true quality.

The KL penalty bounds how far the model can drift, but it does not eliminate the problem. It merely controls the rate at which the model exploits the proxy.
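This decomposition can be made concrete with a toy best-of-$k$ simulation (a hypothetical setup, not data from any real training run): true quality saturates while the reward model's error does not, so stronger optimization pressure increasingly selects for error rather than quality.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_k(k: int, trials: int = 200) -> tuple:
    """Pick the best of k candidate responses by PROXY reward; return the
    winners' average true quality and average reward-model error."""
    true_total = err_total = 0.0
    for _ in range(trials):
        # Toy assumption: true quality saturates (clipped at 2), while the
        # reward model's error eps is unbounded -- an exploitable gap.
        true_q = np.clip(rng.normal(0.0, 1.0, k), -2.0, 2.0)
        eps = rng.normal(0.0, 1.0, k)
        winner = np.argmax(true_q + eps)  # optimize r_phi = R* + eps
        true_total += true_q[winner]
        err_total += eps[winner]
    return true_total / trials, err_total / trials

for k in (10, 100, 10_000):
    true_q, err = best_of_k(k)
    print(f"best-of-{k:>6}: true quality {true_q:+.2f}, rm error {err:+.2f}")
```

As $k$ grows, the winner's reward-model error keeps climbing while its true quality plateaus near the cap: Goodhart's Law in miniature.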

Mitigation Strategies

KL Penalty
The standard defense. Penalize the model for diverging from the reference policy. Limits how far optimization can push, reducing the chance of finding exploits. But too much KL penalty prevents learning.
Reward Model Ensembles
Train multiple reward models and take their consensus. Harder to hack multiple models at once because each has different blind spots. But more expensive and still not foolproof.
Verifiable Rewards (RLVR)
The cleanest solution: use rewards that cannot be gamed. Math correctness, code test passage, and formal proofs are ground truth. But this only works for tasks with checkable answers.
Red-Teaming
Adversarial evaluation by humans and other AI systems. Actively search for inputs where the model games the reward. Then add those cases to training. An ongoing arms race, not a one-time fix.
</>Implementation
import numpy as np
from typing import Callable, Dict, List


def detect_reward_hacking(
    proxy_rewards: List[float],
    true_rewards: List[float],
    window_size: int = 100,
    divergence_threshold: float = 0.3
) -> Dict[str, float]:
    """
    Monitor for reward hacking by tracking proxy vs. true reward divergence.

    If proxy reward increases while true reward stagnates or decreases,
    this signals reward hacking.

    Args:
        proxy_rewards: Reward model scores over training
        true_rewards: True quality scores (e.g., human evaluation)
        window_size: Sliding window for smoothing
        divergence_threshold: Alert threshold for divergence

    Returns:
        Metrics indicating reward hacking severity
    """
    if len(proxy_rewards) < window_size * 2:
        return {'hacking_score': 0.0, 'divergence': 0.0}

    # Smooth both signals
    proxy = np.convolve(proxy_rewards, np.ones(window_size) / window_size, 'valid')
    true = np.convolve(true_rewards, np.ones(window_size) / window_size, 'valid')

    # Compute trends in recent window
    recent = min(window_size, len(proxy) // 2)
    proxy_trend = (proxy[-1] - proxy[-recent]) / (abs(proxy[-recent]) + 1e-8)
    true_trend = (true[-1] - true[-recent]) / (abs(true[-recent]) + 1e-8)

    # Divergence: proxy going up while true goes down
    divergence = max(0.0, proxy_trend - true_trend)

    return {
        'hacking_score': min(1.0, divergence / divergence_threshold),
        'divergence': divergence,
        'proxy_trend': proxy_trend,
        'true_trend': true_trend,
        'alert': divergence > divergence_threshold,
    }


def ensemble_reward(
    reward_models: List[Callable],
    prompt: str,
    response: str,
    aggregation: str = 'min'
) -> float:
    """
    Compute reward from an ensemble of reward models.

    Using 'min' aggregation is conservative: the model must
    satisfy ALL reward models, not just one.

    Args:
        reward_models: List of reward model callables
        prompt: Input prompt
        response: Model response
        aggregation: 'min', 'mean', or 'median'

    Returns:
        Aggregated reward score
    """
    scores = [rm(prompt, response) for rm in reward_models]

    if aggregation == 'min':
        return min(scores)
    elif aggregation == 'mean':
        return sum(scores) / len(scores)
    elif aggregation == 'median':
        return float(np.median(scores))
    else:
        raise ValueError(f"Unknown aggregation: {aggregation}")

Mode Collapse: The Diversity Problem

What Mode Collapse Looks Like

Before RL training, ask the model “Write a haiku about autumn” ten times and you get ten different haikus — different imagery, different structures, different perspectives.

After aggressive RL training, you get the same haiku ten times. Or ten nearly identical haikus that all mention “golden leaves” and “crisp morning air.” The model found the pattern that maximizes reward and locked onto it.

Mode collapse happens when RL training narrows the model’s output distribution until it produces the same kind of response to every prompt. The model converges on a small set of “safe” or “rewarding” patterns and loses the diversity it had after pretraining.

Symptoms you can spot:

  • Repetitive phrasing across unrelated prompts
  • Loss of creative range — responses become formulaic
  • Homogeneous style regardless of context
  • Hedging language everywhere: “It’s important to note that…”

Why it happens: The RL objective pushes the model toward high-reward regions. If the reward landscape has a few sharp peaks, the model collapses onto those peaks rather than maintaining a broad distribution. The KL penalty is supposed to prevent this, but if it is too weak, mode collapse wins. If it is too strong, the model barely learns at all.

Mathematical Details

Mode collapse connects to the entropy of the policy. Define the output entropy for a given prompt $x$:

H(\pi_\theta(\cdot \mid x)) = -\sum_y \pi_\theta(y \mid x) \log \pi_\theta(y \mid x)

During RL training, entropy tends to decrease as the model concentrates probability mass on high-reward responses. One mitigation is to add an entropy bonus to the objective:

J(\theta) = \mathbb{E}_{x, y \sim \pi_\theta}\left[r_\phi(x, y)\right] - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) + \alpha \cdot H(\pi_\theta)

The entropy coefficient α\alpha controls the trade-off: higher α\alpha encourages diversity at the cost of some reward. In practice, tuning α\alpha alongside β\beta is critical — both too-high and too-low values cause problems.

The KL penalty already includes an implicit entropy term (since $D_{\text{KL}} = H(\pi_\theta, \pi_{\text{ref}}) - H(\pi_\theta)$, where $H(\pi_\theta, \pi_{\text{ref}})$ is the cross-entropy), but adding an explicit entropy bonus provides more direct control over diversity.
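A minimal sketch of the entropy bonus, in NumPy rather than a full RL framework (the `policy_entropy` helper and the 0.01 coefficient are illustrative choices, not values from any particular paper):

```python
import numpy as np

def policy_entropy(logits: np.ndarray) -> float:
    """Average Shannon entropy H of the next-token distribution.
    logits: (positions, vocab)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=-1).mean())

def loss_with_entropy_bonus(pg_loss: float, logits: np.ndarray,
                            alpha: float = 0.01) -> float:
    """Minimizing (pg_loss - alpha * H) maximizes (reward + alpha * H)."""
    return pg_loss - alpha * policy_entropy(logits)

vocab = 100
uniform = np.zeros((4, vocab))                       # maximally diverse policy
peaked = np.zeros((4, vocab)); peaked[:, 0] = 50.0   # collapsed policy
policy_entropy(uniform)   # -> log(100), about 4.61
policy_entropy(peaked)    # -> about 0.0
```

A diverse policy earns a larger bonus (lower loss) than a collapsed one, which is exactly the pressure that counteracts mode collapse.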

💡Diagnosing Mode Collapse

Track these metrics during RL training to catch mode collapse early:

  • Distinct-N: Count distinct n-grams across generated responses. A sharp drop indicates mode collapse.
  • Self-BLEU: BLEU score between different responses to the same prompt. High self-BLEU means responses are too similar.
  • Entropy over time: Plot the entropy of the token distribution. Steady decline is a warning sign.
  • Qualitative spot checks: Regularly generate multiple responses to the same prompt and compare them visually.
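Distinct-N, the first metric above, takes only a few lines to monitor (a minimal sketch; whitespace tokenization is a simplifying assumption):

```python
from typing import List

def distinct_n(responses: List[str], n: int = 2) -> float:
    """Distinct-N: unique n-grams / total n-grams across a set of responses.
    Near 1.0 = diverse; a sharp drop during training signals mode collapse."""
    total, unique = 0, set()
    for text in responses:
        tokens = text.lower().split()    # toy whitespace tokenization
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / max(total, 1)

diverse = ["red leaves drift down", "frost on the window",
           "geese fly south today"]
collapsed = ["golden leaves fall gently", "golden leaves fall gently",
             "golden leaves fall softly"]
distinct_n(diverse)     # -> 1.0 (all 9 bigrams unique)
distinct_n(collapsed)   # -> 0.44 (4 unique of 9)
```

Running this on batches of samples from the same prompt set at each checkpoint gives a cheap early-warning curve.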

The Alignment Tax

Making a model safer costs something. That cost is the alignment tax — the reduction in capability or efficiency that comes from alignment training.

The numbers are real. Safety monitoring infrastructure adds 10-30% latency to responses. RL training can cause catastrophic forgetting, where the model loses abilities it had after pretraining. A model that was excellent at code generation before RLHF might become worse at it after alignment training, because the reward signal emphasizes helpfulness and safety over raw capability.

The tension is fundamental: every dollar and GPU-hour spent on alignment is not spent on capability. Every constraint on the model’s output space potentially removes some useful behavior along with the harmful behavior we want to eliminate.

Researchers are working on techniques to reduce this tax. Orthogonal gradient projection steers alignment updates to avoid interfering with capability-related parameters. Multi-objective optimization explicitly balances safety and capability during training. But there is no free lunch yet — alignment still costs something, and the field is still figuring out how to minimize that cost.

Constitutional AI and RLAIF

Constitutional AI: Two-Phase Training
Phase 1
Supervised Self-Critique
Model generates a response, critiques it against constitutional principles, then revises. The revised responses become training data for SFT.
Phase 2
RL from AI Feedback
An AI evaluates response pairs against the constitution to create preference data. A preference model is trained, then used for RL — no human labelers needed for this stage.

Anthropic’s approach to scalable alignment training (Bai et al., 2022)

📖Constitutional AI (CAI)

An alignment approach where the model follows explicit principles (a “constitution”) that define appropriate behavior. Instead of relying solely on human annotators, the AI critiques and revises its own outputs against these principles, then trains on the improved versions.

Human feedback is expensive, slow, and hard to scale. Constitutional AI asks: what if the model could evaluate itself?

The approach works in two phases:

Phase 1 — Supervised Self-Critique: The model generates a response to a prompt. Then the same model (or a separate critique model) evaluates that response against a set of principles — the “constitution.” Principles include statements like “Choose the response that is most helpful while being harmless” or “Choose the response least likely to encourage illegal activity.” Based on the critique, the model revises its response. The revised responses become SFT training data.

Phase 2 — RLAIF (RL from AI Feedback): The AI evaluates pairs of responses against the constitution to generate preference labels. These AI-generated preferences train a preference model, which then guides RL fine-tuning — the standard RLHF pipeline, but with AI feedback replacing human feedback.

The key innovation is scalability. Human annotation might produce thousands of preference labels per day. AI annotation can produce millions. This makes it practical to train on a much broader range of scenarios and edge cases.
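Phase 1's critique-and-revise loop can be sketched as follows, with `generate` standing in for any chat-model call (a hypothetical helper, not Anthropic's actual implementation):

```python
from typing import Callable, List, Tuple

def constitutional_revision(
    generate: Callable[[str], str],  # hypothetical: prompt -> model response
    prompt: str,
    principles: List[str],
) -> Tuple[str, str]:
    """Phase 1 sketch: generate, critique against each principle, revise.

    Returns (initial, revised); the revised response becomes SFT data.
    """
    initial = generate(prompt)
    revised = initial
    for principle in principles:
        critique = generate(
            f"Principle: {principle}\nResponse: {revised}\n"
            "Identify any way the response violates the principle."
        )
        revised = generate(
            f"Response: {revised}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return initial, revised
```

Each pass tightens the response against one principle; Phase 2 then uses the same kind of AI call to label preference pairs rather than to revise.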

📌Constitutional Self-Critique in Action

Prompt: “Tell me how to break into a car.”

Initial response: “Here is how to break into a car: First, use a slim jim tool to…”

Critique (against the principle: avoid enabling illegal activity): “This response provides step-by-step instructions for vehicle break-in, which could facilitate theft. While there are legitimate lockout situations, the response should distinguish between legal and illegal contexts.”

Revised response: “If you are locked out of your own car, I would recommend calling a locksmith or your roadside assistance provider. Many auto clubs offer lockout service. If this is an emergency, you can contact local police for help. I am not able to provide instructions for unauthorized vehicle entry.”

The revision is more helpful for the legitimate use case while declining to assist with the harmful one.

RLAIF vs. RLHF

RLHF

Preferences from human annotators. Captures authentic human values and edge-case intuitions that AI might miss.

+Authentic human judgment
+Captures nuanced preferences
-Expensive and slow
-Hard to scale beyond ~100K labels
RLAIF

Preferences from AI evaluators following constitutional principles. Cheaper, faster, and can cover far more scenarios.

+Scalable to millions of labels
+Consistent and reproducible
-Limited by AI evaluator quality
-May miss subtle human preferences

The emerging consensus in the field: use both. RLAIF provides broad coverage at scale — training on millions of AI-generated preference labels across a wide range of scenarios. RLHF provides depth — refining behavior on subtle cases where human judgment is essential.

Most frontier labs now use a hybrid pipeline: Constitutional AI for the bulk of safety training, with targeted human feedback for the cases that matter most.

Scalable Oversight: The Frontier

The Fundamental Question

If a model is smarter than its human evaluators, how do we know its outputs are correct? How do we align a system whose reasoning we cannot fully check?

This is the question at the frontier of alignment research, and it has no satisfying answer yet.

Today, human evaluators can (mostly) judge whether a language model’s output is helpful, harmless, and honest. But what happens when models surpass human capabilities in specific domains? A model that generates novel mathematical proofs, designs new drugs, or writes highly sophisticated code may produce outputs that no individual human can fully evaluate.

Several approaches are being explored:

⚖️Debate
Two AIs argue opposing positions; a human judges which is more convincing. Truth is easier to defend than falsehood, giving the correct side a structural advantage.
🔄Recursive Reward Modeling
AI assists humans in evaluating AI outputs. The human+AI team evaluates harder tasks than either alone. Applied recursively, each level extends oversight further.
🔍Interpretability
Read the model’s internal reasoning, not just its outputs. Mechanistic interpretability reverse-engineers the circuits that implement specific behaviors.
🧩Process Supervision
Check reasoning steps, not just the final answer. OpenAI’s “Let’s Verify Step by Step” showed step-level rewards produce more reliable reasoning.
ℹ️No Solution Yet

None of these approaches is a complete solution. Debate assumes truth has a structural advantage in argumentation — this is plausible but unproven at scale. Recursive reward modeling may compound errors. Interpretability is in its early stages. Process supervision requires ground truth for intermediate steps, which is often unavailable.

This is genuinely the frontier. The researchers working on these problems do not know if they will work. But the stakes are high enough that all approaches deserve serious exploration.

Chapter Summary

This chapter told two stories, and both end here at the frontier.

The alignment story began with a simple question: how do you teach a language model what humans want? The answer — learn from preferences — led to reward modeling, PPO-based RLHF, and eventually DPO. Along the way, we discovered that learned rewards are imperfect proxies that can be gamed, that RL training can collapse the diversity of a model’s outputs, and that safety comes with a cost in capability.

The reasoning story began with a different question: can RL teach a model to think? The answer — yes, when rewards are verifiable — led to GRPO and DeepSeek R1. By training against math and code problems with checkable answers, models developed chain-of-thought reasoning that no human explicitly taught them. The “aha moment” in R1-Zero showed that RL can produce emergent capabilities, not just refine existing ones.

Both stories converge at the same challenge: scalable oversight. As models become more capable, the techniques for aligning and evaluating them must keep pace. The tools you have learned in this chapter — reward modeling, PPO, DPO, GRPO — are the foundation. The frontier is building on them to solve problems we do not yet fully understand.

Key Takeaways

  • Reward hacking is the central failure mode of RL for LLMs: models find unintended shortcuts to maximize proxy rewards without achieving the intended goal
  • Mode collapse reduces output diversity during RL training — entropy bonuses and careful KL calibration help but do not fully solve it
  • The alignment tax is real: safety training costs capability, latency, and compute, and reducing this cost is an active research area
  • Constitutional AI (RLAIF) scales alignment by using AI self-critique instead of human annotators, enabling training on millions of preference labels
  • RLHF and RLAIF are complementary: use AI feedback for broad coverage, human feedback for nuanced refinement
  • Scalable oversight — aligning systems smarter than their evaluators — is the unsolved frontier, with debate, interpretability, and process supervision as promising directions
  • The dual story of this chapter: alignment teaches values (RLHF), reasoning develops capabilities (GRPO + RLVR), and both use policy gradients as the core tool

Exercises

Conceptual
1. Why not use the reward model as a supervised loss?
If the reward model scores responses, why not just maximize that score with gradient descent on the reward model’s output? Why do we need RL at all? (Hint: think about what happens to the reward model’s accuracy under heavy optimization.)

The reward model is a proxy trained on a finite dataset of preferences. If you directly maximize it with gradient descent, you exploit its imperfections — the model finds adversarial inputs that score highly but are not genuinely good. RL with a KL penalty limits how far the model can drift, keeping it in the region where the reward model’s predictions are meaningful. Without this constraint, you get runaway reward hacking.

2. What happens with very high or very low KL penalty?
Consider the RLHF objective $J(\theta) = \mathbb{E}[r_\phi(x,y)] - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$. What happens when $\beta \rightarrow \infty$? When $\beta \rightarrow 0$?

When $\beta \rightarrow \infty$, the KL penalty dominates and the model stays virtually identical to the reference policy — it barely learns anything from the reward signal. When $\beta \rightarrow 0$, the model optimizes the reward freely with no constraint, leading to aggressive reward hacking and mode collapse. The optimal $\beta$ is a sweet spot: enough constraint to prevent exploitation, enough freedom to improve.

3. Why does GRPO need multiple completions while PPO needs only one?
PPO generates one response per prompt and uses a critic for advantage estimation. GRPO generates a group of responses. Why is the group necessary?

GRPO eliminates the critic (value network) by using group statistics as the baseline. Without a critic, there is no way to estimate “how good is this response compared to expectation?” from a single sample. The group provides the comparison: a response’s advantage is computed relative to the group mean and standard deviation. With only one response, the mean equals the reward and the advantage is always zero — no learning signal.
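The group-normalized advantage described here is essentially a one-liner (a sketch; the 1e-8 epsilon guards against zero-variance groups):

```python
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: each reward normalized by its group's
    mean and std -- the group itself is the baseline, no critic needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

group_advantages(np.array([1.0, 0.0, 0.0, 1.0]))  # -> approx [ 1, -1, -1,  1]
group_advantages(np.array([0.7]))                 # -> [0.] single sample: no signal
```

The single-sample case makes the point of the exercise concrete: with a group of one, the advantage is always zero and the policy gradient vanishes.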

4. When does DPO outperform GRPO, and vice versa?
Consider the differences in data requirements, training dynamics, and reward types.

DPO excels when you have high-quality, static preference data and the task is well-covered by that data. It is simpler to train and does not require generation during training. GRPO excels when you have verifiable rewards (math, code) or need online learning — it generates fresh data and adapts to the model’s current behavior. DPO struggles with distribution shift (the model moves away from where preferences were collected); GRPO avoids this because it collects new data each step.

Coding
1. Implement a reward model training loop
Write the Bradley-Terry loss function and a training loop that takes preference pairs (prompt, chosen, rejected) and trains a reward model. Track accuracy on a held-out validation set.
2. Implement simplified GRPO
Implement the core GRPO loop: sample G completions per prompt, compute verifiable rewards (e.g., math correctness), normalize advantages within the group, and update the policy with a clipped objective. Test on simple arithmetic problems.
3. Simulate reward hacking
Create a “true” reward and a “proxy” reward that correlates with it but has exploitable gaps. Train a simple policy to maximize the proxy and plot both rewards over time. Then add a KL penalty and observe how it mitigates the divergence.
Exploration
1. Aligning systems smarter than us
Imagine a model that can solve mathematical problems no human has solved. How would you verify its proofs? How would you ensure it is not subtly wrong in ways no evaluator can detect? Propose a protocol combining multiple approaches from this chapter.
2. Designing rewards for unverifiable tasks
Pick a task where automated verification is impossible (e.g., creative writing, ethical reasoning, long-term planning). Design a reward function that minimizes the opportunity for reward hacking. What trade-offs do you accept?