RLHF for Language Models
The Alignment Problem
Large language models are trained to predict the next token, but prediction and usefulness are different goals. A model can be a perfect predictor while still generating harmful, unhelpful, or dishonest content.
Reinforcement Learning from Human Feedback (RLHF) bridges this gap by training models to optimize for what humans actually want, not just what comes next statistically.
The MDP Formulation
State Space
The state is the conversation context:
- System prompt (instructions)
- Conversation history (previous turns)
- Current user message
Each state is a sequence of tokens, and with long contexts that sequence can run to hundreds of thousands of tokens.
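A minimal sketch of how that state might be assembled, assuming a generic `tokenizer` with an `encode` method (the chat-formatting tags below are illustrative, not any specific API):

```python
# Illustrative sketch: the RL "state" is just the tokenized conversation so far.
def build_state(tokenizer, system_prompt, history, user_message):
    # Concatenate the system prompt, prior turns, and the new user message,
    # then tokenize the whole thing into a single token sequence.
    turns = [f"[SYSTEM] {system_prompt}"]
    turns += [f"[{role.upper()}] {text}" for role, text in history]
    turns += [f"[USER] {user_message}"]
    return tokenizer.encode("\n".join(turns))
```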
Action Space
The action space is the entire vocabulary at each generation step. For a 50,000-token vocabulary generating 500 tokens, that's $50{,}000^{500}$ possible responses.
In practice, we treat the complete response as a single “action” and assign rewards at the trajectory level.
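To make the size of that space concrete:

$$|\mathcal{V}|^{T} = 50{,}000^{500} = 10^{\,500 \cdot \log_{10} 50{,}000} \approx 10^{2349}$$

far more sequences than could ever be enumerated, which is why rewards are assigned per trajectory rather than per token-level action.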
Reward Signal
The reward comes from a reward model trained on human preferences:
- Humans compare pairs of model outputs
- A reward model learns to predict which response humans prefer
- This model provides dense reward signals during RL training
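A standard way to train such a reward model is a pairwise (Bradley-Terry) loss over preference pairs; here is a minimal sketch, assuming `reward_model(prompts, responses)` returns one scalar score per (prompt, response) pair:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    # Score the preferred and rejected response for each prompt.
    r_chosen = reward_model(prompts, chosen)      # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)  # shape: (batch,)
    # Bradley-Terry pairwise loss: push the chosen score above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```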
Mathematical Details
The RLHF objective optimizes:

$$\max_{\pi_\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]$$

where:
- $r_\phi(x, y)$ is the learned reward model
- $\pi_{\mathrm{ref}}$ is the supervised fine-tuned reference model (the KL term against it prevents reward hacking)
- $\beta$ controls how much the policy can deviate from the reference
Challenges
Reward Hacking
Models find unexpected ways to maximize reward without being genuinely helpful:
- Excessive verbosity (longer responses score higher)
- Sycophancy (always agreeing with the user)
- Format gaming (using bullet points excessively)
The KL constraint is critical for preventing this, but tuning the coefficient $\beta$ is more art than science.
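One cheap diagnostic for the verbosity failure mode is to track how strongly reward correlates with response length during training; a rough sketch (the data layout is assumed):

```python
import numpy as np

def length_reward_correlation(responses, rewards):
    # A strongly positive correlation suggests the policy can raise its
    # reward simply by writing longer responses, i.e. reward hacking.
    lengths = np.array([len(r) for r in responses], dtype=float)
    return float(np.corrcoef(lengths, np.asarray(rewards, dtype=float))[0, 1])
```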
Evaluation Difficulty
How do we know if the model is actually more aligned? Human evaluation is expensive, and automated metrics often miss subtle issues.
Credit Assignment
A 500-token response gets one reward signal. Which tokens were good? Which were bad? This sparse feedback makes learning slow.
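Many PPO-style RLHF implementations handle this by placing the single reward-model score on the final token and using the per-token KL penalty as the only dense signal; a sketch of that shaping (names and shapes are assumptions):

```python
import torch

def shape_token_rewards(seq_reward, policy_logps, ref_logps, beta=0.05):
    # policy_logps, ref_logps: per-token log-probs for one response, shape (seq_len,)
    with torch.no_grad():
        # Dense component: per-token KL penalty against the reference model.
        rewards = -beta * (policy_logps - ref_logps)
        # Sparse component: the reward model's single score lands on the last token.
        rewards[-1] += seq_reward
    return rewards
```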
Algorithm Selection
| Algorithm | Pros | Cons | Best For |
|---|---|---|---|
| PPO | Stable, well-understood | Requires critic model, high memory | General RLHF |
| GRPO | No critic needed, memory efficient | Needs more samples | Large models, outcome rewards |
| DPO | No RL at all, just supervised loss | Less flexible than RL | Quick experiments |
| REINFORCE | Simple to implement | High variance | Research, baselines |
See the GRPO paper analysis for how GRPO eliminates the critic.
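To sketch the group-relative idea: GRPO samples several responses per prompt and standardizes each reward within its own group, using group statistics in place of a learned value baseline (shapes assumed):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    # rewards: (num_prompts, group_size) - one reward per sampled response.
    # Standardizing within each group replaces the PPO critic/value baseline.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```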
Implementation Considerations
```python
# Simplified RLHF training loop (REINFORCE-style policy gradient with a KL penalty)
import torch

def rlhf_step(prompts, policy, ref_policy, reward_model, optimizer, beta=0.05):
    # 1. Generate responses (sampling only, no gradients needed)
    with torch.no_grad():
        responses = policy.generate(prompts, max_tokens=512)

        # 2. Score complete responses with the reward model
        rewards = reward_model(prompts, responses)

        # 3. Reference log-probs for the KL penalty
        ref_logps = ref_policy.get_log_probs(prompts, responses)

    # Policy log-probs, recomputed with gradients enabled
    policy_logps = policy.get_log_probs(prompts, responses)

    # 4. KL penalty: how far the policy has drifted from the reference model
    kl_penalty = policy_logps - ref_logps

    # 5. Policy-gradient loss: raise log-probs of above-average responses,
    #    plus the KL regularizer (PPO and GRPO use clipped/grouped variants of this)
    advantages = rewards - rewards.mean()
    loss = -(advantages * policy_logps).mean() + beta * kl_penalty.mean()

    # 6. Update the policy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Key hyperparameters:
- `beta` (KL coefficient): 0.01-0.1 typical
- Generation temperature: 0.7-1.0 during training
- Batch size: 128-512 prompts
- Group size (for GRPO): 16-64 samples per prompt
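A hypothetical wiring of the pieces above with those hyperparameters (`policy`, `ref_policy`, `reward_model`, and `prompt_loader` are stand-ins, and the learning rate is illustrative):

```python
import torch

# Only the policy is trained; the reference model and reward model stay frozen.
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)  # lr is illustrative
for prompts in prompt_loader:  # batches of 128-512 prompts
    rlhf_step(prompts, policy, ref_policy, reward_model, optimizer, beta=0.05)
```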
Further Reading
- GRPO Paper Analysis — Critic-free RL for LLMs
- InstructGPT Paper — Original RLHF for ChatGPT
- Constitutional AI — Self-improvement without human labels