RL Algorithms for LLMs
|  | PPO | DPO | GRPO |
|---|---|---|---|
| Models in memory | 4 | 2 | 2 |
| Data type | Online | Offline pairs | Online |
| Reward source | Learned RM | Implicit | Verifiable |
| Critic needed | Yes | No | No |
| Best for | Maximum control, established pipeline | Simple alignment with existing preference data | Reasoning tasks with checkable answers |
The table above compares three algorithms that shape modern language models. Each represents a different answer to the same question: how do you optimize a language model using a reward signal? The answer has evolved from the heavyweight machinery of PPO toward the elegant simplicity of GRPO, driven by a relentless push to reduce memory, complexity, and instability.
PPO for Language Models: The Classic Approach
This section assumes familiarity with PPO’s clipped objective. For the full derivation, see the PPO chapter. Here we focus on what changes when the policy is a language model instead of an Atari agent.
PPO applied to language models optimizes a KL-regularized reward objective: generate responses that score highly under a reward model while staying close to a reference policy. The objective balances reward maximization against distributional drift.
You already know how PPO works for games: collect experience, compute advantages, clip the policy ratio, update. For language models the mechanics are identical, but the scale changes everything.
The RLHF objective with KL penalty:

$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]$$
Without the KL penalty, the model quickly learns to exploit reward model weaknesses. It might generate repetitive praise, absurdly long responses, or text that “sounds confident” without being correct. The KL term says: “improve, but don’t become something completely different.”
Per-token reward decomposition:
The KL penalty applies at every token, while the reward model score arrives only at the end:
- For tokens $t < T$: $r_t = -\beta \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}$
- For the final token $t = T$: $r_T = r_\phi(x, y) - \beta \log \frac{\pi_\theta(y_T \mid x, y_{<T})}{\pi_{\mathrm{ref}}(y_T \mid x, y_{<T})}$
This creates a dense reward signal from a sparse one. The KL penalty on every token provides continuous feedback, while the reward model score at the end provides the learning signal about response quality.
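As a sketch, the decomposition can be written directly over the response's token log-probabilities (the `per_token_rewards` helper and the toy tensor values are illustrative, not from a real model):

```python
import torch

def per_token_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Dense per-token rewards: -beta * (per-token log-ratio) at every
    step, with the reward model score added on the final token only."""
    rewards = -beta * (policy_logprobs - ref_logprobs)  # KL penalty, every token
    rewards[-1] += rm_score                             # sparse RM score at the end
    return rewards

# Toy example: 4 response tokens, reward model score 1.0
pi = torch.tensor([-1.0, -2.0, -0.5, -1.5])   # log pi_theta(y_t | ...)
ref = torch.tensor([-1.2, -1.8, -0.5, -1.0])  # log pi_ref(y_t | ...)
r = per_token_rewards(pi, ref, rm_score=1.0, beta=0.1)
# Intermediate tokens carry only the small KL signal; the last
# token carries the KL term plus the full sequence-level score.
```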
Credit assignment through the value function:
A value function estimates expected future reward from any point in generation:

$$V_\psi(x, y_{<t}) \approx \mathbb{E}\Big[\textstyle\sum_{t' \ge t} r_{t'} \,\Big|\, x, y_{<t}\Big]$$
The advantage $A_t$ tells us: “Was this token better or worse than expected?” This is critical for learning which tokens in a response actually contributed to its quality.
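A minimal illustration, using one-step TD errors (PPO in practice uses GAE, which interpolates between one-step and full-return estimates; `token_advantages` and the toy numbers below are hypothetical):

```python
import torch

def token_advantages(rewards, values, gamma=1.0):
    """One-step TD advantages A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    The value after the final token is taken to be 0; gamma=1 is the
    common choice for LLM fine-tuning."""
    next_values = torch.cat([values[1:], torch.zeros(1)])
    return rewards + gamma * next_values - values

# Toy example: the critic expected a mediocre ending, but the final
# token earned a high reward -> large positive advantage at the end.
rewards = torch.tensor([0.0, 0.0, 1.0])
values = torch.tensor([0.3, 0.3, 0.2])
adv = token_advantages(rewards, values)
```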
The Four-Model Problem
PPO for RLHF requires four models in GPU memory simultaneously:
- Policy model — the model being trained
- Reference model — frozen copy for KL computation
- Value function — the critic for advantage estimation
- Reward model — scores complete responses
For a 7B-parameter model, the weights alone run to roughly 14 GB per copy in bf16, and the two trained models (policy and critic) additionally carry gradients and optimizer state.
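A back-of-envelope sketch, assuming bf16 weights (2 bytes per parameter) and fp32 Adam state for the trained models, ignoring activations and KV cache (`rlhf_memory_gb` is an illustrative helper; the byte counts are rough conventions, not exact figures):

```python
def rlhf_memory_gb(n_params_b=7.0):
    """Ballpark GPU memory for PPO-style RLHF with four same-size models.

    Assumptions (illustrative): trained models (policy, critic) need
    bf16 weights (2 B/param) + bf16 gradients (2 B) + fp32 Adam master
    weights and two moments (12 B); frozen models (reference, reward)
    need weights only.
    """
    p = n_params_b * 1e9
    trained = 2 * p * (2 + 2 + 12) / 1e9  # policy + critic: 16 B/param each
    frozen = 2 * p * 2 / 1e9              # reference + reward: 2 B/param each
    return {'trained_gb': trained, 'frozen_gb': frozen,
            'total_gb': trained + frozen}

budget = rlhf_memory_gb(7.0)  # roughly 252 GB before activations
```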
This is why PPO for LLMs is an infrastructure challenge as much as an algorithmic one. Training a 70B model with PPO requires clusters of high-end GPUs. The question that drives the rest of this section: can we get the same results with fewer models?
The Critic Problem
In standard RL, the critic is tiny compared to the policy. A CartPole critic has a few hundred parameters. An Atari critic has a few million. The cost of the critic is negligible.
In LLM RL, the critic *is* another language model — the same size as the policy. It needs to understand language well enough to predict “how good will this partially-generated response turn out?” That requires billions of parameters.
Worse, training this critic on sparse, sequence-level rewards is inherently difficult. The reward arrives once at the end of generation. The critic must learn to predict this final score from partial sequences, which requires many training samples and careful optimization.
This motivates two complementary strategies:
- DPO: eliminate the reward model entirely by folding preferences into a supervised loss
- GRPO: eliminate the critic by using group statistics as a baseline instead
DPO: Direct Preference Optimization
DPO bypasses RL entirely by exploiting a mathematical insight: the optimal policy under the RLHF objective has a closed-form relationship with the reward function. This lets us reparameterize the reward in terms of the policy itself, turning RL into a supervised learning problem on preference pairs.
DPO starts from a remarkable observation. The RLHF objective with KL penalty has an exact solution:

$$\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)$$
If you could compute this, you would not need RL at all. You would just adjust the reference policy by the exponentiated reward. The problem is that you do not know the reward function — that is what the reward model learns.
DPO’s trick: rearrange this equation to express the reward in terms of the policy:

$$r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$
where $\beta \log Z(x)$ is a constant that depends only on the prompt $x$, not the response $y$.
Now substitute this “implicit reward” into the Bradley-Terry preference model. The constant cancels (it appears in both the chosen and rejected terms), and you get a loss that depends only on the policy and the reference model — no reward model needed.
What the DPO loss does intuitively:
- Increase the probability of the preferred response (relative to the reference)
- Decrease the probability of the rejected response (relative to the reference)
- The reference model acts as an anchor, preventing the policy from collapsing
Deriving the DPO loss:
Start from the optimal policy under KL-constrained reward maximization:

$$\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)$$

Solve for the reward:

$$r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$

Under the Bradley-Terry model, the preference probability is:

$$p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$

Substituting the implicit reward (the $\beta \log Z(x)$ terms cancel):

$$p(y_w \succ y_l \mid x) = \sigma\!\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)$$

The DPO loss is the negative log-likelihood of the observed preferences:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\!\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]$$
Gradient analysis:
The gradient of the DPO loss increases the log-probability of preferred responses and decreases it for rejected responses, weighted by how “surprising” the current preference is. When the model already strongly prefers the correct response, the gradient is small. When it gets the preference wrong, the gradient is large.
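This weighting falls directly out of the loss: with $z = \beta \times \text{margin}$, $\frac{d}{dz}\big[-\log \sigma(z)\big] = -\sigma(-z)$, so the gradient magnitude is $\sigma(-z)$. A quick numerical check (toy margins, not real model outputs):

```python
import torch

# Gradient magnitude of the DPO loss w.r.t. its logit z = beta * margin:
# d/dz [-log sigmoid(z)] = -sigmoid(-z), so |grad| = sigmoid(-z).
grads = []
for margin in [-4.0, 0.0, 4.0]:
    z = torch.tensor(margin, requires_grad=True)
    loss = -torch.nn.functional.logsigmoid(z)
    loss.backward()
    grads.append(abs(z.grad.item()))
# margin -4 (model prefers the wrong response): |grad| near 1
# margin  0 (indifferent):                      |grad| = 0.5
# margin +4 (preference already correct):       |grad| near 0
```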
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
from typing import List, Dict, Tuple


@dataclass
class DPOConfig:
    """Configuration for DPO training."""
    beta: float = 0.1  # KL penalty strength
    learning_rate: float = 1e-6
    max_length: int = 512
    epochs: int = 3


def compute_sequence_log_probs(
    model: nn.Module,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    response_start: int
) -> torch.Tensor:
    """
    Compute log-probability of the response tokens.

    Args:
        model: Language model
        input_ids: Full sequence [batch, seq_len]
        attention_mask: Mask [batch, seq_len]
        response_start: Index where response begins

    Returns:
        Log-probabilities summed over response tokens [batch]
    """
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits

    # Shift for autoregressive prediction
    shift_logits = logits[:, response_start - 1:-1, :]
    shift_labels = input_ids[:, response_start:]
    shift_mask = attention_mask[:, response_start:]

    log_probs = F.log_softmax(shift_logits, dim=-1)
    token_log_probs = log_probs.gather(
        dim=-1, index=shift_labels.unsqueeze(-1)
    ).squeeze(-1)

    # Sum log-probs over response tokens
    return (token_log_probs * shift_mask).sum(dim=-1)


def dpo_loss(
    policy: nn.Module,
    ref_policy: nn.Module,
    chosen_ids: torch.Tensor,
    chosen_mask: torch.Tensor,
    rejected_ids: torch.Tensor,
    rejected_mask: torch.Tensor,
    response_start: int,
    beta: float = 0.1
) -> Tuple[torch.Tensor, Dict[str, float]]:
    """
    Compute DPO loss on a batch of preference pairs.

    Returns:
        loss: Scalar DPO loss
        metrics: Training metrics
    """
    # Policy log-probs
    pi_chosen = compute_sequence_log_probs(
        policy, chosen_ids, chosen_mask, response_start
    )
    pi_rejected = compute_sequence_log_probs(
        policy, rejected_ids, rejected_mask, response_start
    )

    # Reference log-probs (no gradients)
    with torch.no_grad():
        ref_chosen = compute_sequence_log_probs(
            ref_policy, chosen_ids, chosen_mask, response_start
        )
        ref_rejected = compute_sequence_log_probs(
            ref_policy, rejected_ids, rejected_mask, response_start
        )

    # DPO logits: beta * (log_ratio_chosen - log_ratio_rejected)
    logits = beta * (
        (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    )
    loss = -F.logsigmoid(logits).mean()

    # Metrics
    with torch.no_grad():
        accuracy = (logits > 0).float().mean().item()
        chosen_reward = beta * (pi_chosen - ref_chosen).mean().item()
        rejected_reward = beta * (pi_rejected - ref_rejected).mean().item()

    metrics = {
        'loss': loss.item(),
        'accuracy': accuracy,
        'chosen_reward': chosen_reward,
        'rejected_reward': rejected_reward,
        'reward_margin': chosen_reward - rejected_reward,
    }
    return loss, metrics
```

When DPO works well:
- You have high-quality paired preference data
- The preference labels are clean and consistent
- The task is well-covered by existing data
When DPO struggles:
- Distribution shift: DPO trains on pre-collected data. If the policy moves far from where the data was collected, the preferences may no longer be meaningful.
- Noisy labels: DPO can overfit to annotation noise because it treats every preference as ground truth.
- No online learning: You cannot generate new responses and get fresh feedback during training. The model only learns from the fixed dataset.
These limitations motivate online methods like PPO and GRPO, which generate fresh data during training and can adapt to the policy’s current behavior.
GRPO: Group Relative Policy Optimization
GRPO eliminates the critic by replacing it with group statistics. For each prompt, sample a group of completions, score them, and use the group mean and standard deviation as a baseline. This turns sparse rewards into normalized advantages without training a value function.
GRPO is essentially REINFORCE with a group-based baseline. If the connection to REINFORCE feels unfamiliar, review the REINFORCE chapter, especially the section on variance reduction through baselines. The idea is the same: subtract a baseline from the reward to reduce variance without introducing bias.
GRPO’s insight is beautifully simple. Instead of training a massive value network to estimate “how good is this partial response?”, just ask: “how good is this response compared to other responses for the same prompt?”
The algorithm:
- Sample a group: For each prompt $x$, generate $G$ completions from the current policy: $y_1, \dots, y_G \sim \pi_\theta(\cdot \mid x)$
- Score each completion: Compute a reward $r_i$ for each $y_i$ (e.g., a correctness check for math, or a reward model score)
- Normalize within the group: Compute the advantage as $\hat{A}_i = \frac{r_i - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the group mean and standard deviation
- Apply PPO’s clipped objective: Use these normalized advantages with the standard clipped ratio
- Add KL penalty: Prevent divergence from the reference model
That is it. No critic. No value function training. No GAE computation. The group statistics serve as a “free” baseline.
Why this works:
- If one response in a group scores much higher than the others, it gets a large positive advantage — “do more of this”
- If a response scores below the group average, it gets a negative advantage — “do less of this”
- The normalization ensures the learning signal is well-scaled regardless of reward magnitude
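On a toy group of $G = 8$ completions with binary correctness rewards, the normalization looks like this:

```python
import torch

# 8 completions for one prompt, scored by a binary correctness check:
rewards = torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
mu, sigma = rewards.mean(), rewards.std()
advantages = (rewards - mu) / (sigma + 1e-8)
# The two correct completions receive positive advantages, the six
# incorrect ones smaller-magnitude negative advantages, and the group
# sums to ~0 -- the signal is purely relative within the group.
```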
GRPO objective (summary):
For a prompt $x$, sample $G$ completions $y_1, \dots, y_G$. The group-relative advantage is:

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}$$

These advantages feed into PPO’s clipped objective at the token level, plus a KL penalty against the reference model:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\Bigg[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\Big(\rho_{i,t}\, \hat{A}_i,\; \mathrm{clip}\big(\rho_{i,t},\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_i\Big)\Bigg] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]$$

where $\rho_{i,t} = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}$ is the per-token probability ratio.

Key observation: $\hat{A}_i$ is the same for all tokens in response $y_i$ — the advantage is at the response level, not the token level. This is precisely because GRPO replaces per-token credit assignment (which requires a critic) with per-response scoring.
For the full derivation, KL estimator details, and the connection to REINFORCE, see the GRPO deep dive.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
from typing import List, Dict, Tuple

import numpy as np


@dataclass
class GRPOConfig:
    """Configuration for GRPO training."""
    group_size: int = 8      # Completions per prompt
    clip_range: float = 0.2  # PPO clip parameter
    beta: float = 0.04       # KL penalty coefficient
    learning_rate: float = 1e-6
    max_new_tokens: int = 256
    temperature: float = 0.7
    ppo_epochs: int = 1      # Update epochs per batch


def grpo_step(
    policy: nn.Module,
    ref_policy: nn.Module,
    prompts: List[str],
    tokenizer,
    reward_fn,
    config: GRPOConfig,
    optimizer: torch.optim.Optimizer
) -> Dict[str, float]:
    """
    One GRPO training step.

    Args:
        policy: Current policy model
        ref_policy: Frozen reference model
        prompts: Batch of prompts
        tokenizer: Tokenizer
        reward_fn: Function(prompt, response) -> float
        config: GRPO configuration
        optimizer: Optimizer for policy

    Returns:
        Dictionary of training metrics
    """
    device = next(policy.parameters()).device
    all_metrics = []

    for prompt in prompts:
        # Step 1: Generate G completions
        prompt_ids = tokenizer(
            prompt, return_tensors='pt'
        ).input_ids.to(device)
        prompt_len = prompt_ids.shape[1]

        # Repeat prompt for group generation
        repeated_ids = prompt_ids.repeat(config.group_size, 1)
        with torch.no_grad():
            outputs = policy.generate(
                repeated_ids,
                max_new_tokens=config.max_new_tokens,
                do_sample=True,
                temperature=config.temperature,
            )
        response_ids = outputs[:, prompt_len:]
        responses = tokenizer.batch_decode(
            response_ids, skip_special_tokens=True
        )

        # Step 2: Score each completion
        rewards = torch.tensor([
            reward_fn(prompt, resp) for resp in responses
        ], dtype=torch.float32, device=device)

        # Step 3: Compute group-relative advantages
        mu = rewards.mean()
        sigma = rewards.std() + 1e-8
        advantages = (rewards - mu) / sigma  # [G]

        # Step 4: Compute old log-probs (for ratio)
        with torch.no_grad():
            old_logits = policy(outputs).logits
            old_log_probs = F.log_softmax(old_logits, dim=-1)
            old_token_log_probs = old_log_probs[:, prompt_len - 1:-1, :].gather(
                dim=-1, index=response_ids.unsqueeze(-1)
            ).squeeze(-1)  # [G, T]

            # Reference log-probs for KL
            ref_logits = ref_policy(outputs).logits
            ref_log_probs = F.log_softmax(ref_logits, dim=-1)
            ref_token_log_probs = ref_log_probs[:, prompt_len - 1:-1, :].gather(
                dim=-1, index=response_ids.unsqueeze(-1)
            ).squeeze(-1)  # [G, T]

        # Step 5: PPO update with clipped objective
        for _ in range(config.ppo_epochs):
            new_logits = policy(outputs).logits
            new_log_probs = F.log_softmax(new_logits, dim=-1)
            new_token_log_probs = new_log_probs[:, prompt_len - 1:-1, :].gather(
                dim=-1, index=response_ids.unsqueeze(-1)
            ).squeeze(-1)  # [G, T]

            # Per-token ratio
            ratio = torch.exp(
                new_token_log_probs - old_token_log_probs
            )  # [G, T]

            # Expand advantages to token level
            adv = advantages.unsqueeze(1).expand_as(ratio)  # [G, T]

            # Clipped surrogate loss
            pg_loss1 = -adv * ratio
            pg_loss2 = -adv * torch.clamp(
                ratio, 1 - config.clip_range, 1 + config.clip_range
            )
            pg_loss = torch.max(pg_loss1, pg_loss2).mean()

            # KL penalty (unbiased estimator)
            log_ratio = ref_token_log_probs - new_token_log_probs
            kl = (torch.exp(log_ratio) - log_ratio - 1).mean()

            loss = pg_loss + config.beta * kl

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
            optimizer.step()

        all_metrics.append({
            'loss': loss.item(),
            'mean_reward': rewards.mean().item(),
            'reward_std': rewards.std().item(),
            'kl': kl.item(),
            'mean_advantage': advantages.mean().item(),
        })

    # Average across prompts
    return {
        k: float(np.mean([m[k] for m in all_metrics]))
        for k in all_metrics[0].keys()
    }
```

When GRPO shines:
- Verifiable rewards: Math problems where you can check correctness, code tasks where you can run tests. No reward model needed — just a binary “correct or incorrect” signal.
- Reasoning tasks: The group sampling naturally explores different reasoning paths, and GRPO reinforces the ones that reach correct answers.
- Memory-constrained settings: Only 2 models in memory (policy + reference), compared to PPO’s 4.
The tradeoff: GRPO needs to generate $G$ completions per prompt, which costs inference time. For $G = 8$, you generate 8x more tokens per training step than PPO. But generation is typically cheaper than training a separate critic, and the algorithmic simplicity makes implementation much easier.
Modern Variants: SimPO, KTO, and ORPO
SimPO: Removes the reference model entirely. Uses the average log-probability of the response as an implicit reward, eliminating the need to store and query a reference policy. Simpler than DPO with competitive performance.
KTO: Works with unpaired data: you only need “this response is good” or “this response is bad” labels — no pairwise comparisons required. Inspired by prospect theory from behavioral economics, it handles the asymmetry between gains and losses.
ORPO: Combines SFT and alignment in a single training step. Adds a preference penalty directly to the language modeling loss, requiring no reference model and no separate alignment phase. The simplest pipeline of all.
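As an illustration of how lean these objectives get, here is a SimPO-style loss following the description above: length-normalized policy log-probs serve as implicit rewards, no reference model is queried, and a target margin is enforced. The `simpo_loss` helper, default `beta`/`gamma` values, and toy numbers are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_len, rejected_len,
               beta=2.0, gamma=0.5):
    """SimPO-style loss sketch: average (length-normalized) log-probs
    act as implicit rewards; gamma is a target reward margin.
    `*_logps` are summed response log-probs, `*_len` token counts."""
    r_chosen = beta * chosen_logps / chosen_len
    r_rejected = beta * rejected_logps / rejected_len
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()

# Toy tensors standing in for batch statistics:
loss = simpo_loss(torch.tensor([-40.0]), torch.tensor([-90.0]),
                  torch.tensor([20.0]), torch.tensor([30.0]))
```

Note the contrast with the DPO loss: there is no `ref_chosen`/`ref_rejected` term, so only one model needs to be in memory.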
The trend is clear: each generation of algorithms removes a component. PPO needs four models. DPO needs two. SimPO and ORPO need just one. The field is converging on approaches where alignment is built into the training objective itself, rather than layered on as a separate stage.
Algorithm Selection Guide
Comparison Table
| Algorithm | Models in Memory | Data Type | Online / Offline | Best For |
|---|---|---|---|---|
| PPO | 4 (policy, ref, critic, reward) | Reward model scores | Online | General alignment with fine credit assignment |
| DPO | 2 (policy, reference) | Paired preferences | Offline | Simple alignment from preference datasets |
| GRPO | 2 (policy, reference) | Any scalar reward | Online | Reasoning tasks with verifiable rewards |
| KTO | 2 (policy, reference) | Unpaired good/bad labels | Offline | When pairwise data is unavailable |
| SimPO | 1 (policy only) | Paired preferences | Offline | Maximum memory efficiency |
Unifying perspective:
All of these algorithms optimize variants of the same core objective:

$$\max_{\pi}\; \mathbb{E}_{y \sim \pi}\big[\mathrm{quality}(y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi \,\|\, \pi_{\mathrm{ref}}\big]$$
They differ in:
- How quality is measured: reward model (PPO), implicit reward from preferences (DPO), group-relative rewards (GRPO), correctness checks (GRPO with verifiable rewards)
- How divergence is controlled: explicit KL penalty (PPO, GRPO), implicit through the loss (DPO), or removed entirely (SimPO)
- Whether the critic is needed: Yes (PPO), No (DPO, GRPO, KTO, SimPO)
- Whether training is online: Yes (PPO, GRPO), No (DPO, KTO, SimPO)
The evolution from PPO to GRPO is the story of discovering which components are essential and which can be replaced by simpler alternatives.
Summary
The landscape of RL algorithms for LLMs has evolved rapidly:
- PPO is the classic RLHF algorithm: powerful but memory-hungry, requiring four models and careful tuning of the value function, KL coefficient, and clipping parameters
- DPO eliminates the reward model and the RL loop entirely, turning alignment into supervised learning on preference pairs — simpler and cheaper, but limited to offline data
- GRPO eliminates the critic by using group statistics as a baseline, combining PPO’s online learning with DPO’s simplicity — the algorithm behind DeepSeek R1’s reasoning capabilities
- Modern variants (SimPO, KTO, ORPO) push further toward simplicity, removing even the reference model
The trend is unmistakable: simpler is better, as long as you can preserve the core learning signal. Each algorithm removes a component while maintaining (or improving) alignment quality.
GRPO is more than an efficiency improvement over PPO. When combined with verifiable rewards from math and code, it produces something remarkable: models that learn to reason step by step, without ever being taught chain-of-thought. The next section explores how DeepSeek R1 used GRPO to create emergent reasoning.