Building a Reasoning Model
The best way to understand an algorithm is to build it. Andrej Karpathy demonstrated this in 2016 when he distilled policy gradients into 130 lines of numpy that learned to play Pong. No frameworks, no abstractions — just the raw algorithm, laid bare.
We are going to do the same thing for GRPO. By the end of this subsection, you will have walked through a real GRPO implementation line by line, understood every simplification it makes, and written your own version. The goal is not production code — it is comprehension.
This subsection is code-heavy. It assumes you are comfortable with the REINFORCE algorithm (especially baselines) and have read the RL Algorithms for LLMs section covering GRPO’s theory. We will be connecting those ideas to real code.
Karpathy’s nanochat: GRPO in ~8K Lines
nanochat is Karpathy’s end-to-end implementation of a ChatGPT-style training pipeline. It covers everything from data preparation through SFT to RL. The RL stage lives in scripts/chat_rl.py, and it implements a stripped-down version of GRPO that removes nearly every piece of complexity.
Why does this matter? Because the simplified version still works. GSM8K accuracy jumps from around 60% to 75% after RL training. That tells us something profound: the core mechanism of GRPO — sampling a group, comparing rewards, reinforcing the winners — is the part that matters. Everything else is engineering polish.
What nanochat Strips Away
The core loop keeps only four ingredients:
- Group sampling: generate G completions per prompt
- Binary rewards: correct answer = 1, wrong = 0
- Advantage = reward - group mean
- Token-level policy gradient weighted by advantage
And it drops four pieces of the full algorithm:
- No reference model or KL penalty
- No PPO ratios or clipping
- No sigma normalization (subtracts the group mean only)
- No multiple update epochs per batch
Each simplification has a justification:
- No KL penalty: The model is trained for a small number of steps, so it does not drift far from the SFT checkpoint. Short training is the implicit regularizer.
- No PPO clipping: Without importance ratios, there is nothing to clip. The update is purely on-policy — each batch of completions is generated by the current model and used exactly once.
- No sigma normalization: Dividing by standard deviation helps stabilize training when reward magnitudes vary. With binary rewards (0 or 1), the variance is naturally bounded, so this is less critical.
- No multiple epochs: On-policy means each sample is used once and discarded. This avoids staleness but is less sample-efficient.
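To see why sigma normalization matters less for binary rewards, here is a quick sketch in plain Python (illustrative, not nanochat's code): with rewards confined to 0 and 1, mean-subtracted advantages are already bounded, and dividing by sigma only rescales them; with rewards of varying magnitude, the division does real work.

```python
def advantages(rewards, use_sigma=False):
    """Group-relative advantages: subtract the mean, optionally divide by std."""
    mean = sum(rewards) / len(rewards)
    adv = [r - mean for r in rewards]
    if use_sigma:
        # Population standard deviation of the group
        std = (sum(a * a for a in adv) / len(adv)) ** 0.5
        adv = [a / (std + 1e-8) for a in adv]
    return adv

binary = [1.0, 1.0, 0.0, 0.0]                 # binary rewards: spread is bounded
print(advantages(binary))                     # [0.5, 0.5, -0.5, -0.5]
print(advantages(binary, use_sigma=True))     # same signs, rescaled to roughly +/-1

scaled = [10.0, 0.1, 0.0, 0.0]                # varying magnitudes: sigma tames the scale
print(advantages(scaled, use_sigma=True))
```

With binary rewards the normalized version changes only the scale, never the sign, which is why nanochat can drop it without changing which completions get reinforced.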
The result is essentially REINFORCE with a group mean baseline. If you understood REINFORCE with baselines, you already understand the core of this algorithm.
The Algorithm, Step by Step
Here is the conceptual algorithm from nanochat’s scripts/chat_rl.py, annotated for clarity. Attribution: github.com/karpathy/nanochat.
```python
# Karpathy's simplified GRPO (conceptual, from nanochat)
# Attribution: github.com/karpathy/nanochat
for batch in training_data:
    prompts = sample_gsm8k_problems(batch_size)

    # Step 1: Generate G completions per prompt
    completions = []
    for prompt in prompts:
        for _ in range(G):
            completion = model.generate(prompt, max_tokens=512)
            completions.append(completion)

    # Step 2: Compute rewards (binary: correct answer or not)
    prompts_repeated = [p for p in prompts for _ in range(G)]
    rewards = []
    for prompt, completion in zip(prompts_repeated, completions):
        answer = extract_answer(completion)  # Parse "#### 42" format
        correct_answer = get_ground_truth(prompt)
        reward = 1.0 if answer == correct_answer else 0.0
        rewards.append(reward)

    # Step 3: Compute advantages (per-prompt group normalization)
    advantages = []
    for i in range(0, len(rewards), G):
        group = rewards[i:i + G]
        mean_reward = sum(group) / len(group)
        # Simple: advantage = reward - mean (no division by sigma)
        for r in group:
            advantages.append(r - mean_reward)

    # Step 4: Policy gradient update
    # Weight the log-probability of each token by its completion's advantage
    loss = 0
    for completion, advantage in zip(completions, advantages):
        log_probs = model.log_prob(completion)
        # Token-level weighting
        loss -= (log_probs * advantage).sum()
    loss.backward()
    optimizer.step()
```

Let us connect this to the math. For a prompt $x$ with completions $y_1, \dots, y_G$, each scored with binary reward $r_i \in \{0, 1\}$, nanochat computes:

$$A_i = r_i - \frac{1}{G} \sum_{j=1}^{G} r_j$$
The gradient update is:

$$\nabla_\theta J(\theta) = \sum_{i=1}^{G} A_i \sum_{t=1}^{|y_i|} \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})$$
Compare this to the REINFORCE with baseline gradient from Chapter 2020:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\left[(R(y) - b)\, \nabla_\theta \log \pi_\theta(y \mid x)\right]$$
They are the same equation. The "group mean" is the baseline $b$. The sum over tokens is the expansion of $\nabla_\theta \log \pi_\theta(y \mid x)$ for an autoregressive policy. Nanochat's "simplified GRPO" is REINFORCE with a group mean baseline, applied token by token.
With binary rewards (0 or 1) and a group size of $G = 8$, suppose 3 out of 8 completions are correct. The group mean is $3/8 = 0.375$, so correct completions get advantage $+0.625$ and wrong completions get $-0.375$. The model is pushed to generate more responses like the correct ones and fewer like the wrong ones. No reward model needed — just an answer checker.
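This arithmetic can be checked in a few lines (a hypothetical group of 8 with 3 correct answers, plain Python):

```python
G = 8
rewards = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # 3 of 8 completions correct
mean = sum(rewards) / G                              # group baseline: 0.375
advantages = [r - mean for r in rewards]

print(advantages[0])   # 0.625  -> correct completions are pushed up
print(advantages[-1])  # -0.375 -> wrong completions are pushed down
```

Note the asymmetry: when most completions are wrong, the few correct ones receive a large positive advantage, so rare successes are reinforced strongly.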
Expected Results
After the RL stage, nanochat's GSM8K accuracy rises from roughly 60% (post-SFT) to about 75% — the 15-percentage-point gain quoted earlier, achieved with the minimal algorithm above.
Our Simplified GRPO Implementation
Now let us build our own version. We will keep closer to the full GRPO specification than nanochat does, including the reference model, KL penalty, and PPO clipping, but the code will remain compact and readable. Think of this as the “textbook” version: faithful to the paper, annotated for learning.
The implementation connects three algorithms you have already studied:
- REINFORCE provides the core gradient: weight log-probabilities by advantages
- Baselines reduce variance: the group mean replaces the learned value function
- PPO clipping prevents catastrophic updates: bound the probability ratio
For the full tested implementation, see code/rlbook/agents/grpo.py.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGRPO:
    """Simplified GRPO trainer for educational purposes.

    Implements Group Relative Policy Optimization:
      1. Sample G completions per prompt
      2. Score each with a reward function
      3. Normalize rewards within the group (advantages)
      4. Update policy with clipped objective + KL penalty

    This is REINFORCE + group baselines + PPO clipping.
    """

    def __init__(self, model, ref_model, lr=1e-5, group_size=8,
                 clip_eps=0.2, kl_coef=0.01):
        self.model = model
        self.ref_model = ref_model  # Frozen copy for KL computation
        self.optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        self.group_size = group_size
        self.clip_eps = clip_eps
        self.kl_coef = kl_coef

    def compute_advantages(self, rewards):
        """Group-relative advantage normalization.

        Args:
            rewards: Tensor of shape [batch_size, group_size]
        Returns:
            Normalized advantages [batch_size, group_size]
        """
        mean = rewards.mean(dim=1, keepdim=True)
        std = rewards.std(dim=1, keepdim=True)
        return (rewards - mean) / (std + 1e-8)

    def train_step(self, prompts, reward_fn):
        """One GRPO training step.

        Args:
            prompts: List of prompt strings
            reward_fn: Callable(prompt, completion) -> float
        Returns:
            Dictionary of training metrics
        """
        all_completions = []
        all_rewards = []

        # --- Phase 1: Sample G completions per prompt ---
        for prompt in prompts:
            completions = [
                self.model.generate(prompt)
                for _ in range(self.group_size)
            ]
            rewards = torch.tensor([
                reward_fn(prompt, c) for c in completions
            ])
            all_completions.extend(completions)
            all_rewards.append(rewards)

        rewards = torch.stack(all_rewards)             # [B, G]
        advantages = self.compute_advantages(rewards)  # [B, G]

        # --- Phase 2: Policy gradient with clipping ---
        total_loss = 0
        for i, prompt in enumerate(prompts):
            for j in range(self.group_size):
                completion = all_completions[i * self.group_size + j]
                adv = advantages[i, j]

                # Current policy per-token log-probabilities
                log_prob = self.model.log_prob(prompt, completion)

                # Reference policy log-probabilities (no gradient)
                with torch.no_grad():
                    old_log_prob = self.ref_model.log_prob(
                        prompt, completion
                    )

                # Importance sampling ratio (per token)
                ratio = torch.exp(log_prob - old_log_prob)

                # Clipped surrogate objective (from PPO)
                unclipped = ratio * adv
                clipped = torch.clamp(
                    ratio,
                    1 - self.clip_eps,
                    1 + self.clip_eps
                ) * adv
                # Sum over tokens so each completion contributes a scalar
                policy_loss = -torch.min(unclipped, clipped).sum()

                # KL divergence penalty (simplified; see grpo-and-reasoning
                # for DeepSeek's unbiased estimator: e^r - r - 1)
                kl = (log_prob - old_log_prob).mean()

                total_loss += policy_loss + self.kl_coef * kl

        # --- Phase 3: Update ---
        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()

        return {
            'loss': total_loss.item(),
            'mean_reward': rewards.mean().item(),
            'mean_advantage': advantages.mean().item(),
        }
```

Our implementation computes the full GRPO objective. For each prompt $x$ with group $\{y_1, \dots, y_G\}$:
Step 1 — Group-relative advantages:

$$A_i = \frac{r_i - \mu}{\sigma + \epsilon}$$
where $\mu$ and $\sigma$ are the group mean and standard deviation. Unlike nanochat, we include the sigma normalization for stability.
Step 2 — Clipped surrogate loss (per token):

$$\mathcal{L}_{\text{clip}} = -\min\left(\rho_i A_i,\ \operatorname{clip}(\rho_i,\ 1 - \epsilon_{\text{clip}},\ 1 + \epsilon_{\text{clip}})\, A_i\right)$$
where $\rho_i = \pi_\theta(y_i \mid x) / \pi_{\text{old}}(y_i \mid x)$ is the importance ratio against the previous policy iteration. In our simplified implementation, we use the reference model as the denominator since we perform a single update per batch (making $\rho_i \approx 1$ early in training).
Step 3 — KL regularization:

$$\mathcal{L} = \mathcal{L}_{\text{clip}} + \beta\, \mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]$$

where $\beta$ is the kl_coef hyperparameter.
The KL term prevents the policy from drifting too far from the reference, guarding against reward hacking.
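To make the clipping in Step 2 concrete, here is a toy computation with made-up numbers (plain Python mirroring the torch code above, not part of the class): a completion whose probability has grown well past $1 + \epsilon_{\text{clip}}$ gets its incentive capped.

```python
import math

clip_eps = 0.2
adv = 1.0                          # a positive advantage (a "winner")
log_prob = 0.5                     # current policy log-prob (illustrative)
old_log_prob = 0.0                 # reference policy log-prob

ratio = math.exp(log_prob - old_log_prob)                 # ~1.65, above 1 + eps
unclipped = ratio * adv
clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps) * adv
loss = -min(unclipped, clipped)    # surrogate takes the smaller incentive

print(round(loss, 2))  # -1.2: the reward for this token is capped at (1 + eps) * adv
```

For positive advantages, the min caps how much the policy is rewarded for increasing a token's probability; for negative advantages, it caps how much it is punished. Either way, one batch cannot move the policy arbitrarily far.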
This implementation processes completions sequentially and does not batch GPU operations. A production GRPO implementation (like the one in nanochat or TRL) batches generation and log-probability computation across the group for efficiency. The algorithm is the same; the engineering differs by orders of magnitude in throughput.
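The batching that paragraph describes comes down to computing per-token log-probabilities for all completions in one forward pass. A minimal sketch of the core tensor operation (the function name and shapes are illustrative, not nanochat's or TRL's actual code):

```python
import torch
import torch.nn.functional as F

def token_log_probs(logits, target_ids):
    """Per-token log-probs of the sampled tokens, for a whole batch at once.

    logits:     [B, T, V] model outputs over the vocabulary
    target_ids: [B, T]    the tokens actually generated
    returns:    [B, T]    log pi(token_t | context) for each position
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick out the log-prob of each generated token
    return log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)

# One batched call replaces the per-completion Python loop in SimpleGRPO.
logits = torch.randn(4, 6, 10)                 # B=4 completions, T=6 tokens, V=10 vocab
targets = torch.randint(0, 10, (4, 6))
print(token_log_probs(logits, targets).shape)  # torch.Size([4, 6])
```

In practice completions of different lengths are padded to a common T and a mask zeroes out the padded positions before the loss is summed.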
What Each Piece Contributes
To appreciate why the full GRPO has these components, consider what breaks when you remove them:
- Remove the group baseline, and you are back to raw REINFORCE: the gradient direction is right on average, but the variance makes training slow and erratic
- Remove clipping, and a single batch with a large importance ratio can take a destructive update the policy may not recover from
- Remove the KL penalty, and longer training runs drift far from the reference model, opening the door to reward hacking
You already know all the ingredients. GRPO is just REINFORCE (the core gradient, Chapter 2020) + group baselines (variance reduction without a critic, Chapter 2020) + PPO clipping (stable updates, Chapter 2040). This is the culmination of the policy gradient journey: the same ideas that trained an agent to play Pong now train language models to reason through math problems.
Exercises
1. Set kl_coef=0 and train for an extended run. Monitor the model's output diversity. Does it collapse to producing the same response pattern repeatedly? How many steps before degradation becomes visible?
2. Set clip_eps=1000 (effectively infinite). This reduces GRPO to REINFORCE with a group baseline. Compare training curves. How much does clipping help for stability?

For hands-on experimentation, clone nanochat and modify scripts/chat_rl.py directly. The codebase is designed to be readable and hackable. For a lighter-weight version, use the SimpleGRPO class above with any small language model and a synthetic reward function (e.g., reward longer responses, or reward responses containing a target keyword).
Summary
This subsection covered the practical side of GRPO:
- Karpathy’s nanochat demonstrates that a stripped-down GRPO — essentially REINFORCE with a group mean baseline — is enough to improve math reasoning by 15 percentage points on GSM8K
- The simplifications (no KL, no clipping, no sigma normalization) work because short training limits drift and binary rewards limit variance
- Our implementation adds back the full GRPO components (sigma normalization, PPO clipping, KL penalty) for a more complete picture
- Every component has a purpose: the group baseline is essential, everything else trades simplicity for robustness
- You already knew all the pieces: REINFORCE gave you the gradient, baselines gave you variance reduction, PPO gave you clipping — GRPO is the combination applied to LLMs