Advanced Topics • Part 5 of 6

Building a Reasoning Model

Hands-on with Karpathy's nanochat and simplified GRPO


The best way to understand an algorithm is to build it. Andrej Karpathy demonstrated this in 2016 when he distilled policy gradients into 130 lines of numpy that learned to play Pong. No frameworks, no abstractions — just the raw algorithm, laid bare.

We are going to do the same thing for GRPO. By the end of this subsection, you will have walked through a real GRPO implementation line by line, understood every simplification it makes, and written your own version. The goal is not production code — it is comprehension.

ℹ️Prerequisites

This subsection is code-heavy. It assumes you are comfortable with the REINFORCE algorithm (especially baselines) and have read the RL Algorithms for LLMs section covering GRPO’s theory. We will be connecting those ideas to real code.

Karpathy’s nanochat: GRPO in ~8K Lines

nanochat is Karpathy’s end-to-end implementation of a ChatGPT-style training pipeline. It covers everything from data preparation through SFT to RL. The RL stage lives in scripts/chat_rl.py, and it implements a stripped-down version of GRPO that removes nearly every piece of complexity.

Why does this matter? Because the simplified version still works. GSM8K accuracy jumps from around 60% to 75% after RL training. That tells us something profound: the core mechanism of GRPO — sampling a group, comparing rewards, reinforcing the winners — is the part that matters. Everything else is engineering polish.

What nanochat Strips Away

Kept (The Core)

Group sampling: generate $G$ completions per prompt

Binary rewards: correct answer = 1, wrong = 0

Advantage = reward - group mean

Token-level policy gradient weighted by advantage

Stripped (The Complexity)

No reference model or KL penalty

No PPO ratios or clipping

No sigma normalization (subtracts the group mean only, no division by standard deviation)

No multiple update epochs per batch

Each simplification has a justification:

  • No KL penalty: The model is trained for a small number of steps, so it does not drift far from the SFT checkpoint. Short training is the implicit regularizer.
  • No PPO clipping: Without importance ratios, there is nothing to clip. The update is purely on-policy — each batch of completions is generated by the current model and used exactly once.
  • No sigma normalization: Dividing by standard deviation helps stabilize training when reward magnitudes vary. With binary rewards (0 or 1), the variance is naturally bounded, so this is less critical.
  • No multiple epochs: On-policy means each sample is used once and discarded. This avoids staleness but is less sample-efficient.

The result is essentially REINFORCE with a group mean baseline. If you understood REINFORCE with baselines, you already understand the core of this algorithm.
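That equivalence is easy to sanity-check in code. Here is a minimal sketch of one group update in the nanochat style (illustrative numbers, not nanochat's actual code):

```python
import torch

# One group of G = 4 completions: REINFORCE where the baseline
# is the group mean reward. (Illustrative sketch.)
log_probs = torch.tensor([-3.0, -2.5, -4.0, -3.5], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])   # binary: correct / wrong

advantages = rewards - rewards.mean()          # baseline = group mean (0.5)
loss = -(advantages * log_probs).sum()         # reinforce above-average completions
loss.backward()

# Gradient descent on this loss raises the log-probs of correct completions
# (negative gradient entries) and lowers those of wrong ones (positive entries).
print(log_probs.grad)
```

The gradient is exactly $-\hat{A}_i$ per completion: the two correct completions are pushed up, the two wrong ones pushed down.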

The Algorithm, Step by Step

</>Implementation

Here is the conceptual algorithm from nanochat’s scripts/chat_rl.py, annotated for clarity. Attribution: github.com/karpathy/nanochat.

# Karpathy's simplified GRPO (conceptual, from nanochat)
# Attribution: github.com/karpathy/nanochat

for batch in training_data:
    prompts = sample_gsm8k_problems(batch_size)

    # Step 1: Generate G completions per prompt
    completions = []
    for prompt in prompts:
        for _ in range(G):
            completion = model.generate(prompt, max_tokens=512)
            completions.append(completion)

    # Step 2: Compute rewards (binary: correct answer or not)
    prompts_repeated = [p for p in prompts for _ in range(G)]
    rewards = []
    for prompt, completion in zip(prompts_repeated, completions):
        answer = extract_answer(completion)  # Parse "#### 42" format
        correct_answer = get_ground_truth(prompt)
        reward = 1.0 if answer == correct_answer else 0.0
        rewards.append(reward)

    # Step 3: Compute advantages (per-prompt group normalization)
    advantages = []
    for i in range(0, len(rewards), G):
        group = rewards[i:i + G]
        mean_reward = sum(group) / len(group)
        # Simple: advantage = reward - mean (no division by sigma)
        for r in group:
            advantages.append(r - mean_reward)

    # Step 4: Policy gradient update
    # Weight the log-probability of each token by its completion's advantage
    loss = 0
    for completion, advantage in zip(completions, advantages):
        log_probs = model.log_prob(completion)
        # Token-level weighting
        loss -= (log_probs * advantage).sum()

    loss.backward()
    optimizer.step()
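The `extract_answer` helper is left abstract in the listing above. A minimal parser for GSM8K's `#### 42` answer format might look like this (an illustrative sketch, not nanochat's actual parser):

```python
import re

def extract_answer(completion):
    """Pull the final numeric answer from a completion ending in '#### <answer>'.

    Returns None when no answer marker is present. (Illustrative sketch.)
    """
    match = re.search(r"####\s*(-?[\d,]+)", completion)
    if match is None:
        return None
    return match.group(1).replace(",", "")  # drop thousands separators

print(extract_answer("so she has 18 - 7 = 11 eggs left.\n#### 11"))  # 11
print(extract_answer("I am not sure."))                              # None
```

Returning `None` on a parse failure makes the reward zero for malformed outputs, which quietly penalizes completions that never commit to an answer.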
Mathematical Details

Let us connect this to the math. For a prompt $x$ with $G$ completions $\{y_i\}_{i=1}^G$, each scored with binary reward $r_i$, nanochat computes:

$$\hat{A}_i = r_i - \frac{1}{G}\sum_{j=1}^G r_j$$

The gradient update is:

$$\nabla_\theta L = -\sum_{i=1}^G \hat{A}_i \sum_{t=1}^{T_i} \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})$$

Compare this to the REINFORCE with baseline gradient from Chapter 2020:

$$\nabla_\theta J = \mathbb{E}\left[(R - b)\, \nabla_\theta \log \pi_\theta(\tau)\right]$$

They are the same equation. The “group mean” is the baseline $b$. The sum over tokens is the expansion of $\log \pi_\theta(\tau)$ for an autoregressive policy. Nanochat’s “simplified GRPO” is REINFORCE with a group mean baseline, applied token by token.

💡Why binary rewards work so well

With binary rewards (0 or 1) and a group size of $G = 8$, suppose 3 out of 8 completions are correct. The advantages become: correct completions get $1 - 3/8 = +0.625$, wrong completions get $0 - 3/8 = -0.375$. The model is pushed to generate more responses like the correct ones and fewer like the wrong ones. No reward model needed — just an answer checker.
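The arithmetic is easy to verify (plain Python, not library code):

```python
G = 8
rewards = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # 3 of 8 completions correct
baseline = sum(rewards) / G                          # group mean = 0.375
advantages = [r - baseline for r in rewards]

print(advantages[0])   # 0.625: correct completions are pushed up
print(advantages[-1])  # -0.375: wrong completions are pushed down
```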

Expected Results

nanochat GRPO results on GSM8K: ~60% accuracy before RL, ~75% after. A 15-point accuracy improvement from a simplified algorithm on grade school math problems.

Our Simplified GRPO Implementation

Now let us build our own version. We will keep closer to the full GRPO specification than nanochat does, including the reference model, KL penalty, and PPO clipping, but the code will remain compact and readable. Think of this as the “textbook” version: faithful to the paper, annotated for learning.

The implementation connects three algorithms you have already studied:

  • REINFORCE provides the core gradient: weight log-probabilities by advantages
  • Baselines reduce variance: the group mean replaces the learned value function
  • PPO clipping prevents catastrophic updates: bound the probability ratio
</>Implementation

For the full tested implementation, see code/rlbook/agents/grpo.py.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGRPO:
    """Simplified GRPO trainer for educational purposes.

    Implements Group Relative Policy Optimization:
    1. Sample G completions per prompt
    2. Score each with a reward function
    3. Normalize rewards within the group (advantages)
    4. Update policy with clipped objective + KL penalty

    This is REINFORCE + group baselines + PPO clipping.
    """

    def __init__(self, model, ref_model, lr=1e-5, group_size=8,
                 clip_eps=0.2, kl_coef=0.01):
        self.model = model
        self.ref_model = ref_model  # Frozen copy for KL computation
        self.optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        self.group_size = group_size
        self.clip_eps = clip_eps
        self.kl_coef = kl_coef

    def compute_advantages(self, rewards):
        """Group-relative advantage normalization.

        Args:
            rewards: Tensor of shape [batch_size, group_size]

        Returns:
            Normalized advantages [batch_size, group_size]
        """
        mean = rewards.mean(dim=1, keepdim=True)
        std = rewards.std(dim=1, keepdim=True)
        return (rewards - mean) / (std + 1e-8)

    def train_step(self, prompts, reward_fn):
        """One GRPO training step.

        Args:
            prompts: List of prompt strings
            reward_fn: Callable(prompt, completion) -> float

        Returns:
            Dictionary of training metrics
        """
        all_completions = []
        all_rewards = []

        # --- Phase 1: Sample G completions per prompt ---
        for prompt in prompts:
            completions = [
                self.model.generate(prompt)
                for _ in range(self.group_size)
            ]
            rewards = torch.tensor([
                reward_fn(prompt, c) for c in completions
            ])
            all_completions.extend(completions)
            all_rewards.append(rewards)

        rewards = torch.stack(all_rewards)       # [B, G]
        advantages = self.compute_advantages(rewards)  # [B, G]

        # --- Phase 2: Policy gradient with clipping ---
        total_loss = 0
        for i, prompt in enumerate(prompts):
            for j in range(self.group_size):
                completion = all_completions[i * self.group_size + j]
                adv = advantages[i, j]

                # Current policy log-probabilities
                log_prob = self.model.log_prob(prompt, completion)

                # Reference policy log-probabilities (no gradient)
                with torch.no_grad():
                    old_log_prob = self.ref_model.log_prob(
                        prompt, completion
                    )

                # Importance sampling ratio
                ratio = torch.exp(log_prob - old_log_prob)

                # Clipped surrogate objective (from PPO)
                unclipped = ratio * adv
                clipped = torch.clamp(
                    ratio,
                    1 - self.clip_eps,
                    1 + self.clip_eps
                ) * adv
                policy_loss = -torch.min(unclipped, clipped)

                # KL divergence penalty (simplified; see grpo-and-reasoning
                # for DeepSeek's unbiased estimator: e^r - r - 1)
                kl = (log_prob - old_log_prob).mean()

                total_loss += policy_loss + self.kl_coef * kl

        # --- Phase 3: Update ---
        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()

        return {
            'loss': total_loss.item(),
            'mean_reward': rewards.mean().item(),
            'mean_advantage': advantages.mean().item(),
        }
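The KL term in the code above is marked as simplified, with a pointer to DeepSeek’s unbiased estimator $e^r - r - 1$. The difference between the two estimators is easy to see on toy numbers (a sketch; the log-ratios below are made up):

```python
import torch

# Per-token log-ratio r = log pi_ref(y) - log pi_theta(y),
# for tokens sampled from the current policy (made-up values).
log_ratio = torch.tensor([-0.10, 0.05, -0.30, 0.20])

# Naive estimator (what SimpleGRPO uses): unbiased in expectation,
# but individual per-token terms can be negative, which adds variance.
k1 = -log_ratio

# DeepSeek's estimator e^r - r - 1: also unbiased, and every term is >= 0
# (by convexity of exp), so the penalty never rewards drift on any token.
k3 = torch.exp(log_ratio) - log_ratio - 1

print((k1 < 0).any().item())   # some naive terms go negative
print((k3 >= 0).all().item())  # the e^r - r - 1 terms never do
```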
Mathematical Details

Our implementation computes the full GRPO objective. For each prompt $x$ with group $\{y_i\}_{i=1}^G$:

Step 1 — Group-relative advantages:

$$\hat{A}_i = \frac{r_i - \mu_G}{\sigma_G + \epsilon}$$

where $\mu_G$ and $\sigma_G$ are the group mean and standard deviation. Unlike nanochat, we include the sigma normalization for stability.

Step 2 — Clipped surrogate loss (per token):

$$L_i = -\min\left(\rho_i \hat{A}_i,\ \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, \hat{A}_i\right)$$

where $\rho_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}$ is the importance ratio against the previous policy iteration. In our simplified implementation, we use the reference model as the denominator since we perform a single update per batch (making $\pi_{\theta_{\text{old}}} \approx \pi_{\text{ref}}$ early in training).

Step 3 — KL regularization:

$$L_{\text{total}} = \frac{1}{G}\sum_{i=1}^G L_i + \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$$

The KL term prevents the policy from drifting too far from the reference, guarding against reward hacking.
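The division by $\sigma_G$ in Step 1 is what keeps gradient scale consistent when reward magnitudes vary. A quick demonstration with made-up reward vectors (the helper below is an illustrative sketch, not our class’s method):

```python
import torch

def group_advantages(rewards, use_sigma=True):
    """Group-relative advantages, with or without sigma normalization (sketch)."""
    adv = rewards - rewards.mean()
    if use_sigma:
        adv = adv / (rewards.std() + 1e-8)
    return adv

binary = torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
scaled = binary * 100.0   # same outcome pattern, rewards on a 0-100 scale

# Without sigma, the reward scale leaks directly into the gradient magnitude:
print(group_advantages(binary, use_sigma=False).abs().max().item())  # 0.75
print(group_advantages(scaled, use_sigma=False).abs().max().item())  # 75.0
# With sigma, both groups produce identical, scale-free advantages:
print(torch.allclose(group_advantages(binary), group_advantages(scaled)))  # True
```

With binary rewards the scale never varies, which is why nanochat can afford to drop this step.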

What Each Piece Contributes

To appreciate why the full GRPO has these components, consider what breaks when you remove them:

1. Group baseline (the core). Without it, you have vanilla REINFORCE with enormous variance. The group mean baseline is what makes GRPO work at all; this is the one component you cannot remove.
2. Sigma normalization. Dividing by the standard deviation scales gradients consistently. Without it, prompts where all completions are correct (or all wrong) produce near-zero advantages anyway, but prompts with mixed results can produce outsized updates. Nanochat skips this and compensates with a small learning rate.
3. PPO clipping. Clipping the probability ratio prevents any single update from changing the policy too drastically. Without it, a completion with a very high advantage can cause a huge policy shift. The learning rate partially mitigates this, but clipping provides a hard safety net.
4. KL penalty. The KL term prevents the model from drifting into regions where it generates high-reward but degenerate text (reward hacking). For short training runs like nanochat's, the drift is minimal and the penalty is unnecessary. For longer runs, it is essential.
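The "hard safety net" of clipping is worth seeing in isolation. A toy sketch with made-up numbers:

```python
import torch

clip_eps = 0.2
adv = 2.0                                      # unusually large advantage
ratio = torch.tensor(5.0, requires_grad=True)  # policy drifted far from old policy

unclipped = ratio * adv                        # 10.0: a huge update signal
clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv  # capped at 1.2 * 2
loss = -torch.min(unclipped, clipped)          # min keeps the conservative term
loss.backward()

# Because the clamp is saturated, the gradient through the ratio is zero:
# no matter how large the advantage, this sample cannot push the policy further.
print(ratio.grad)
```

The clamp does not just shrink the update; once the ratio leaves the trust region, the gradient through it vanishes entirely.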
ℹ️Connecting it all together

You already know all the ingredients. GRPO is just REINFORCE (the core gradient, Chapter 2020) + group baselines (variance reduction without a critic, Chapter 2020) + PPO clipping (stable updates, Chapter 2040). This is the culmination of the policy gradient journey: the same ideas that trained an agent to play Pong now train language models to reason through math problems.

Exercises

Exercise 1: Group Size
Modify the group size from $G = 4$ to $G = 16$ and observe the effect on training stability. Smaller groups have higher variance in advantage estimates. Larger groups are more stable but cost more inference. What is the sweet spot for binary rewards?
Exercise 2: Remove the KL Penalty
Set kl_coef=0 and train for an extended run. Monitor the model’s output diversity. Does it collapse to producing the same response pattern repeatedly? How many steps before degradation becomes visible?
Exercise 3: Partial Credit Rewards
Replace the binary reward with partial credit: give 0.5 for correct intermediate steps even if the final answer is wrong. Does this speed up learning? Does it introduce reward hacking (correct steps but wrong answers)?
Exercise 4: REINFORCE Comparison
Remove PPO clipping by setting clip_eps=1000 (effectively infinite). This reduces GRPO to REINFORCE with a group baseline. Compare training curves. How much does clipping help for stability?
💡Running the exercises

For hands-on experimentation, clone nanochat and modify scripts/chat_rl.py directly. The codebase is designed to be readable and hackable. For a lighter-weight version, use the SimpleGRPO class above with any small language model and a synthetic reward function (e.g., reward longer responses, or reward responses containing a target keyword).
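The synthetic reward functions suggested above can be a few lines each. These hypothetical helpers match the `reward_fn(prompt, completion) -> float` signature that `SimpleGRPO.train_step` expects:

```python
def keyword_reward(prompt: str, completion: str, target: str = "because") -> float:
    """Toy binary reward: 1.0 if the completion contains a target keyword.

    Useful for smoke-testing a GRPO loop without a real answer checker.
    (Hypothetical helper, not part of any codebase above.)
    """
    return 1.0 if target in completion.lower() else 0.0

def length_reward(prompt: str, completion: str, cap: int = 200) -> float:
    """Toy shaped reward: longer responses score higher, capped at 1.0."""
    return min(len(completion), cap) / cap

print(keyword_reward("Why?", "Because it rained."))  # 1.0
print(length_reward("Hi", "x" * 50))                 # 0.25
```

Even these trivial rewards are enough to watch the group-relative machinery work: within a few steps the model should visibly shift toward whatever the reward favors.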

Summary

This subsection covered the practical side of GRPO:

  • Karpathy’s nanochat demonstrates that a stripped-down GRPO — essentially REINFORCE with a group mean baseline — is enough to improve math reasoning by 15 percentage points on GSM8K
  • The simplifications (no KL, no clipping, no sigma normalization) work because short training limits drift and binary rewards limit variance
  • Our implementation adds back the full GRPO components (sigma normalization, PPO clipping, KL penalty) for a more complete picture
  • Every component has a purpose: the group baseline is essential, everything else trades simplicity for robustness
  • You already knew all the pieces: REINFORCE gave you the gradient, baselines gave you variance reduction, PPO gave you clipping — GRPO is the combination applied to LLMs