Chapter 202
📝Draft

The Policy Gradient Theorem and REINFORCE

Master the fundamental theorem that enables learning policies through gradient ascent


What You'll Learn

  • State and explain the policy gradient theorem
  • Understand why the log-probability trick makes gradient estimation possible
  • Implement the REINFORCE algorithm from scratch
  • Explain the high variance problem and why baselines help
  • Train a policy gradient agent on CartPole

The Gradient Challenge

In the previous chapter, we established our goal: find parameters $\theta$ that maximize expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

We want to do gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.

But how do we compute $\nabla_\theta J(\theta)$? The expectation is over trajectories, and which trajectories we see depends on $\theta$ (through the policy). This seems to require differentiating through the environment—which we can’t do.

The policy gradient theorem solves this elegantly.

The Log-Probability Trick

Before stating the full theorem, let’s understand the key mathematical insight that makes everything work.

Here’s the core trick. We want to compute:

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)]$$

The problem is that what we’re averaging over (the distribution $p_\theta$) depends on $\theta$.

The log-probability trick rewrites this as:

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}[f(x) \cdot \nabla_\theta \log p_\theta(x)]$$

The gradient moved inside the expectation! Now we can estimate this with samples: collect $x_1, x_2, \ldots$ from $p_\theta$, and average $f(x_i) \cdot \nabla_\theta \log p_\theta(x_i)$.
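
To see the trick in action, here is a small numeric sketch (a standalone toy example, not part of this chapter's RL code): we estimate $\nabla_\theta \mathbb{E}_{x \sim \mathcal{N}(\theta, 1)}[x^2]$ from samples and compare it to the closed-form answer $2\theta$.

</>Implementation
import numpy as np

# Toy check of the log-probability trick (assumes nothing beyond NumPy):
# for x ~ N(theta, 1) and f(x) = x^2, the true gradient of E[f(x)] w.r.t. theta is 2*theta,
# and grad_theta log p_theta(x) = (x - theta).
rng = np.random.default_rng(0)
theta = 1.5

x = rng.normal(theta, 1.0, size=200_000)   # samples from p_theta
f = x ** 2                                  # the function whose expectation we differentiate
score = x - theta                           # grad_theta log N(x; theta, 1)

print("sample estimate:", (f * score).mean())   # should be close to 3.0
print("closed form    :", 2 * theta)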

Mathematical Details

Derivation of the log-probability trick:

Starting with: $\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \nabla_\theta \int p_\theta(x) f(x)\, dx$

We can move the gradient inside (under suitable regularity conditions): $= \int \nabla_\theta p_\theta(x) \cdot f(x)\, dx$

Now the key identity: $\nabla_\theta p_\theta(x) = p_\theta(x) \cdot \nabla_\theta \log p_\theta(x)$

This follows from the chain rule: $\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}$

Substituting: $= \int p_\theta(x) \cdot \nabla_\theta \log p_\theta(x) \cdot f(x)\, dx$

$$= \mathbb{E}_{x \sim p_\theta}[f(x) \cdot \nabla_\theta \log p_\theta(x)]$$

This is remarkable: we’ve converted the gradient of an expectation into an expectation of a gradient—something we can estimate with samples!

Why Log-Probability?

The term $\nabla_\theta \log \pi_\theta(a|s)$ has a beautiful interpretation. It points in the direction that most increases the log-probability of action $a$.

  • If we update $\theta$ in this direction, action $a$ becomes more likely
  • If we update in the opposite direction, action $a$ becomes less likely

The policy gradient multiplies this by the return $R(\tau)$:

  • High return? Move toward actions that led to it
  • Low return? Move away from those actions

It’s a formalization of “reinforce good behavior, discourage bad behavior.”

💡Tip

Think of $\nabla_\theta \log \pi_\theta(a|s)$ as an “arrow” pointing toward making action $a$ more likely. The return $R$ determines whether we follow that arrow (positive return) or go the opposite way (negative return).
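
Here is a tiny sketch of that picture (a hypothetical three-action softmax policy, unrelated to the CartPole agent built later): one gradient step along $\nabla_\theta \log \pi_\theta(a)$ visibly increases the probability of action $a$.

</>Implementation
import torch

# Hypothetical 3-action softmax policy with logits theta.
theta = torch.tensor([0.2, -0.1, 0.0], requires_grad=True)
a = 1  # the action we want to make more likely

log_pi_a = torch.log_softmax(theta, dim=0)[a]
log_pi_a.backward()                              # theta.grad = grad_theta log pi(a)

with torch.no_grad():
    before = torch.softmax(theta, dim=0)[a].item()
    theta_stepped = theta + 0.5 * theta.grad     # "follow the arrow", as for a positive return
    after = torch.softmax(theta_stepped, dim=0)[a].item()

print(f"pi(a) before: {before:.3f}, after one step along the gradient: {after:.3f}")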

The Policy Gradient Theorem

Now we can state the main result.

Mathematical Details

Policy Gradient Theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t|S_t) \cdot G_t\right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} R_{k+1}$ is the return from time $t$.

Equivalently, using Q-values:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(A_t|S_t) \cdot Q^{\pi_\theta}(S_t, A_t)\right]$$

This says: the gradient of expected return equals the expected “score function gradient” weighted by returns (or Q-values).

Let’s unpack what this means:

  1. Sample trajectories by running the policy $\pi_\theta$
  2. For each step, compute $\nabla_\theta \log \pi_\theta(a_t|s_t)$—the direction to make that action more likely
  3. Weight by return $G_t$—good outcomes get positive weight, bad outcomes get negative weight
  4. Average over samples to estimate the gradient

The policy learns to repeat high-return actions and avoid low-return actions. Actions are credited with the returns that followed them.
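
To connect the theorem to code, here is a minimal sketch (a hypothetical three-action softmax policy with made-up sampled actions and returns): autograd of the surrogate quantity $-\sum_t G_t \log \pi_\theta(a_t)$ produces $-\sum_t G_t \nabla_\theta \log \pi_\theta(a_t)$, so minimizing that surrogate is gradient ascent on $J(\theta)$. The REINFORCE implementation later in this chapter relies on the same idea.

</>Implementation
import torch

# Hypothetical: one sampled "trajectory" for a 3-action softmax policy with logits theta.
theta = torch.zeros(3, requires_grad=True)
actions = torch.tensor([0, 2, 2])               # sampled actions a_t (made up for illustration)
returns = torch.tensor([2.71, 1.90, 1.00])      # returns G_t (made up for illustration)

log_pi = torch.log_softmax(theta, dim=0)        # log pi_theta(.) for every action
surrogate = -(log_pi[actions] * returns).sum()  # "loss" whose gradient is minus the estimator
surrogate.backward()

print(theta.grad)   # equals -sum_t G_t * grad_theta log pi_theta(a_t)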

🔍Sketch of the policy gradient theorem derivation

The full derivation involves expanding the trajectory probability and carefully handling the terms. Here’s a sketch:

The probability of a trajectory $\tau = (s_0, a_0, r_1, s_1, a_1, \ldots)$ is:

$$P(\tau|\theta) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) \cdot p(s_{t+1}|s_t, a_t)$$

Taking the log:

$$\log P(\tau|\theta) = \log p(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T-1} \log p(s_{t+1}|s_t, a_t)$$

When we take $\nabla_\theta$, only the policy terms survive (the dynamics $p(s'|s,a)$ don’t depend on $\theta$):

$$\nabla_\theta \log P(\tau|\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)$$

Applying the log-probability trick to $J(\theta) = \mathbb{E}_\tau[R(\tau)]$:

$$\nabla_\theta J(\theta) = \mathbb{E}_\tau\left[R(\tau) \cdot \nabla_\theta \log P(\tau|\theta)\right] = \mathbb{E}_\tau\left[R(\tau) \cdot \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$$

The “reward-to-go” form ($G_t$ instead of the full trajectory return $R(\tau)$) comes from causality—actions can only affect future rewards, not past ones.

The REINFORCE Algorithm

REINFORCE is the simplest algorithm that implements the policy gradient theorem. It’s a Monte Carlo method: it uses complete episode returns.

The REINFORCE algorithm is surprisingly simple:

  1. Run an episode using the current policy, collecting $(s_t, a_t, r_{t+1})$
  2. Compute returns $G_t$ for each time step
  3. Update parameters: $\theta \leftarrow \theta + \alpha \sum_t G_t \nabla_\theta \log \pi_\theta(a_t|s_t)$
  4. Repeat

That’s it! We’re doing gradient ascent on expected return, using Monte Carlo estimates of the gradient.

Mathematical Details

REINFORCE Update:

For each episode, for each time step $t$:

$$\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla_\theta \log \pi_\theta(A_t|S_t)$$

where:

  • $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$
  • The $\gamma^t$ factor accounts for discounting (sometimes omitted)
  • $\alpha$ is the learning rate

For a softmax policy with linear preferences, $\nabla_\theta \log \pi_\theta(a|s)$ has a closed form. For neural network policies, we use automatic differentiation.
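
For reference, for a softmax policy over linear preferences with feature vectors $x(s,a)$ (the features are an assumed parameterization, not something used elsewhere in this chapter), the closed form is:

$$\pi_\theta(a|s) = \frac{\exp\left(\theta^\top x(s,a)\right)}{\sum_b \exp\left(\theta^\top x(s,b)\right)}, \qquad \nabla_\theta \log \pi_\theta(a|s) = x(s,a) - \sum_b \pi_\theta(b|s)\, x(s,b)$$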

</>Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import numpy as np

class PolicyNetwork(nn.Module):
    """Simple neural network policy for discrete actions."""

    def __init__(self, state_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions)
        )

    def forward(self, state):
        """Return action logits."""
        return self.network(state)

    def get_action(self, state):
        """Sample action from policy and return action + log probability."""
        logits = self.forward(state)
        dist = Categorical(logits=logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob


class REINFORCE:
    """REINFORCE algorithm for policy gradient learning."""

    def __init__(self, state_dim, n_actions, lr=0.001, gamma=0.99):
        self.gamma = gamma
        self.policy = PolicyNetwork(state_dim, n_actions)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)

        # Storage for episode data
        self.log_probs = []
        self.rewards = []

    def select_action(self, state):
        """Select action using current policy."""
        state = torch.FloatTensor(state).unsqueeze(0)
        action, log_prob = self.policy.get_action(state)
        self.log_probs.append(log_prob)
        return action

    def store_reward(self, reward):
        """Store reward for current step."""
        self.rewards.append(reward)

    def compute_returns(self):
        """Compute discounted returns G_t for each step."""
        returns = []
        G = 0
        # Work backwards through rewards
        for reward in reversed(self.rewards):
            G = reward + self.gamma * G
            returns.insert(0, G)
        returns = torch.FloatTensor(returns)

        # Normalize returns (helps with training stability)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def update(self):
        """Perform REINFORCE update after episode ends."""
        returns = self.compute_returns()

        # Compute policy gradient loss
        # Loss = -sum(log_prob * return) (negative for gradient ascent)
        policy_loss = []
        for log_prob, G in zip(self.log_probs, returns):
            policy_loss.append(-log_prob * G)
        policy_loss = torch.stack(policy_loss).sum()

        # Update policy
        self.optimizer.zero_grad()
        policy_loss.backward()
        self.optimizer.step()

        # Clear episode storage
        self.log_probs = []
        self.rewards = []

        return policy_loss.item()

Training on CartPole

</>Implementation
import gymnasium as gym

def train_reinforce(env_name='CartPole-v1', num_episodes=1000):
    """Train REINFORCE on CartPole."""
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    n_actions = env.action_space.n

    agent = REINFORCE(state_dim, n_actions, lr=0.01, gamma=0.99)
    episode_rewards = []

    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False
        total_reward = 0

        while not done:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            agent.store_reward(reward)
            total_reward += reward
            state = next_state

        # Update policy after episode
        agent.update()
        episode_rewards.append(total_reward)

        if episode % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            print(f"Episode {episode}, Avg Reward (last 100): {avg_reward:.1f}")

    return agent, episode_rewards


# Train the agent
agent, rewards = train_reinforce(num_episodes=1000)
ℹ️Note

You’ll notice the learning curve is noisy—episode rewards jump around even as the average improves. This is the high variance of Monte Carlo methods. Each episode gives a different return, leading to noisy gradient estimates.

The High Variance Problem

REINFORCE works, but it has a fundamental limitation: high variance.

Consider what happens in a single episode. The return $G_t$ might be 100. In the next episode from the same state, it might be 50 due to random outcomes. This variance in returns directly translates to variance in our gradient estimates.

High variance means:

  • Learning is noisy and unstable
  • We need many episodes for reliable updates
  • Convergence is slow

The problem gets worse with:

  • Longer episodes (more randomness compounds)
  • Stochastic environments
  • High-dimensional action spaces
Mathematical Details

The gradient estimate is:

$$\hat{g} = \sum_t G_t \nabla_\theta \log \pi_\theta(a_t|s_t)$$

The variance of this estimate depends on the variance of $G_t$. Even for the same state-action pair, $G_t$ varies across episodes because:

  1. The policy is stochastic (different actions in the future)
  2. The environment may be stochastic (different transitions)
  3. Different episode lengths

This variance doesn’t decrease with more time steps in an episode—it can actually increase!
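
You can see this noise directly by holding the policy fixed and collecting one gradient estimate per episode. The sketch below is a rough diagnostic, not part of the chapter's training code: `measure_gradient_noise` is a hypothetical helper that assumes the REINFORCE agent and CartPole setup defined earlier, and uses un-normalized returns so it measures the raw estimator from the theorem.

</>Implementation
def measure_gradient_noise(n_episodes=50):
    """Diagnostic sketch: one REINFORCE gradient estimate per episode, policy never updated."""
    env = gym.make('CartPole-v1')
    agent = REINFORCE(env.observation_space.shape[0], env.action_space.n)
    grads = []

    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = agent.select_action(state)
            state, reward, term, trunc, _ = env.step(action)
            agent.store_reward(reward)
            done = term or trunc

        # Plain (un-normalized) reward-to-go returns, matching the theorem's estimator
        G, returns = 0.0, []
        for r in reversed(agent.rewards):
            G = r + agent.gamma * G
            returns.insert(0, G)

        loss = torch.stack([-lp * g for lp, g in zip(agent.log_probs, returns)]).sum()
        agent.optimizer.zero_grad()
        loss.backward()
        grads.append(torch.cat([p.grad.flatten() for p in agent.policy.parameters()]).clone())

        agent.log_probs, agent.rewards = [], []   # discard the episode; do NOT update the policy

    grads = torch.stack(grads)
    print("mean gradient norm      :", grads.norm(dim=1).mean().item())
    print("avg per-weight variance :", grads.var(dim=0).mean().item())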

Baselines Reduce Variance

Here’s a remarkable fact: we can subtract a baseline from the returns without changing the expected gradient—but it can dramatically reduce variance.

Mathematical Details

The policy gradient with a baseline $b(s)$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(A_t|S_t) \cdot \big(G_t - b(S_t)\big)\right]$$

Key insight: This has the same expectation as without the baseline, but potentially much lower variance.

Why doesn’t the baseline add bias? Because:

$$\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right] = b(s) \cdot \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\right] = 0$$

The second equality holds because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s) = \sum_a \nabla_\theta \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$.

Think of the baseline as a “reference point” for returns. Without a baseline:

  • Return of 100 → strong positive update
  • Return of 95 → still strong positive update

With a baseline of 97:

  • Return of 100 → modest positive update (100 - 97 = 3)
  • Return of 95 → modest negative update (95 - 97 = -2)

The baseline helps distinguish “good” from “average” rather than just “positive” from “negative.” This is especially important when all returns are positive (or all negative)!
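
Here is a tiny numeric sketch of this effect (a made-up one-state, two-action bandit, not CartPole; `gradient_samples` is a hypothetical helper): both actions pay off around 10 to 11, so without a baseline every sampled gradient is large, while a baseline near the mean return leaves the expected gradient unchanged and collapses the variance.

</>Implementation
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.5])       # softmax policy with theta = [0, 0]
payoffs = np.array([10.0, 11.0])   # hypothetical deterministic rewards: both "good" in absolute terms

def gradient_samples(baseline, n=100_000):
    a = rng.choice(2, size=n, p=probs)
    score = np.eye(2)[a] - probs                     # grad_theta log softmax = onehot(a) - probs
    return (payoffs[a] - baseline)[:, None] * score  # (G - b) * grad log pi

for b in [0.0, 10.5]:
    g = gradient_samples(b)
    print(f"baseline={b:4.1f}  mean={g.mean(axis=0).round(3)}  variance={g.var(axis=0).round(3)}")

Both settings report (approximately) the same mean gradient; only the spread changes.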

REINFORCE with Baseline

</>Implementation
class REINFORCEWithBaseline:
    """REINFORCE with a learned value function baseline."""

    def __init__(self, state_dim, n_actions, lr=0.001, gamma=0.99):
        self.gamma = gamma

        # Policy network (actor)
        self.policy = PolicyNetwork(state_dim, n_actions)

        # Value network (baseline/critic)
        self.value_net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=lr)

        self.log_probs = []
        self.values = []
        self.rewards = []

    def select_action(self, state):
        """Select action and store log prob and value estimate."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)

        # Get action
        action, log_prob = self.policy.get_action(state_tensor)
        self.log_probs.append(log_prob)

        # Get baseline value
        value = self.value_net(state_tensor)
        self.values.append(value)

        return action

    def store_reward(self, reward):
        self.rewards.append(reward)

    def update(self):
        """Update policy and value network."""
        # Compute returns
        returns = []
        G = 0
        for reward in reversed(self.rewards):
            G = reward + self.gamma * G
            returns.insert(0, G)
        returns = torch.FloatTensor(returns)

        # Stack values
        values = torch.cat(self.values).squeeze(-1)  # shape (T,), robust to length-1 episodes

        # Advantages = returns - baseline
        advantages = returns - values.detach()
        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Policy loss (REINFORCE with baseline)
        policy_loss = []
        for log_prob, advantage in zip(self.log_probs, advantages):
            policy_loss.append(-log_prob * advantage)
        policy_loss = torch.stack(policy_loss).sum()

        # Value loss (MSE between predicted values and returns)
        value_loss = nn.functional.mse_loss(values, returns)

        # Update policy
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

        # Update value network
        self.value_optimizer.zero_grad()
        value_loss.backward()
        self.value_optimizer.step()

        # Clear storage
        self.log_probs = []
        self.values = []
        self.rewards = []

        return policy_loss.item(), value_loss.item()
💡Tip

The optimal baseline is close to the value function $V(s)$. That’s why we learn a value network as our baseline. This is our first step toward actor-critic methods—the topic of the next chapter.

Reward-to-Go: Causality Matters

There’s another variance reduction technique: using reward-to-go instead of the full episode return.

In basic REINFORCE, we weight each $\nabla_\theta \log \pi_\theta(a_t|s_t)$ by the full return $G_0$. But action $a_t$ can only affect rewards from time $t$ onward—not past rewards!

Including past rewards adds noise without adding information. Instead, we should use:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$$

This is the return from time $t$, not the full episode return. It’s what the action at time $t$ actually influenced.

Mathematical Details

The reward-to-go version of the policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t|S_t) \cdot \left(\sum_{k=t}^{T} \gamma^{k-t} R_{k+1}\right)\right]$$

Compare to using full return:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t|S_t) \cdot \left(\sum_{k=0}^{T} \gamma^{k} R_{k+1}\right)\right]$$

Both have the same expectation (the terms pairing a past reward with the score of a later action vanish in expectation), but reward-to-go has lower variance.
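
As a quick sanity check, the standalone sketch below (with made-up rewards; `full_return_weights` and `reward_to_go_weights` are hypothetical helpers mirroring the backward recursion in compute_returns) prints the weights each scheme assigns to a three-step episode.

</>Implementation
def full_return_weights(rewards, gamma=0.99):
    """Every step weighted by the same full-episode return G_0 (the naive scheme)."""
    G0 = sum(gamma ** k * r for k, r in enumerate(rewards))
    return [G0] * len(rewards)

def reward_to_go_weights(rewards, gamma=0.99):
    """Step t weighted only by rewards from t onward (what compute_returns does)."""
    out, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        out.insert(0, G)
    return out

print(full_return_weights([1, 1, 1], gamma=0.9))   # approximately [2.71, 2.71, 2.71]
print(reward_to_go_weights([1, 1, 1], gamma=0.9))  # approximately [2.71, 1.9, 1.0]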

ℹ️Note

Our implementation already uses reward-to-go—we compute $G_t$ for each time step, not a single return for the whole episode. This is standard practice.

REINFORCE in Action

Let’s compare vanilla REINFORCE with the baseline version:

</>Implementation
def compare_reinforce_variants(num_episodes=1000, num_seeds=5):
    """Compare vanilla REINFORCE vs. with baseline."""
    import matplotlib.pyplot as plt

    results = {'vanilla': [], 'baseline': []}

    for seed in range(num_seeds):
        # Set random seed
        torch.manual_seed(seed)
        np.random.seed(seed)

        # Train vanilla REINFORCE
        env = gym.make('CartPole-v1')
        agent = REINFORCE(4, 2, lr=0.01, gamma=0.99)
        rewards = []
        for _ in range(num_episodes):
            state, _ = env.reset()
            done = False
            ep_reward = 0
            while not done:
                action = agent.select_action(state)
                state, reward, term, trunc, _ = env.step(action)
                agent.store_reward(reward)
                ep_reward += reward
                done = term or trunc
            agent.update()
            rewards.append(ep_reward)
        results['vanilla'].append(rewards)

        # Train REINFORCE with baseline
        env = gym.make('CartPole-v1')
        agent = REINFORCEWithBaseline(4, 2, lr=0.01, gamma=0.99)
        rewards = []
        for _ in range(num_episodes):
            state, _ = env.reset()
            done = False
            ep_reward = 0
            while not done:
                action = agent.select_action(state)
                state, reward, term, trunc, _ = env.step(action)
                agent.store_reward(reward)
                ep_reward += reward
                done = term or trunc
            agent.update()
            rewards.append(ep_reward)
        results['baseline'].append(rewards)

    # Plot results
    def smooth(data, window=50):
        return np.convolve(data, np.ones(window)/window, mode='valid')

    plt.figure(figsize=(10, 6))
    for name, runs in results.items():
        mean = np.mean(runs, axis=0)
        std = np.std(runs, axis=0)
        smoothed = smooth(mean)
        x = np.arange(len(smoothed))
        plt.plot(x, smoothed, label=name)
        plt.fill_between(x, smooth(mean - std), smooth(mean + std), alpha=0.2)

    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title('REINFORCE: Vanilla vs. With Baseline')
    plt.legend()
    plt.savefig('reinforce_comparison.png')
    plt.show()


# Run comparison
compare_reinforce_variants()

Summary

Key Takeaways

  • The policy gradient theorem shows that $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(A_t|S_t) \cdot G_t]$
  • The log-probability trick converts the gradient of an expectation into an expectation of a gradient
  • REINFORCE is a Monte Carlo policy gradient algorithm: run episodes, compute returns, update policy
  • High variance is REINFORCE’s main weakness—gradient estimates are noisy
  • Baselines reduce variance without adding bias: use $G_t - b(S_t)$ instead of $G_t$
  • Reward-to-go further reduces variance by only using future rewards
  • A learned value function makes an effective baseline

REINFORCE is elegant and it works, but it’s slow and noisy. We wait for entire episodes, use Monte Carlo returns with high variance, and throw away experience after one update.

What if we could:

  • Learn from single steps (like TD learning)?
  • Use a learned value function to estimate returns (lower variance)?
  • Update more frequently?

That’s exactly what actor-critic methods do—combining the best of policy gradients and value-based learning.

Next Chapter: Actor-Critic Methods

Exercises

Conceptual Questions

  1. Explain the log-probability trick in your own words. Why is it necessary for policy gradients? What would go wrong without it?

  2. Why does REINFORCE have high variance? Identify at least two sources of randomness that contribute to variance in gradient estimates.

  3. Prove that subtracting a baseline doesn’t add bias. Specifically, show that $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$.

  4. What is causality in the context of policy gradients? Why does reward-to-go reduce variance?

  5. Compare REINFORCE to Q-learning. Which is on-policy? Which requires complete episodes? Which typically has higher variance?

Coding Challenges

  1. Implement REINFORCE from scratch and train on CartPole. Plot the learning curve over 1000 episodes. What’s the average episode length after training?

  2. Add a simple baseline (running average of recent returns) and compare the learning curve to vanilla REINFORCE. Does it help?

  3. Implement reward-to-go (if not already using it) and compare to using full episode return for each step. Measure the variance of gradient estimates.

Exploration

  1. Learning rate sensitivity. Run REINFORCE with different learning rates (0.001, 0.01, 0.1). What do you observe? Why is policy gradient particularly sensitive to learning rate?

  2. Entropy exploration. Add an entropy bonus to the policy loss: $L = -\sum_t G_t \log \pi_\theta(a_t|s_t) - \beta H(\pi_\theta)$. How does this affect exploration and learning?