The Policy Gradient Theorem and REINFORCE
What You'll Learn
- State and explain the policy gradient theorem
- Understand why the log-probability trick makes gradient estimation possible
- Implement the REINFORCE algorithm from scratch
- Explain the high variance problem and why baselines help
- Train a policy gradient agent on CartPole
The Gradient Challenge
In the previous chapter, we established our goal: find parameters $\theta$ that maximize expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$$

We want to do gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.

But how do we compute $\nabla_\theta J(\theta)$? The expectation is over trajectories, and which trajectories we see depends on $\theta$ (through the policy). This seems to require differentiating through the environment—which we can’t do.
The policy gradient theorem solves this elegantly.
The Log-Probability Trick
Before stating the full theorem, let’s understand the key mathematical insight that makes everything work.
Here’s the core trick. We want to compute:

$$\nabla_\theta\, \mathbb{E}_{x \sim p_\theta}\big[f(x)\big]$$

The problem is that what we’re averaging over (the distribution $p_\theta$) depends on $\theta$.

The log-probability trick rewrites this as:

$$\nabla_\theta\, \mathbb{E}_{x \sim p_\theta}\big[f(x)\big] = \mathbb{E}_{x \sim p_\theta}\big[f(x)\, \nabla_\theta \log p_\theta(x)\big]$$

The gradient moved inside the expectation! Now we can estimate this with samples: collect $x_1, \dots, x_N$ from $p_\theta$, and average $f(x_i)\, \nabla_\theta \log p_\theta(x_i)$.
Mathematical Details
Derivation of the log-probability trick:
Starting with:

$$\nabla_\theta\, \mathbb{E}_{x \sim p_\theta}\big[f(x)\big] = \nabla_\theta \int p_\theta(x)\, f(x)\, dx$$

We can move the gradient inside (under suitable regularity conditions):

$$= \int \nabla_\theta p_\theta(x)\, f(x)\, dx$$

Now the key identity:

$$\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$$

This follows from the chain rule: $\nabla_\theta \log p_\theta(x) = \dfrac{\nabla_\theta p_\theta(x)}{p_\theta(x)}$.

Substituting:

$$= \int p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, f(x)\, dx = \mathbb{E}_{x \sim p_\theta}\big[f(x)\, \nabla_\theta \log p_\theta(x)\big]$$
This is remarkable: we’ve converted the gradient of an expectation into an expectation of a gradient—something we can estimate with samples!
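To see the trick working numerically, here is a minimal sketch. All the choices are illustrative: $p_\theta$ is a Gaussian with learnable mean $\theta$ and $f(x) = x^2$, so the analytic gradient is $\nabla_\theta\,\mathbb{E}[x^2] = 2\theta$.

```python
import numpy as np

# Score-function (log-probability trick) estimate of d/dtheta E_{x ~ N(theta, 1)}[x^2].
# Analytic answer: E[x^2] = theta^2 + 1, so the gradient is 2 * theta.
theta = 1.5
n_samples = 200_000

rng = np.random.default_rng(0)
x = rng.normal(loc=theta, scale=1.0, size=n_samples)

f = x ** 2                     # the quantity whose expectation we differentiate
score = x - theta              # grad_theta log N(x | theta, 1) = (x - theta) / sigma^2, sigma = 1

estimate = np.mean(f * score)  # E[ f(x) * grad_theta log p_theta(x) ]
print(f"score-function estimate: {estimate:.3f}")
print(f"analytic gradient:       {2 * theta:.3f}")
```

With enough samples the two numbers agree, but the estimate is noticeably noisy for small sample sizes, a preview of the variance issues discussed later in this chapter.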
Why Log-Probability?
The term $\nabla_\theta \log \pi_\theta(a \mid s)$ has a beautiful interpretation. It points in the direction that most increases the log-probability of action $a$ in state $s$.
- If we update in this direction, action $a$ becomes more likely
- If we update in the opposite direction, action $a$ becomes less likely
The policy gradient multiplies this by the return $G_t$:
- High return? Move toward actions that led to it
- Low return? Move away from those actions
It’s a formalization of “reinforce good behavior, discourage bad behavior.”
Think of $\nabla_\theta \log \pi_\theta(a \mid s)$ as an “arrow” pointing toward making action $a$ more likely. The return $G_t$ determines whether we follow that arrow (positive return) or go the opposite way (negative return).
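A quick way to see this arrow in action: the sketch below (a hypothetical three-action softmax policy, parameterized directly by its logits) takes one ascent step along $\nabla_\theta \log \pi_\theta(a)$ and checks that the chosen action’s probability goes up.

```python
import torch

# Tiny softmax policy over 3 actions, parameterized directly by its logits,
# illustrating that a step along grad log pi(a) makes action a more likely.
logits = torch.zeros(3, requires_grad=True)   # start from a uniform policy
action = 1

log_prob = torch.log_softmax(logits, dim=0)[action]
log_prob.backward()                           # logits.grad now holds grad log pi(a)

with torch.no_grad():
    before = torch.softmax(logits, dim=0)[action].item()
    logits += 0.5 * logits.grad               # gradient *ascent* step on log pi(a)
    after = torch.softmax(logits, dim=0)[action].item()

print(f"pi(a) before: {before:.3f}, after: {after:.3f}")  # probability increases
```

Scaling the step by a negative return would move the logits the other way, making the action less likely.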
The Policy Gradient Theorem
Now we can state the main result.
Mathematical Details
Policy Gradient Theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from time $t$.

Equivalently, using Q-values:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s,\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big]$$
This says: the gradient of expected return equals the expected “score function gradient” weighted by returns (or Q-values).
Let’s unpack what this means:
- Sample trajectories by running the policy
- For each step, compute $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$—the direction to make that action more likely
- Weight by the return $G_t$—good outcomes get positive weight, bad outcomes get negative weight
- Average over samples to estimate the gradient
The policy learns to repeat high-return actions and avoid low-return actions. Actions are credited with the returns that followed them.
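To make this recipe concrete, here is a minimal sketch of the Monte Carlo gradient estimate for a single trajectory. Everything here is made up for illustration (a linear softmax policy, random states and actions, hand-picked returns); the full REINFORCE implementation on a real environment follows later in this chapter.

```python
import torch

# One-trajectory Monte Carlo estimate of the policy gradient, with made-up data.
torch.manual_seed(0)
theta = torch.zeros(4, 2, requires_grad=True)        # linear policy: logits = state @ theta

states = torch.randn(5, 4)                           # 5 fake states
actions = torch.randint(0, 2, (5,))                  # 5 fake actions
returns = torch.tensor([9.0, 7.0, 5.0, 3.0, 1.0])    # hypothetical returns G_t

log_probs = torch.log_softmax(states @ theta, dim=1)[torch.arange(5), actions]
surrogate = (log_probs * returns).sum()              # sum_t G_t * log pi(a_t | s_t)
surrogate.backward()                                 # theta.grad is the gradient estimate

print(theta.grad)  # step in this direction (gradient ascent) to reinforce high-return actions
```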
Sketch of the policy gradient theorem derivation
The full derivation involves expanding the trajectory probability and carefully handling the terms. Here’s a sketch:
The probability of a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$ is:

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

Taking the log:

$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t} \log \pi_\theta(a_t \mid s_t) + \sum_{t} \log p(s_{t+1} \mid s_t, a_t)$$

When we take $\nabla_\theta$, only the policy terms survive (the dynamics don’t depend on $\theta$):

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Applying the log-probability trick to $J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
The “reward-to-go” form instead of full trajectory return comes from causality—actions can only affect future rewards, not past ones.
The REINFORCE Algorithm
REINFORCE is the simplest algorithm that implements the policy gradient theorem. It’s a Monte Carlo method: it uses complete episode returns.
The REINFORCE algorithm is surprisingly simple:
- Run an episode using the current policy, collecting states, actions, and rewards $(s_t, a_t, r_t)$
- Compute the return $G_t$ for each time step
- Update parameters: $\theta \leftarrow \theta + \alpha \sum_t G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
- Repeat
That’s it! We’re doing gradient ascent on expected return, using Monte Carlo estimates of the gradient.
Mathematical Details
REINFORCE Update:

For each episode, for each time step $t$:

$$\theta \leftarrow \theta + \alpha\, \gamma^t\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

where:
- The factor $\gamma^t$ accounts for discounting (sometimes omitted)
- $\alpha$ is the learning rate

For a softmax policy with linear preferences, $\nabla_\theta \log \pi_\theta(a \mid s)$ has a closed form. For neural network policies, we use automatic differentiation.
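As a sanity check on the softmax case, the sketch below uses an arbitrary feature vector and a linear-in-features softmax policy (purely for illustration) and compares the closed-form score against what autograd computes.

```python
import torch

# Closed-form score for a linear softmax policy, checked against autograd.
# Policy: pi(a|s) = softmax(theta @ x)_a, one parameter row per action.
torch.manual_seed(0)
n_actions, d = 3, 4
theta = torch.randn(n_actions, d, requires_grad=True)
x = torch.randn(d)   # feature vector for the current state
a = 1                # action whose log-probability we differentiate

probs = torch.softmax(theta @ x, dim=0)

# Autograd version of grad_theta log pi(a|s)
torch.log(probs[a]).backward()
autograd_score = theta.grad.clone()

# Closed form: d log pi(a|s) / d theta_b = (1[b == a] - pi(b|s)) * x
indicator = torch.zeros(n_actions)
indicator[a] = 1.0
closed_form = (indicator - probs.detach()).unsqueeze(1) * x.unsqueeze(0)

print(torch.allclose(autograd_score, closed_form, atol=1e-6))  # True
```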
Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import numpy as np
class PolicyNetwork(nn.Module):
"""Simple neural network policy for discrete actions."""
def __init__(self, state_dim, n_actions, hidden_dim=128):
super().__init__()
self.network = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, n_actions)
)
def forward(self, state):
"""Return action logits."""
return self.network(state)
def get_action(self, state):
"""Sample action from policy and return action + log probability."""
logits = self.forward(state)
dist = Categorical(logits=logits)
action = dist.sample()
log_prob = dist.log_prob(action)
return action.item(), log_prob
class REINFORCE:
"""REINFORCE algorithm for policy gradient learning."""
def __init__(self, state_dim, n_actions, lr=0.001, gamma=0.99):
self.gamma = gamma
self.policy = PolicyNetwork(state_dim, n_actions)
self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
# Storage for episode data
self.log_probs = []
self.rewards = []
def select_action(self, state):
"""Select action using current policy."""
state = torch.FloatTensor(state).unsqueeze(0)
action, log_prob = self.policy.get_action(state)
self.log_probs.append(log_prob)
return action
def store_reward(self, reward):
"""Store reward for current step."""
self.rewards.append(reward)
def compute_returns(self):
"""Compute discounted returns G_t for each step."""
returns = []
G = 0
# Work backwards through rewards
for reward in reversed(self.rewards):
G = reward + self.gamma * G
returns.insert(0, G)
returns = torch.FloatTensor(returns)
# Normalize returns (helps with training stability)
returns = (returns - returns.mean()) / (returns.std() + 1e-8)
return returns
def update(self):
"""Perform REINFORCE update after episode ends."""
returns = self.compute_returns()
# Compute policy gradient loss
# Loss = -sum(log_prob * return) (negative for gradient ascent)
policy_loss = []
for log_prob, G in zip(self.log_probs, returns):
policy_loss.append(-log_prob * G)
policy_loss = torch.stack(policy_loss).sum()
# Update policy
self.optimizer.zero_grad()
policy_loss.backward()
self.optimizer.step()
# Clear episode storage
self.log_probs = []
self.rewards = []
return policy_loss.item()

Training on CartPole
Implementation
import gymnasium as gym
def train_reinforce(env_name='CartPole-v1', num_episodes=1000):
"""Train REINFORCE on CartPole."""
env = gym.make(env_name)
state_dim = env.observation_space.shape[0]
n_actions = env.action_space.n
agent = REINFORCE(state_dim, n_actions, lr=0.01, gamma=0.99)
episode_rewards = []
for episode in range(num_episodes):
state, _ = env.reset()
done = False
total_reward = 0
while not done:
action = agent.select_action(state)
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
agent.store_reward(reward)
total_reward += reward
state = next_state
# Update policy after episode
agent.update()
episode_rewards.append(total_reward)
if episode % 100 == 0:
avg_reward = np.mean(episode_rewards[-100:])
print(f"Episode {episode}, Avg Reward (last 100): {avg_reward:.1f}")
return agent, episode_rewards
# Train the agent
agent, rewards = train_reinforce(num_episodes=1000)

You’ll notice the learning curve is noisy—episode rewards jump around even as the average improves. This is the high variance of Monte Carlo methods. Each episode gives a different return, leading to noisy gradient estimates.
The High Variance Problem
REINFORCE works, but it has a fundamental limitation: high variance.
Consider what happens in a single episode. The return might be 100. In the next episode from the same state, it might be 50 due to random outcomes. This variance in returns directly translates to variance in our gradient estimates.
High variance means:
- Learning is noisy and unstable
- We need many episodes for reliable updates
- Convergence is slow
The problem gets worse with:
- Longer episodes (more randomness compounds)
- Stochastic environments
- High-dimensional action spaces
Mathematical Details
The gradient estimate from a single episode is:

$$\hat{g} = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$$

The variance of this estimate depends on the variance of $G_t$. Even for the same state-action pair, $G_t$ varies across episodes because:
- The policy is stochastic (different actions in the future)
- The environment may be stochastic (different transitions)
- Different episode lengths
This variance doesn’t decrease with more time steps in an episode—it can actually increase!
Baselines Reduce Variance
Here’s a remarkable fact: we can subtract a baseline from the returns without changing the expected gradient—but it can dramatically reduce variance.
Mathematical Details
The policy gradient with a baseline $b(s)$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)\right]$$

Key insight: This has the same expectation as without the baseline, but potentially much lower variance.

Why doesn’t the baseline add bias? Because:

$$\mathbb{E}_{a \sim \pi_\theta}\big[b(s)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$

The second equality holds because $\pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = \nabla_\theta \pi_\theta(a \mid s)$, and the last step uses $\sum_a \pi_\theta(a \mid s) = 1$.
Think of the baseline as a “reference point” for returns. Without a baseline:
- Return of 100 → strong positive update
- Return of 95 → still strong positive update
With a baseline of 97:
- Return of 100 → modest positive update (100 - 97 = 3)
- Return of 95 → modest negative update (95 - 97 = -2)
The baseline helps distinguish “good” from “average” rather than just “positive” from “negative.” This is especially important when all returns are positive (or all negative)!
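Here is a minimal numerical sketch of that effect on a made-up one-state problem with two actions and noisy rewards: the average gradient (taken with respect to the logit of action 1) is the same with or without the baseline, but its variance drops sharply.

```python
import numpy as np

# Empirical check: a baseline leaves the expected gradient unchanged
# but reduces its variance. All numbers here are invented.
rng = np.random.default_rng(0)
n = 100_000

probs = np.array([0.5, 0.5])              # current policy pi(a)
mean_reward = np.array([10.0, 11.0])      # action 1 is slightly better
actions = rng.choice(2, size=n, p=probs)
returns = mean_reward[actions] + rng.normal(0.0, 1.0, size=n)

# Score w.r.t. the logit of action 1: d log pi(a) / d z_1 = 1[a == 1] - pi(1)
score = (actions == 1).astype(float) - probs[1]

baseline = 10.5                           # a constant baseline near the average return
g_plain = score * returns
g_base = score * (returns - baseline)

print(f"no baseline:   mean {g_plain.mean():.3f}, variance {g_plain.var():.2f}")
print(f"with baseline: mean {g_base.mean():.3f}, variance {g_base.var():.2f}")
```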
REINFORCE with Baseline
Implementation
class REINFORCEWithBaseline:
"""REINFORCE with a learned value function baseline."""
def __init__(self, state_dim, n_actions, lr=0.001, gamma=0.99):
self.gamma = gamma
# Policy network (actor)
self.policy = PolicyNetwork(state_dim, n_actions)
# Value network (baseline/critic)
self.value_net = nn.Sequential(
nn.Linear(state_dim, 128),
nn.ReLU(),
nn.Linear(128, 128),
nn.ReLU(),
nn.Linear(128, 1)
)
self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr)
self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=lr)
self.log_probs = []
self.values = []
self.rewards = []
def select_action(self, state):
"""Select action and store log prob and value estimate."""
state_tensor = torch.FloatTensor(state).unsqueeze(0)
# Get action
action, log_prob = self.policy.get_action(state_tensor)
self.log_probs.append(log_prob)
# Get baseline value
value = self.value_net(state_tensor)
self.values.append(value)
return action
def store_reward(self, reward):
self.rewards.append(reward)
def update(self):
"""Update policy and value network."""
# Compute returns
returns = []
G = 0
for reward in reversed(self.rewards):
G = reward + self.gamma * G
returns.insert(0, G)
returns = torch.FloatTensor(returns)
# Stack values
values = torch.cat(self.values).squeeze()
# Advantages = returns - baseline
advantages = returns - values.detach()
# Normalize advantages
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
# Policy loss (REINFORCE with baseline)
policy_loss = []
for log_prob, advantage in zip(self.log_probs, advantages):
policy_loss.append(-log_prob * advantage)
policy_loss = torch.stack(policy_loss).sum()
# Value loss (MSE between predicted values and returns)
value_loss = nn.functional.mse_loss(values, returns)
# Update policy
self.policy_optimizer.zero_grad()
policy_loss.backward()
self.policy_optimizer.step()
# Update value network
self.value_optimizer.zero_grad()
value_loss.backward()
self.value_optimizer.step()
# Clear storage
self.log_probs = []
self.values = []
self.rewards = []
return policy_loss.item(), value_loss.item()

The optimal baseline is close to the value function $V^\pi(s)$. That’s why we learn a value network as our baseline. This is our first step toward actor-critic methods—the topic of the next chapter.
Reward-to-Go: Causality Matters
There’s another variance reduction technique: using reward-to-go instead of the full episode return.
In basic REINFORCE, we weight each $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ by the full return $R(\tau)$. But action $a_t$ can only affect rewards from time $t$ onward—not past rewards!

Including past rewards adds noise without adding information. Instead, we should use:

$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$$

This is the return from time $t$, not the full episode return. It’s what the action at time $t$ actually influenced.
Mathematical Details
The reward-to-go version of the policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

Compare to using the full return:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$

Both have the same expectation (the early rewards contribute zero in expectation), but reward-to-go has lower variance.
Our implementation already uses reward-to-go—we compute $G_t$ for each time step, not a single return for the whole episode. This is standard practice.
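As a small illustration, the sketch below uses an invented reward sequence (with $\gamma = 1$ for clarity) to contrast the weight each time step receives under the full-return and reward-to-go formulations.

```python
import numpy as np

# Reward-to-go vs. full-return weighting on a made-up episode (gamma = 1).
rewards = np.array([1.0, 0.0, 2.0, 0.0, 1.0])

full_return = float(rewards.sum())                    # R(tau): same weight at every step
reward_to_go = np.cumsum(rewards[::-1])[::-1]         # G_t: sum of rewards from t onward

print("full-return weights: ", [full_return] * len(rewards))  # [4.0, 4.0, 4.0, 4.0, 4.0]
print("reward-to-go weights:", reward_to_go.tolist())          # [4.0, 3.0, 3.0, 1.0, 1.0]
```

Early actions keep their full credit, while later actions are no longer credited for rewards that arrived before they were taken.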
REINFORCE in Action
Let’s compare vanilla REINFORCE with the baseline version:
Implementation
def compare_reinforce_variants(num_episodes=1000, num_seeds=5):
"""Compare vanilla REINFORCE vs. with baseline."""
import matplotlib.pyplot as plt
results = {'vanilla': [], 'baseline': []}
for seed in range(num_seeds):
# Set random seed
torch.manual_seed(seed)
np.random.seed(seed)
# Train vanilla REINFORCE
env = gym.make('CartPole-v1')
agent = REINFORCE(4, 2, lr=0.01, gamma=0.99)
rewards = []
for _ in range(num_episodes):
state, _ = env.reset()
done = False
ep_reward = 0
while not done:
action = agent.select_action(state)
state, reward, term, trunc, _ = env.step(action)
agent.store_reward(reward)
ep_reward += reward
done = term or trunc
agent.update()
rewards.append(ep_reward)
results['vanilla'].append(rewards)
# Train REINFORCE with baseline
env = gym.make('CartPole-v1')
agent = REINFORCEWithBaseline(4, 2, lr=0.01, gamma=0.99)
rewards = []
for _ in range(num_episodes):
state, _ = env.reset()
done = False
ep_reward = 0
while not done:
action = agent.select_action(state)
state, reward, term, trunc, _ = env.step(action)
agent.store_reward(reward)
ep_reward += reward
done = term or trunc
agent.update()
rewards.append(ep_reward)
results['baseline'].append(rewards)
# Plot results
def smooth(data, window=50):
return np.convolve(data, np.ones(window)/window, mode='valid')
plt.figure(figsize=(10, 6))
for name, runs in results.items():
mean = np.mean(runs, axis=0)
std = np.std(runs, axis=0)
smoothed = smooth(mean)
x = np.arange(len(smoothed))
plt.plot(x, smoothed, label=name)
plt.fill_between(x, smooth(mean - std), smooth(mean + std), alpha=0.2)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('REINFORCE: Vanilla vs. With Baseline')
plt.legend()
plt.savefig('reinforce_comparison.png')
plt.show()
# Run comparison
compare_reinforce_variants()

REINFORCE, even with a baseline, can still be unstable. The learning rate is particularly important—too high and the policy can collapse. Start with small learning rates (0.001 to 0.01) and increase carefully.
Summary
Key Takeaways
- The policy gradient theorem shows that $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]$
- The log-probability trick converts the gradient of an expectation into an expectation of a gradient
- REINFORCE is a Monte Carlo policy gradient algorithm: run episodes, compute returns, update policy
- High variance is REINFORCE’s main weakness—gradient estimates are noisy
- Baselines reduce variance without adding bias: use $G_t - b(s_t)$ instead of $G_t$
- Reward-to-go further reduces variance by only using future rewards
- A learned value function makes an effective baseline
REINFORCE is elegant and it works, but it’s slow and noisy. We wait for entire episodes, use Monte Carlo returns with high variance, and throw away experience after one update.
What if we could:
- Learn from single steps (like TD learning)?
- Use a learned value function to estimate returns (lower variance)?
- Update more frequently?
That’s exactly what actor-critic methods do—combining the best of policy gradients and value-based learning.
Exercises
Conceptual Questions
- Explain the log-probability trick in your own words. Why is it necessary for policy gradients? What would go wrong without it?
- Why does REINFORCE have high variance? Identify at least two sources of randomness that contribute to variance in gradient estimates.
- Prove that subtracting a baseline doesn’t add bias. Specifically, show that $\mathbb{E}_{a \sim \pi_\theta}\big[b(s)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] = 0$.
- What is causality in the context of policy gradients? Why does reward-to-go reduce variance?
- Compare REINFORCE to Q-learning. Which is on-policy? Which requires complete episodes? Which typically has higher variance?
Coding Challenges
- Implement REINFORCE from scratch and train on CartPole. Plot the learning curve over 1000 episodes. What’s the average episode length after training?
- Add a simple baseline (running average of recent returns) and compare the learning curve to vanilla REINFORCE. Does it help?
- Implement reward-to-go (if not already using it) and compare to using the full episode return for each step. Measure the variance of gradient estimates.
Exploration
- Learning rate sensitivity. Run REINFORCE with different learning rates (0.001, 0.01, 0.1). What do you observe? Why is policy gradient particularly sensitive to learning rate?
- Entropy exploration. Add an entropy bonus to the policy loss: $\mathcal{L} = -\sum_t \big( G_t \log \pi_\theta(a_t \mid s_t) + \beta\, \mathcal{H}[\pi_\theta(\cdot \mid s_t)] \big)$. How does this affect exploration and learning?