The Variance Problem
REINFORCE is elegant and simple, but it suffers from a critical weakness: high variance. The gradient estimates can vary wildly from episode to episode, making learning slow and unstable. Understanding why this happens is key to appreciating the improvements that come next.
What Is the Problem?
Imagine you’re learning to play basketball. You take 10 shots and make 6 of them. Was that a good shooting session?
It depends. If you’re usually a 70% shooter, making 60% is below average - maybe your form was off. If you’re usually a 30% shooter, 60% is excellent - something went right.
The problem is: any single session is a noisy estimate of your true ability. You might have been lucky or unlucky.
REINFORCE has the same issue. Each episode gives you one sample of “how good these actions were.” But that sample is contaminated by:
- Random environment outcomes
- Random action selections (stochastic policy)
- Lucky or unlucky reward sequences
One episode is not enough to know if an action is truly good or just got lucky.
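The basketball analogy can be made concrete with a quick simulation. This sketch assumes an arbitrary "true" shooting percentage of 40% and shows how much individual 10-shot sessions fluctuate around it:

```python
import numpy as np

rng = np.random.default_rng(0)

true_pct = 0.40          # the shooter's actual skill (an assumption for this demo)
shots_per_session = 10

# Simulate 1000 practice sessions for the same shooter
sessions = rng.binomial(shots_per_session, true_pct, size=1000) / shots_per_session

print(f"True ability:      {true_pct:.0%}")
print(f"Session mean:      {sessions.mean():.0%}")
print(f"Session std:       {sessions.std():.0%}")
print(f"Range of sessions: {sessions.min():.0%} to {sessions.max():.0%}")
```

The average over many sessions recovers the true ability, but any single session can be wildly off. One episode of REINFORCE is exactly one session.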
Sources of Variance
1. Stochastic Environment
Even if you take the same actions, the environment might give different outcomes.
Consider a robot learning to walk. The same leg movement might:
- Work perfectly on a stable surface
- Cause a stumble on an uneven surface
- Lead to a fall if there’s an unexpected bump
The return varies due to environment randomness, not action quality.
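A toy simulation illustrates this: the action sequence below is held fixed, yet the return still varies because of environment noise. The reward model (base reward 1.0 per step, a 5% chance of a -5 "stumble") is an assumption made up for this demo:

```python
import numpy as np

rng = np.random.default_rng(1)

def rollout_fixed_actions(n_steps=50):
    """Run the SAME action sequence every time; only environment noise differs.
    Toy reward model: +1 per step, with a 5% chance of a -5 'stumble'."""
    rewards = np.ones(n_steps)
    stumbles = rng.random(n_steps) < 0.05   # random bumps in the environment
    rewards[stumbles] = -5.0
    return rewards.sum()

returns = np.array([rollout_fixed_actions() for _ in range(1000)])
print(f"Identical actions: return mean = {returns.mean():.1f}, std = {returns.std():.1f}")
```

The spread in returns here is pure environment randomness; no change in action quality could explain it.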
2. Stochastic Policy
A stochastic policy samples different actions in the same state. This is good for exploration, but adds variance.
In state $s$, your policy might:
- Sample action A (60% probability) and get reward 10
- Sample action B (40% probability) and get reward 5
Over many episodes, the average evens out. But each individual episode has different actions, leading to different returns.
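The two-action example above is easy to simulate directly, which shows both the stable long-run average and the per-episode spread:

```python
import numpy as np

rng = np.random.default_rng(2)

# As in the example above: action A (p=0.6, reward 10), action B (p=0.4, reward 5)
rewards = rng.choice([10.0, 5.0], p=[0.6, 0.4], size=10_000)

print(f"Expected reward: {0.6 * 10 + 0.4 * 5}")    # 8.0 by direct calculation
print(f"Sample mean:     {rewards.mean():.2f}")     # converges to 8.0
print(f"Sample std:      {rewards.std():.2f}")      # nonzero: per-episode spread
```

Even in this one-step problem with only two outcomes, individual samples deviate from the mean; real episodes compound many such choices.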
3. Long Episodes and Credit Assignment
In a long episode, which action caused the final outcome?
Imagine a 100-step episode where you win at the end. REINFORCE credits ALL 100 actions equally with the positive return. But maybe:
- Only the last 10 actions mattered
- Early actions were irrelevant
- Some middle actions were actually mistakes that you recovered from
The return contains all future rewards, making it hard to isolate which action deserves credit.
The return from timestep $t$ includes rewards from all future steps:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
This accumulates randomness from:
- Stochastic rewards at each step
- Stochastic transitions affecting future states
- Stochastic future action selections
The variance compounds over the episode length.
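A simple simulation shows the compounding. Assuming each step contributes an independent reward with standard deviation 1 (a deliberate simplification), the undiscounted return's variance grows linearly with episode length $T$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Each step contributes an i.i.d. reward with std 1 (simplifying assumption).
# The undiscounted return is their sum, so Var(G) grows linearly with T
# and std(G) grows like sqrt(T).
for T in [10, 100, 1000]:
    returns = rng.normal(0, 1, size=(5000, T)).sum(axis=1)
    print(f"T={T:4d}: return std = {returns.std():.1f}  (theory: {np.sqrt(T):.1f})")
```

A 100-step episode is not 10x noisier than a 10-step one in standard deviation, but the variance the gradient estimator sees really is 10x larger.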
4. Multiplying by Returns
The gradient estimate is:

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t$$

The variance of this product depends on the variance of $G_t$. If returns range from 0 to 1000, the gradient estimates will also vary by orders of magnitude.
High-magnitude returns create high-variance gradients, even if they’re equally “good” relative to the task.
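A toy calculation makes the scaling explicit. All numbers below are illustrative assumptions, not from a real policy: the "score" stands in for $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$, and the returns keep the same relative spread while their magnitude grows:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy stand-in for grad-log-prob values (the 'score function')
score = rng.normal(0, 1, size=10_000)

for scale in [1, 10, 100]:
    # Returns with the same *relative* spread but different magnitude
    G = scale * (1 + 0.5 * rng.normal(0, 1, size=10_000))
    grad_est = score * G
    print(f"return scale {scale:3d}: gradient-estimate std = {grad_est.std():.1f}")
```

Multiplying returns by 10 multiplies the gradient-estimate standard deviation by roughly 10 as well, even though nothing about the task got "harder".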
Visualizing the Problem
Consider two training runs of REINFORCE on the same task:
Run 1 (lucky episodes):
- Episode 1: Return = 150 (got lucky early)
- Episode 2: Return = 180 (good actions + luck)
- Episode 3: Return = 90 (unlucky despite good actions)
Run 2 (unlucky episodes):
- Episode 1: Return = 50 (unlucky start)
- Episode 2: Return = 70 (mediocre luck)
- Episode 3: Return = 200 (finally got lucky)
Both runs might have similar underlying policy quality, but the gradient updates are completely different. Run 1 strongly reinforces early actions; Run 2 strongly reinforces late actions.
import numpy as np
import matplotlib.pyplot as plt

def simulate_gradient_variance(n_episodes=100, n_trials=5):
    """
    Simulate how gradient estimates vary across episodes.

    This is a simplified demonstration - we show that
    different episodes give very different gradient signals.
    """
    np.random.seed(42)

    # Simulated "true" gradient direction (what we want to learn)
    true_gradient = np.array([1.0, 0.5])

    all_estimates = []
    for trial in range(n_trials):
        estimates = []
        for ep in range(n_episodes):
            # Simulate noisy return (high variance)
            return_noise = np.random.normal(0, 50)
            noisy_return = 100 + return_noise

            # Simulate noisy gradient direction
            direction_noise = np.random.normal(0, 0.5, size=2)
            noisy_direction = true_gradient + direction_noise
            noisy_direction = noisy_direction / np.linalg.norm(noisy_direction)

            # Gradient estimate = direction * return
            gradient_estimate = noisy_direction * noisy_return
            estimates.append(gradient_estimate)
        all_estimates.append(estimates)

    return np.array(all_estimates)

def plot_gradient_variance(estimates):
    """Plot gradient estimates to show variance."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Plot 1: Gradient magnitude over episodes
    magnitudes = np.linalg.norm(estimates[0], axis=1)
    axes[0].plot(magnitudes, alpha=0.7)
    axes[0].axhline(y=np.mean(magnitudes), color='r', linestyle='--', label='Mean')
    axes[0].fill_between(
        range(len(magnitudes)),
        np.mean(magnitudes) - np.std(magnitudes),
        np.mean(magnitudes) + np.std(magnitudes),
        alpha=0.3, color='r'
    )
    axes[0].set_xlabel('Episode')
    axes[0].set_ylabel('Gradient Magnitude')
    axes[0].set_title('Gradient Magnitude Variance')
    axes[0].legend()

    # Plot 2: Gradient directions
    for i in range(min(50, len(estimates[0]))):
        g = estimates[0][i]
        axes[1].arrow(0, 0, g[0]/20, g[1]/20, head_width=0.2, alpha=0.3, color='blue')
    axes[1].set_xlim(-10, 10)
    axes[1].set_ylim(-10, 10)
    axes[1].set_xlabel('Gradient dim 1')
    axes[1].set_ylabel('Gradient dim 2')
    axes[1].set_title('Gradient Direction Variance')
    axes[1].axhline(y=0, color='k', linewidth=0.5)
    axes[1].axvline(x=0, color='k', linewidth=0.5)

    plt.tight_layout()
    plt.show()

# Run simulation
estimates = simulate_gradient_variance()
# plot_gradient_variance(estimates)  # Uncomment to visualize

# Print variance statistics
gradient_magnitudes = np.linalg.norm(estimates[0], axis=1)
print(f"Mean gradient magnitude: {np.mean(gradient_magnitudes):.2f}")
print(f"Std gradient magnitude: {np.std(gradient_magnitudes):.2f}")
print(f"Coefficient of variation: {np.std(gradient_magnitudes)/np.mean(gradient_magnitudes):.2f}")

Impact on Learning
High variance has several negative effects:
- Slow learning: Each update is unreliable, so we need many samples to make progress
- Requires low learning rate: High-variance gradients with high learning rates cause wild oscillations or divergence
- Unstable training: Performance can improve then suddenly collapse due to a series of unlucky gradient estimates
- Sample inefficiency: We need many more episodes than if gradients were accurate
CartPole with REINFORCE
On CartPole, REINFORCE typically needs 500-1000 episodes to solve the task. With the same number of environment steps, DQN (with replay buffer) can solve it faster because it:
- Reuses experience (sample efficiency)
- Bootstraps from value estimates (lower variance)
The high variance of REINFORCE’s Monte Carlo returns means each gradient update is noisy, requiring more updates overall.
Measuring Variance
We can quantify variance by looking at gradient estimates across episodes.
For a single parameter $\theta_i$, the gradient estimate from one episode is:

$$\hat{g}_i = \sum_{t=0}^{T-1} \frac{\partial \log \pi_\theta(a_t \mid s_t)}{\partial \theta_i} \, G_t$$

The variance across episodes is:

$$\mathrm{Var}[\hat{g}_i] = \mathbb{E}[\hat{g}_i^2] - \left(\mathbb{E}[\hat{g}_i]\right)^2$$
This variance is typically high because:
- $G_t$ has high variance (it accumulates many random rewards)
- The sum over timesteps $t$ can have correlated terms
- Different episodes visit different states
import torch

def estimate_gradient_variance(policy, env, n_episodes=100, gamma=0.99):
    """
    Estimate variance of policy gradient by sampling many episodes.

    Returns:
        mean_gradient: Average gradient estimate
        gradient_variance: Per-parameter variance of gradient estimates
    """
    all_gradients = []

    for _ in range(n_episodes):
        # Collect one episode
        states, actions, rewards = [], [], []
        state, _ = env.reset()
        done = False

        while not done:
            state_t = torch.tensor(state, dtype=torch.float32)
            states.append(state_t)
            with torch.no_grad():
                action = policy.sample(state_t.unsqueeze(0))
            actions.append(action)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            rewards.append(reward)

        # Compute discounted returns (working backwards)
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns, dtype=torch.float32)

        # Compute the REINFORCE gradient for this episode
        policy.zero_grad()
        states_tensor = torch.stack(states)
        actions_tensor = torch.tensor(actions)
        log_probs = policy.log_prob(states_tensor, actions_tensor)
        loss = -(log_probs * returns).sum()
        loss.backward()

        # Store the flattened gradient vector
        episode_gradient = []
        for param in policy.parameters():
            if param.grad is not None:
                episode_gradient.append(param.grad.clone().flatten())
        episode_gradient = torch.cat(episode_gradient)
        all_gradients.append(episode_gradient)

    # Compute statistics across episodes
    all_gradients = torch.stack(all_gradients)
    mean_gradient = all_gradients.mean(dim=0)
    gradient_variance = all_gradients.var(dim=0)
    return mean_gradient, gradient_variance

Why Does This Matter?
High variance is the fundamental limitation of REINFORCE. It explains:
- Why learning is slow: We need many samples to average out the noise
- Why hyperparameters are sensitive: A learning rate that works for low-noise gradients causes instability with high-noise gradients
- Why we need improvements: Baselines, actor-critic methods, and other techniques exist primarily to reduce variance
Understanding the variance problem motivates everything that comes next in policy gradient methods.
Preview: Variance Reduction
The good news: we can reduce variance dramatically while keeping the gradient unbiased. The key techniques are:
- Baselines (next section): Subtract a baseline from returns to reduce their magnitude without changing the expected gradient
- Actor-Critic (next chapter): Use value function estimates instead of Monte Carlo returns, trading some bias for much lower variance
- GAE (later): Generalized Advantage Estimation blends MC and TD to optimize the bias-variance tradeoff
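A tiny numerical preview of the baseline idea, using made-up numbers: the "score" stands in for grad-log-prob samples, and returns are centered around 100. Subtracting a constant baseline leaves the mean estimate unchanged but shrinks its spread dramatically:

```python
import numpy as np

rng = np.random.default_rng(5)

score = rng.normal(0, 1, size=100_000)          # toy grad-log-prob samples
G = 100 + 20 * rng.normal(0, 1, size=100_000)   # toy returns centered at 100

raw = score * G                                  # REINFORCE-style estimate
baselined = score * (G - G.mean())               # subtract a constant baseline

print(f"Mean (raw):       {raw.mean():.2f}")
print(f"Mean (baselined): {baselined.mean():.2f}")   # same expectation
print(f"Std  (raw):       {raw.std():.1f}")
print(f"Std  (baselined): {baselined.std():.1f}")    # far smaller
```

The mean (the quantity we actually want) is untouched; only the noise around it shrinks. The next section develops this properly.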
The variance problem isn’t a flaw to be ashamed of - it’s a fundamental property of Monte Carlo estimation. Recognizing it helps us understand why modern algorithms like PPO include so many variance reduction techniques.
Summary
The variance problem in REINFORCE arises from:
- Stochastic environments: Same actions lead to different outcomes
- Stochastic policies: Different actions sampled each episode
- Long time horizons: Returns accumulate randomness over many steps
- Return-weighted gradients: High returns create high-magnitude gradients
This leads to:
- Slow learning
- Unstable training
- Need for low learning rates
- Sample inefficiency
The next section introduces baselines - our first tool for taming this variance.