Policy Gradient Methods • Part 3 of 4
The Variance Problem

Why REINFORCE needs help

REINFORCE is elegant and simple, but it suffers from a critical weakness: high variance. The gradient estimates can vary wildly from episode to episode, making learning slow and unstable. Understanding why this happens is key to appreciating the improvements that come next.

What Is the Problem?

Imagine you’re learning to play basketball. You take 10 shots and make 6 of them. Was that a good shooting session?

It depends. If you’re usually a 70% shooter, making 60% is below average - maybe your form was off. If you’re usually a 30% shooter, 60% is excellent - something went right.

The problem is: any single session is a noisy estimate of your true ability. You might have been lucky or unlucky.

REINFORCE has the same issue. Each episode gives you one sample of “how good these actions were.” But that sample is contaminated by:

  • Random environment outcomes
  • Random action selections (stochastic policy)
  • Lucky or unlucky reward sequences

One episode is not enough to know if an action is truly good or just got lucky.
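The basketball analogy can be made concrete with a quick simulation (a toy sketch, not RL code): a shooter with a fixed 60% ability still produces wildly different 10-shot sessions, even though the average over many sessions recovers the true skill.

```python
import numpy as np

rng = np.random.default_rng(0)

# A shooter whose true ability is fixed at 60%.
true_skill = 0.6
n_shots = 10

# 1000 independent sessions of 10 shots each.
sessions = rng.binomial(n=n_shots, p=true_skill, size=1000) / n_shots

print(f"True skill:              {true_skill}")
print(f"Worst / best session:    {sessions.min():.1f} / {sessions.max():.1f}")
print(f"Std of a single session: {sessions.std():.2f}")
print(f"Mean over 1000 sessions: {sessions.mean():.2f}")
```

A single session (one "episode") is an unbiased but noisy estimate of ability; only averaging many of them gets close to the truth. REINFORCE faces exactly this situation with each episode's return.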

Sources of Variance

1. Stochastic Environment

Even if you take the same actions, the environment might give different outcomes.

Consider a robot learning to walk. The same leg movement might:

  • Work perfectly on a stable surface
  • Cause a stumble on an uneven surface
  • Lead to a fall if there’s an unexpected bump

The return varies due to environment randomness, not action quality.

2. Stochastic Policy

A stochastic policy samples different actions in the same state. This is good for exploration, but adds variance.

In state $s$, your policy might:

  • Sample action A (60% probability) and get reward 10
  • Sample action B (40% probability) and get reward 5

Over many episodes, the average evens out. But each individual episode has different actions, leading to different returns.
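For this two-action example the variance contributed by action sampling alone can be computed exactly (a minimal sketch using only the numbers above):

```python
import numpy as np

# Action A: probability 0.6, reward 10; action B: probability 0.4, reward 5.
probs = np.array([0.6, 0.4])
rewards = np.array([10.0, 5.0])

mean = (probs * rewards).sum()               # E[R] = 0.6*10 + 0.4*5
var = (probs * (rewards - mean) ** 2).sum()  # Var[R] from action sampling alone

print(f"Expected return: {mean:.1f}")  # 8.0
print(f"Variance:        {var:.1f}")   # 6.0
```

Even with a fixed state and a deterministic environment, the stochastic policy by itself spreads single-episode returns around the mean of 8 with variance 6.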

3. Long Episodes and Credit Assignment

In a long episode, which action caused the final outcome?

Imagine a 100-step episode where you win at the end. REINFORCE credits ALL 100 actions equally with the positive return. But maybe:

  • Only the last 10 actions mattered
  • Early actions were irrelevant
  • Some middle actions were actually mistakes that you recovered from

The return $G_t$ contains all future rewards, making it hard to isolate which action deserves credit.

Mathematical Details

The return from timestep $t$ includes rewards from all future steps:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T$$

This accumulates randomness from:

  • Stochastic rewards at each step
  • Stochastic transitions affecting future states
  • Stochastic future action selections

The variance compounds over the episode length.
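A minimal sketch of this compounding effect, under the simplifying assumption of independent coin-flip rewards at every step and no discounting: the variance of the return grows linearly with the horizon $T$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each step's reward is an independent fair coin flip in {0, 1},
# so Var[r] = 0.25 and the undiscounted return G_0 has Var[G_0] = T * 0.25.
for T in [10, 100, 1000]:
    returns = rng.integers(0, 2, size=(10_000, T)).sum(axis=1)
    print(f"T={T:4d}  empirical Var[G_0] = {returns.var():7.1f}  (theory: {T * 0.25})")
```

Ten times the horizon means ten times the return variance, which is why long episodes make Monte Carlo estimation so noisy. Discounting ($\gamma < 1$) dampens but does not eliminate this growth.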

4. Multiplying by Returns

Mathematical Details

The gradient estimate is:

$$\hat{g} = \nabla_\theta \log \pi_\theta(a|s) \cdot G_t$$

The variance of this product depends on the variance of $G_t$. If returns range from 0 to 1000, the gradient estimates will also vary by orders of magnitude.

High-magnitude returns create high-variance gradient estimates, even when those returns are just as "good" relative to the task's reward scale.
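To see this numerically, here is a toy sketch (not REINFORCE itself): a mean-zero stand-in for the score term is multiplied by returns of two different scales, and the scale alone changes the variance of the product by orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Stand-in for the score term grad log pi: mean zero, unit variance.
score = rng.choice([-1.0, 1.0], size=n)

# Same spread of returns, shifted to two different scales.
returns_small = rng.normal(0.0, 10.0, size=n)     # returns near 0
returns_large = rng.normal(1000.0, 10.0, size=n)  # returns near 1000

g_small = score * returns_small
g_large = score * returns_large

print(f"Var of estimate, returns near 0:    {g_small.var():12.0f}")
print(f"Var of estimate, returns near 1000: {g_large.var():12.0f}")
```

The spread of the returns is identical in both cases; only their offset differs, yet the second estimator's variance is vastly larger. This is precisely the gap a baseline will close.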

Visualizing the Problem

Consider two training runs of REINFORCE on the same task:

Run 1 (lucky episodes):

  • Episode 1: Return = 150 (got lucky early)
  • Episode 2: Return = 180 (good actions + luck)
  • Episode 3: Return = 90 (unlucky despite good actions)

Run 2 (unlucky episodes):

  • Episode 1: Return = 50 (unlucky start)
  • Episode 2: Return = 70 (mediocre luck)
  • Episode 3: Return = 200 (finally got lucky)

Both runs might have similar underlying policy quality, but the gradient updates are completely different. Run 1 strongly reinforces early actions; Run 2 strongly reinforces late actions.

</>Implementation
import numpy as np
import matplotlib.pyplot as plt

def simulate_gradient_variance(n_episodes=100, n_trials=5):
    """
    Simulate how gradient estimates vary across episodes.

    This is a simplified demonstration - we show that
    different episodes give very different gradient signals.
    """
    np.random.seed(42)

    # Simulated "true" gradient direction (what we want to learn)
    true_gradient = np.array([1.0, 0.5])

    all_estimates = []

    for trial in range(n_trials):
        estimates = []
        for ep in range(n_episodes):
            # Simulate noisy return (high variance)
            return_noise = np.random.normal(0, 50)
            noisy_return = 100 + return_noise

            # Simulate noisy gradient direction
            direction_noise = np.random.normal(0, 0.5, size=2)
            noisy_direction = true_gradient + direction_noise
            noisy_direction = noisy_direction / np.linalg.norm(noisy_direction)

            # Gradient estimate = direction * return
            gradient_estimate = noisy_direction * noisy_return
            estimates.append(gradient_estimate)

        all_estimates.append(estimates)

    return np.array(all_estimates)


def plot_gradient_variance(estimates):
    """Plot gradient estimates to show variance."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Plot 1: Gradient magnitude over episodes
    magnitudes = np.linalg.norm(estimates[0], axis=1)
    axes[0].plot(magnitudes, alpha=0.7)
    axes[0].axhline(y=np.mean(magnitudes), color='r', linestyle='--', label='Mean')
    axes[0].fill_between(
        range(len(magnitudes)),
        np.mean(magnitudes) - np.std(magnitudes),
        np.mean(magnitudes) + np.std(magnitudes),
        alpha=0.3, color='r'
    )
    axes[0].set_xlabel('Episode')
    axes[0].set_ylabel('Gradient Magnitude')
    axes[0].set_title('Gradient Magnitude Variance')
    axes[0].legend()

    # Plot 2: Gradient directions
    for i in range(min(50, len(estimates[0]))):
        g = estimates[0][i]
        axes[1].arrow(0, 0, g[0]/20, g[1]/20, head_width=0.2, alpha=0.3, color='blue')
    axes[1].set_xlim(-10, 10)
    axes[1].set_ylim(-10, 10)
    axes[1].set_xlabel('Gradient dim 1')
    axes[1].set_ylabel('Gradient dim 2')
    axes[1].set_title('Gradient Direction Variance')
    axes[1].axhline(y=0, color='k', linewidth=0.5)
    axes[1].axvline(x=0, color='k', linewidth=0.5)

    plt.tight_layout()
    plt.show()


# Run simulation
estimates = simulate_gradient_variance()
# plot_gradient_variance(estimates)  # Uncomment to visualize

# Print variance statistics
gradient_magnitudes = np.linalg.norm(estimates[0], axis=1)
print(f"Mean gradient magnitude: {np.mean(gradient_magnitudes):.2f}")
print(f"Std gradient magnitude: {np.std(gradient_magnitudes):.2f}")
print(f"Coefficient of variation: {np.std(gradient_magnitudes)/np.mean(gradient_magnitudes):.2f}")

Impact on Learning

📌Example

CartPole with REINFORCE

On CartPole, REINFORCE typically needs 500-1000 episodes to solve the task. DQN (with a replay buffer) often solves it in fewer environment steps because it:

  • Reuses experience (sample efficiency)
  • Bootstraps from value estimates (lower variance)

The high variance of REINFORCE’s Monte Carlo returns means each gradient update is noisy, requiring more updates overall.

Measuring Variance

Mathematical Details

We can quantify variance by looking at gradient estimates across episodes.

For a single parameter $\theta_i$, the gradient estimate is:

$$\hat{g}_i = \sum_t \frac{\partial \log \pi_\theta(a_t|s_t)}{\partial \theta_i} \cdot G_t$$

The variance is:

$$\text{Var}[\hat{g}_i] = \mathbb{E}[\hat{g}_i^2] - \mathbb{E}[\hat{g}_i]^2$$

This variance is typically high because:

  • $G_t$ has high variance (accumulates many random rewards)
  • The sum over $t$ can have correlated terms
  • Different episodes visit different states
</>Implementation
import torch

def estimate_gradient_variance(policy, env, n_episodes=100, gamma=0.99):
    """
    Estimate variance of policy gradient by sampling many episodes.

    Returns:
        mean_gradient: Average gradient estimate
        gradient_variance: Variance of gradient estimates
    """
    all_gradients = []

    for _ in range(n_episodes):
        # Collect episode
        states, actions, rewards = [], [], []
        state, _ = env.reset()
        done = False

        while not done:
            state_t = torch.tensor(state, dtype=torch.float32)
            states.append(state_t)

            with torch.no_grad():
                # .item() converts the sampled tensor to a plain int for env.step
                action = policy.sample(state_t.unsqueeze(0)).item()
            actions.append(action)

            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            rewards.append(reward)

        # Compute returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns, dtype=torch.float32)

        # Compute gradient for this episode
        policy.zero_grad()
        states_tensor = torch.stack(states)
        actions_tensor = torch.tensor(actions)

        log_probs = policy.log_prob(states_tensor, actions_tensor)
        loss = -(log_probs * returns).sum()
        loss.backward()

        # Store gradient
        episode_gradient = []
        for param in policy.parameters():
            if param.grad is not None:
                episode_gradient.append(param.grad.clone().flatten())
        episode_gradient = torch.cat(episode_gradient)
        all_gradients.append(episode_gradient)

    # Compute statistics
    all_gradients = torch.stack(all_gradients)
    mean_gradient = all_gradients.mean(dim=0)
    gradient_variance = all_gradients.var(dim=0)

    return mean_gradient, gradient_variance

Why Does This Matter?

High variance is the fundamental limitation of REINFORCE. It explains:

Why learning is slow: We need many samples to average out the noise

Why hyperparameters are sensitive: A learning rate that works for low-noise gradients causes instability with high-noise gradients

Why we need improvements: Baselines, actor-critic methods, and other techniques exist primarily to reduce variance

Understanding the variance problem motivates everything that comes next in policy gradient methods.

Preview: Variance Reduction

The good news: we can reduce variance dramatically while keeping the gradient unbiased. The key techniques are:

  1. Baselines (next section): Subtract a baseline from returns to reduce their magnitude without changing the expected gradient

  2. Actor-Critic (next chapter): Use value function estimates instead of Monte Carlo returns, trading some bias for much lower variance

  3. GAE (later): Generalized Advantage Estimation blends MC and TD to optimize the bias-variance tradeoff
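The baseline idea can be previewed with the same kind of toy setup used earlier (a hedged sketch with a synthetic mean-zero score term, not a full REINFORCE implementation): subtracting a constant from the returns leaves the expected estimate unchanged, because the score term has mean zero, but shrinks the variance dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

score = rng.choice([-1.0, 1.0], size=n)    # stand-in for grad log pi, mean zero
returns = rng.normal(100.0, 10.0, size=n)  # high-magnitude returns

baseline = returns.mean()                  # simple constant baseline
g_raw = score * returns
g_based = score * (returns - baseline)

print(f"Mean estimate, no baseline:   {g_raw.mean():8.3f}")
print(f"Mean estimate, with baseline: {g_based.mean():8.3f}")
print(f"Var, no baseline:             {g_raw.var():8.1f}")
print(f"Var, with baseline:           {g_based.var():8.1f}")
```

Both estimators have the same expectation, but the baselined one keeps only the informative spread of the returns instead of their raw magnitude. The next section makes this rigorous.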

💡Tip

The variance problem isn’t a flaw to be ashamed of - it’s a fundamental property of Monte Carlo estimation. Recognizing it helps us understand why modern algorithms like PPO include so many variance reduction techniques.

Summary

The variance problem in REINFORCE arises from:

  • Stochastic environments: Same actions lead to different outcomes
  • Stochastic policies: Different actions sampled each episode
  • Long time horizons: Returns accumulate randomness over many steps
  • Return-weighted gradients: High returns create high-magnitude gradients

This leads to:

  • Slow learning
  • Unstable training
  • Need for low learning rates
  • Sample inefficiency

The next section introduces baselines - our first tool for taming this variance.