REINFORCE and the Policy Gradient Theorem
What You'll Learn
- State and explain the Policy Gradient Theorem
- Implement the REINFORCE algorithm from scratch
- Understand the log-derivative trick and why we use log probabilities
- Identify and explain the variance problem in REINFORCE
- Implement baselines to reduce gradient variance
We want to improve our policy by gradient ascent. But there’s a problem: the gradient of expected return involves the environment dynamics, which we don’t know. The Policy Gradient Theorem provides an elegant solution - we can compute the gradient using only samples from our policy.
Why REINFORCE?
In value-based methods like Q-learning, we learned a value function and derived a policy from it. REINFORCE takes a fundamentally different approach: optimize the policy directly.
The key insight is that we can compute policy gradients without knowing how the environment works. We just need to:
- Sample trajectories from our current policy
- Compute returns for those trajectories
- Update the policy to make high-return actions more likely
Chapter Overview
This chapter derives the Policy Gradient Theorem and introduces REINFORCE, the simplest policy gradient algorithm. We’ll also tackle its main weakness - high variance - and introduce baselines as a solution.
The Policy Gradient Theorem
The mathematical foundation for computing policy gradients
The REINFORCE Algorithm
Monte Carlo policy gradients in action
The Variance Problem
Why REINFORCE gradients are noisy and unstable
Baselines
Reducing variance without introducing bias
The Big Picture
REINFORCE follows a simple recipe:
- Collect a complete episode using the current policy
- Compute the return (the cumulative, possibly discounted, reward from that timestep onward)
- Update the policy to increase the probability of actions that led to high returns
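The recipe above fits in a few lines of numpy. The following is a minimal sketch, not a full implementation: it assumes a toy one-step "environment" (a two-armed bandit where only action 1 pays reward) and a softmax policy over two logits, so each "episode" is a single action.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy one-step "environment": action 1 pays reward 1, action 0 pays 0.
def step(action):
    return 1.0 if action == 1 else 0.0

theta = np.zeros(2)   # one logit per action
alpha = 0.5           # learning rate

for episode in range(200):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)      # 1. sample an episode (one step here)
    G = step(a)                     # 2. compute the return
    grad_log = -probs               # grad of log softmax: e_a - probs
    grad_log[a] += 1.0
    theta += alpha * grad_log * G   # 3. reinforce high-return actions

print(softmax(theta))  # probability mass shifts toward the rewarding action
```

After a couple hundred episodes the policy concentrates almost all its probability on the rewarding action, despite never seeing the reward function itself - only sampled returns.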
The central object here is the policy gradient: the gradient of the expected return with respect to the policy parameters. It tells us how to adjust the policy to increase expected reward.
The Policy Gradient Theorem shows us that this gradient has a beautiful form:

∇_θ J(θ) = E_{τ ~ π_θ} [ Σ_t ∇_θ log π_θ(a_t | s_t) · G_t ]

Each action's log-probability gradient is weighted by the return G_t that followed it: high-return actions get reinforced; low-return actions get suppressed. That's the essence of REINFORCE.
Prerequisites
This chapter assumes familiarity with:
- The policy gradient objective from Introduction to Policy Gradients
- Stochastic policies and probability distributions over actions
- Basic calculus (gradients, chain rule)
Key Takeaways
- The Policy Gradient Theorem enables gradient computation without knowing environment dynamics
- The log-derivative trick converts the gradient into a weighted sum of log-probabilities
- REINFORCE updates the parameters via θ ← θ + α · G_t · ∇_θ log π_θ(a_t | s_t), raising the probability of actions that preceded high returns
- High variance is REINFORCE's main weakness - it needs complete episodes, and Monte Carlo return estimates are noisy
- Baselines reduce variance without changing the expected gradient
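The last two takeaways can be seen concretely in a small numpy experiment - again with illustrative, made-up logits and returns. Subtracting the expected return as a baseline leaves the mean gradient unchanged but shrinks the per-sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.array([0.2, -0.1, 0.4])   # illustrative logits
returns = np.array([1.0, 0.0, 2.0])  # stand-in return for each action
probs = softmax(theta)
b = probs @ returns                  # baseline: expected return under pi

n = 100_000
actions = rng.choice(3, size=n, p=probs)
grad_log = np.eye(3)[actions] - probs            # grad log pi(a) per sample

plain = returns[actions, None] * grad_log
baselined = (returns[actions] - b)[:, None] * grad_log

print(plain.mean(axis=0), baselined.mean(axis=0))  # same mean: no bias
print(plain.var(axis=0).sum(), baselined.var(axis=0).sum())  # lower variance
```

The unbiasedness falls out of E[∇_θ log π_θ(a)] = 0: any constant baseline multiplies a zero-mean term, so it cannot shift the expected gradient - it can only reduce its spread.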