REINFORCE and the Policy Gradient Theorem
What You'll Learn
- State and explain the Policy Gradient Theorem
- Implement the REINFORCE algorithm from scratch
- Understand the log-derivative trick and why we use log probabilities
- Identify and explain the variance problem in REINFORCE
- Implement baselines to reduce gradient variance
We want to improve our policy by gradient ascent. But there’s a problem: the gradient of expected return involves the environment dynamics, which we don’t know. The Policy Gradient Theorem provides an elegant solution - we can compute the gradient using only samples from our policy.
Why REINFORCE?
In value-based methods like Q-learning, we learned a value function and derived a policy from it. REINFORCE takes a fundamentally different approach: optimize the policy directly.
The key insight is that we can compute policy gradients without knowing how the environment works. We just need to:
- Sample trajectories from our current policy
- Compute returns for those trajectories
- Update the policy to make high-return actions more likely
Chapter Overview
This chapter derives the Policy Gradient Theorem and introduces REINFORCE, the simplest policy gradient algorithm. We’ll also tackle its main weakness - high variance - and introduce baselines as a solution.
The Policy Gradient Theorem
The mathematical foundation for computing policy gradients
The REINFORCE Algorithm
Monte Carlo policy gradients in action
The Variance Problem
Why REINFORCE gradients are noisy and unstable
Baselines
Reducing variance without introducing bias
The Big Picture
REINFORCE follows a simple recipe:
- Collect a complete episode using the current policy
- Compute the return (the cumulative, possibly discounted, reward from that timestep onward)
- Update the policy to increase the probability of actions that led to high returns
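The recipe above fits in a few lines of numpy. The following is a minimal sketch, not a full implementation: it assumes a toy one-step "environment" (a two-armed bandit where only action 1 pays reward) and a softmax policy over two logits, so each "episode" is a single action.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy one-step "environment": action 1 pays reward 1, action 0 pays 0.
def step(action):
    return 1.0 if action == 1 else 0.0

theta = np.zeros(2)   # one logit per action
alpha = 0.5           # learning rate

for episode in range(200):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)      # 1. sample an episode (one step here)
    G = step(a)                     # 2. compute the return
    grad_log = -probs               # grad of log softmax: e_a - probs
    grad_log[a] += 1.0
    theta += alpha * grad_log * G   # 3. reinforce high-return actions

print(softmax(theta))  # probability mass shifts toward the rewarding action
```

After a couple hundred episodes the policy concentrates almost all its probability on the rewarding action, despite never seeing the reward function itself - only sampled returns.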
The central object here is the policy gradient: the gradient of the expected return with respect to the policy parameters. It tells us how to adjust the policy to increase expected reward.
The Policy Gradient Theorem shows us that this gradient has a beautiful form:

∇_θ J(θ) = E_{τ ~ π_θ} [ Σ_t ∇_θ log π_θ(a_t | s_t) · G_t ]

Each action's log-probability gradient is weighted by the return G_t that followed it: high-return actions get reinforced; low-return actions get suppressed. That's the essence of REINFORCE.
Prerequisites
This chapter assumes familiarity with:
- The policy gradient objective from Introduction to Policy Gradients
- Stochastic policies and probability distributions over actions
- Basic calculus (gradients, chain rule)
Key Takeaways
- The Policy Gradient Theorem enables gradient computation without knowing environment dynamics
- The log-derivative trick converts the gradient into a weighted sum of log-probabilities
- REINFORCE updates the parameters via θ ← θ + α · G_t · ∇_θ log π_θ(a_t | s_t), raising the probability of actions that preceded high returns
- High variance is REINFORCE's main weakness - it needs complete episodes, and Monte Carlo return estimates are noisy
- Baselines reduce variance without changing the expected gradient
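The last two takeaways can be seen concretely in a small numpy experiment - again with illustrative, made-up logits and returns. Subtracting the expected return as a baseline leaves the mean gradient unchanged but shrinks the per-sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.array([0.2, -0.1, 0.4])   # illustrative logits
returns = np.array([1.0, 0.0, 2.0])  # stand-in return for each action
probs = softmax(theta)
b = probs @ returns                  # baseline: expected return under pi

n = 100_000
actions = rng.choice(3, size=n, p=probs)
grad_log = np.eye(3)[actions] - probs            # grad log pi(a) per sample

plain = returns[actions, None] * grad_log
baselined = (returns[actions] - b)[:, None] * grad_log

print(plain.mean(axis=0), baselined.mean(axis=0))  # same mean: no bias
print(plain.var(axis=0).sum(), baselined.var(axis=0).sum())  # lower variance
```

The unbiasedness falls out of E[∇_θ log π_θ(a)] = 0: any constant baseline multiplies a zero-mean term, so it cannot shift the expected gradient - it can only reduce its spread.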