The Policy Gradient Theorem
The Policy Gradient Theorem is the theoretical cornerstone of all policy gradient methods. It answers a crucial question: how do we compute the gradient of expected return when the return depends on environment dynamics we don’t know?
The Problem
We want to improve our policy by gradient ascent on expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)], \qquad \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

But computing $\nabla_\theta J(\theta)$ seems to require knowing how the environment works. When we change the policy, different trajectories become more or less likely, and those trajectories depend on environment dynamics.
The remarkable insight of the policy gradient theorem is that we can compute this gradient using only samples from our policy - no need to know transition probabilities.
The Log-Derivative Trick
The key mathematical tool is the log-derivative trick (also called the REINFORCE trick or score function trick).
For any probability distribution $p_\theta(x)$ that depends on parameters $\theta$:

$$\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$$

Proof:

Starting from the chain rule:

$$\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}$$

Rearranging:

$$\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$$

Why this matters: The left side has $\nabla_\theta p_\theta(x)$, which is hard to work with in expectations. The right side has $p_\theta(x)$ times something, which can be converted to an expectation!
The log-derivative trick converts a gradient of a probability into an expectation. This is crucial because:
- We can’t easily compute $\nabla_\theta P(\tau \mid \theta)$ (the gradient of trajectory probability)
- But we CAN sample trajectories from $\pi_\theta$
- And we CAN compute $\nabla_\theta \log \pi_\theta(a \mid s)$ for our policy
The trick lets us estimate gradients by sampling.
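To make this concrete, here is a small numerical sanity check (a sketch, not from the original text; the 3-outcome distribution, the function `f`, and all variable names are illustrative): the score-function estimate of $\nabla_\theta \mathbb{E}[f(x)]$ computed from samples matches the analytic gradient of a softmax-parameterized categorical distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a categorical distribution over 3 outcomes via softmax logits,
# and an arbitrary function f of the outcome. Values are illustrative.
theta = np.array([0.2, -0.1, 0.5])
f = np.array([1.0, 3.0, -2.0])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(theta)

# Analytic gradient of E[f(x)]: dp_k/dtheta_j = p_k (delta_kj - p_j)
jac = np.diag(p) - np.outer(p, p)   # symmetric Jacobian of p w.r.t. theta
grad_analytic = jac @ f

# Score-function (log-derivative) estimate: E[f(x) * grad log p(x)],
# where grad_j log p(x = k) = delta_kj - p_j for a softmax.
samples = rng.choice(3, size=200_000, p=p)
scores = np.eye(3)[samples] - p
grad_estimate = (f[samples][:, None] * scores).mean(axis=0)

print(grad_analytic)
print(grad_estimate)   # close to the analytic gradient
```

The estimator never differentiates through the sampling itself; it only needs samples and the gradient of the log-probability, which is exactly what makes it usable when the sampler (the environment) is a black box.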
Deriving the Policy Gradient Theorem
Let’s derive the policy gradient step by step.
Step 1: Write out the objective

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \int P(\tau \mid \theta)\, R(\tau)\, d\tau$$

Step 2: Take the gradient

$$\nabla_\theta J(\theta) = \int \nabla_\theta P(\tau \mid \theta)\, R(\tau)\, d\tau$$

Step 3: Apply the log-derivative trick

$$\nabla_\theta J(\theta) = \int P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\right]$$

Step 4: Expand the trajectory probability

$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$

Taking the log:

$$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \Big[\log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t)\Big]$$

Step 5: Take the gradient (key insight!)

The initial state distribution $\rho_0(s_0)$ and transition dynamics $P(s_{t+1} \mid s_t, a_t)$ don't depend on $\theta$, so their gradients are zero:

$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Step 6: Final result

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$

The gradient of expected return equals the expected sum of log-policy gradients, weighted by trajectory return.
Intuition: Why This Works
Let’s understand why multiplying by returns makes sense.
Consider a single trajectory where we took actions $a_0, a_1, \ldots, a_{T-1}$ and got return $R(\tau)$.

If $R(\tau)$ is high (good trajectory):

- We want to increase $\pi_\theta(a_t \mid s_t)$ for all the actions we took
- $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ points in the direction that increases action probability
- Multiplying by positive $R(\tau)$ pushes $\theta$ to make these actions more likely

If $R(\tau)$ is low (bad trajectory):

- We want to decrease $\pi_\theta(a_t \mid s_t)$ for these actions
- Multiplying by small or negative $R(\tau)$ pushes $\theta$ to make these actions less likely
The gradient naturally reinforces good actions and suppresses bad ones!
Bandit Example
Consider a 2-armed bandit with softmax policy:

$$\pi_\theta(a) = \frac{e^{\theta_a}}{e^{\theta_1} + e^{\theta_2}}, \qquad a \in \{1, 2\}$$

Suppose we sample action $a_1$ and get reward $r$.

The gradient $\nabla_\theta \log \pi_\theta(a_1)$ points toward increasing $\theta_1$ relative to $\theta_2$.

Multiplied by a large reward $r$, this creates a strong push toward action $a_1$.

If instead the reward were small, the push would be weak.
This is exactly what we want: good outcomes create strong updates, poor outcomes create weak updates.
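The bandit update can be sketched in a few lines (my own illustration: the reward values 10.0 and 0.1 and the learning rate are arbitrary; for a softmax over logits, $\nabla_\theta \log \pi_\theta(a) = \text{onehot}(a) - \pi$):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.0, 0.0])  # one logit per arm; policy starts uniform

def reinforce_update(theta, action, reward, lr=0.1):
    """One policy-gradient step: grad log pi(a) = onehot(a) - pi."""
    pi = softmax(theta)
    grad_log_pi = np.eye(2)[action] - pi
    return theta + lr * reward * grad_log_pi

# A large reward creates a strong push toward the sampled arm...
theta_good = reinforce_update(theta, action=0, reward=10.0)
# ...while a small reward barely moves the policy.
theta_weak = reinforce_update(theta, action=0, reward=0.1)

print(softmax(theta_good))  # probability of arm 0 rises sharply
print(softmax(theta_weak))  # probability of arm 0 rises only slightly
```

The same sampled action produces very different update magnitudes depending only on the reward that followed it, which is the reinforcement behavior described above.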
The Reward-to-Go Refinement
The basic theorem uses the full trajectory return for every timestep. We can do better.
Key observation: Action $a_t$ can only affect rewards at times $t' \geq t$. It cannot affect rewards at earlier times.

Therefore, we can use the reward-to-go (also called the return from time $t$):

$$\hat{R}_t = \sum_{t'=t}^{T-1} r_{t'}$$

The refined policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{R}_t\right]$$

This is mathematically equivalent in expectation but has lower variance, because we remove the irrelevant past rewards.

Imagine you're at timestep 6. The rewards from steps 1-5 have already happened - nothing you do now can change them. So why should they influence your gradient?

Using reward-to-go means each action is only credited for future rewards it could have influenced. This makes the gradient estimate less noisy.
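A toy simulation (my own sketch, not from the text; all numbers are illustrative) shows the variance reduction directly: in a 2-step episode where the first reward is pure noise that the second action cannot influence, weighting the second step's score by the full return and by the reward-to-go gives the same mean, but the full-return version is far noisier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-step toy episode: r0 is pure noise (independent of all actions);
# r1 = 1 if the second action is arm 0. Uniform softmax policy over 2 arms,
# so d log pi(a1)/d theta_0 = onehot(a1)[0] - 0.5 = +/- 0.5 (the "score").
N = 100_000
full = np.empty(N)   # score * full return (r0 + r1)
rtg = np.empty(N)    # score * reward-to-go (just r1)
for i in range(N):
    a1 = rng.integers(2)
    r0 = rng.normal(0.0, 5.0)           # irrelevant past reward
    r1 = 1.0 if a1 == 0 else 0.0
    score = 0.5 if a1 == 0 else -0.5
    full[i] = score * (r0 + r1)
    rtg[i] = score * r1

# Same mean (both are unbiased for this gradient term)...
print(full.mean(), rtg.mean())
# ...but the full-return estimator has much higher variance.
print(full.var(), rtg.var())
```

The past-noise term $r_0$ is uncorrelated with the score, so it contributes zero to the mean and only adds variance; dropping it is free.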
Estimating the Gradient from Samples
Since the gradient is an expectation, we can estimate it with samples:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i - 1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{R}_{i,t}$$

where:

- $N$ is the number of sampled trajectories
- $T_i$ is the length of trajectory $i$
- $(s_{i,t}, a_{i,t})$ is the state-action pair at time $t$ in trajectory $i$
- $\hat{R}_{i,t}$ is the reward-to-go from time $t$ in trajectory $i$
This is an unbiased estimate: its expected value equals the true gradient.
```python
import torch
import torch.nn as nn
import numpy as np

def compute_returns(rewards, gamma=0.99):
    """
    Compute reward-to-go for each timestep.

    Args:
        rewards: List of rewards [r_1, r_2, ..., r_T]
        gamma: Discount factor

    Returns:
        List of returns [G_0, G_1, ..., G_{T-1}]
    """
    returns = []
    G = 0
    # Work backwards from end of episode
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns

def estimate_policy_gradient(policy, trajectories):
    """
    Estimate the policy gradient from sampled trajectories.

    Args:
        policy: Policy network with a .log_prob(state, action) method
        trajectories: List of (states, actions, rewards) tuples

    Returns:
        Scalar loss whose .backward() accumulates the negative policy
        gradient estimate in policy.parameters()
    """
    policy_loss = 0
    for states, actions, rewards in trajectories:
        # Compute returns for this trajectory
        returns = compute_returns(rewards)
        returns = torch.tensor(returns, dtype=torch.float32)
        # Compute log probabilities
        states_t = torch.stack(states)
        actions_t = torch.tensor(actions)
        log_probs = policy.log_prob(states_t, actions_t)
        # Policy gradient loss: -log_prob * return
        # Negative because we minimize loss (equivalent to maximizing J)
        trajectory_loss = -(log_probs * returns).sum()
        policy_loss = policy_loss + trajectory_loss
    # Average over trajectories
    policy_loss = policy_loss / len(trajectories)
    return policy_loss

# Example usage
class SimplePolicy(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.Tanh(),
            nn.Linear(64, n_actions)
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)

    def log_prob(self, state, action):
        probs = self.forward(state)
        return torch.log(probs.gather(1, action.unsqueeze(1)).squeeze(1) + 1e-10)

# After collecting trajectories:
# loss = estimate_policy_gradient(policy, trajectories)
# loss.backward()
# optimizer.step()
```

Connection to Maximum Likelihood
The policy gradient has a nice interpretation as weighted maximum likelihood.
In supervised learning, maximum likelihood says: “increase the probability of the actions in the training data.”
Policy gradient says the same thing, but weighted by returns:
“Increase the probability of actions, but weight high-return actions more heavily.”
This is like supervised learning where the labels (actions) come from our own experience, and we trust high-return experiences more.
The gradient update can be written as:

$$\theta \leftarrow \theta + \alpha \sum_{t} \hat{R}_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Compare to maximum likelihood estimation:

$$\theta \leftarrow \theta + \alpha \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

The only difference is the weighting by $\hat{R}_t$. This weighting turns undifferentiated imitation into selective reinforcement of good behaviors.
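The correspondence is easy to see in code (a sketch; the tensor shapes and values are illustrative): the policy-gradient loss is the maximum-likelihood cross-entropy loss with each sampled action weighted by its return, and the two coincide when every return equals 1.

```python
import torch
import torch.nn.functional as F

# 4 timesteps, 3 actions; uniform logits for illustration.
logits = torch.zeros(4, 3)
actions = torch.tensor([0, 2, 1, 0])
returns = torch.ones(4)   # all returns equal to 1

log_probs = F.log_softmax(logits, dim=-1).gather(1, actions[:, None]).squeeze(1)

mle_loss = -log_probs.mean()              # plain maximum likelihood
pg_loss = -(returns * log_probs).mean()   # return-weighted likelihood

# With unit returns, the two losses (and hence gradients) coincide.
print(mle_loss.item(), pg_loss.item())
```

With non-uniform returns, the policy-gradient loss simply upweights the log-likelihood terms of high-return actions, which is the "selective imitation" view described above.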
The Score Function
The score function (or score) is defined as:

$$\nabla_\theta \log \pi_\theta(a \mid s)$$

It represents "the direction in parameter space that increases the log-probability of action $a$ in state $s$."

Key properties:

- Its expected value is zero: $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$
- This is why baselines work (next sections)

Proof that the expected score is zero:

Using the log-derivative trick backwards:

$$\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = \sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$$
This property is crucial for understanding baselines.
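A quick numeric check (a sketch with arbitrary logits): for a softmax policy the score of action $a$ is $\text{onehot}(a) - \pi$, and averaging it under the policy's own probabilities gives exactly zero.

```python
import numpy as np

# Softmax policy over 3 actions with arbitrary logits.
theta = np.array([1.0, -0.5, 0.3])
p = np.exp(theta - theta.max())
p /= p.sum()

# Score of action a: grad_theta log pi(a) = onehot(a) - p
scores = np.eye(3) - p

# Expected score under the policy: sum_a pi(a) * score(a)
expected_score = p @ scores
print(expected_score)   # approximately [0, 0, 0]
```

Because the expected score is zero, subtracting any action-independent baseline from the return leaves the gradient's expectation unchanged, which is the mechanism the next sections build on.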
Summary
The Policy Gradient Theorem provides a practical way to compute policy gradients:
- The log-derivative trick converts gradients of probabilities into expectations
- Environment dynamics cancel out - we only need to differentiate through our policy
- Sample-based estimation - we can estimate the gradient by sampling trajectories
- Reward-to-go reduces variance by only crediting actions for future rewards
- Intuition: High-return trajectories reinforce the actions taken; low-return trajectories suppress them
The next section puts this into a complete algorithm: REINFORCE.