Policy Gradient Methods • Part 1 of 4
📝Draft

The Policy Gradient Theorem

How to compute gradients for policies

The Policy Gradient Theorem

The Policy Gradient Theorem is the theoretical cornerstone of all policy gradient methods. It answers a crucial question: how do we compute the gradient of expected return when the return depends on environment dynamics we don’t know?

The Problem

We want to improve our policy by gradient ascent on expected return:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

But computing $\nabla_\theta J(\theta)$ seems to require knowing how the environment works. When we change the policy, different trajectories become more or less likely, and those trajectories depend on environment dynamics.

The remarkable insight of the policy gradient theorem is that we can compute this gradient using only samples from our policy - no need to know transition probabilities.

The Log-Derivative Trick

The key mathematical tool is the log-derivative trick (also called the REINFORCE trick or score function trick).

Mathematical Details

For any probability distribution $p_\theta(x)$ that depends on parameters $\theta$:

$$\nabla_\theta p_\theta(x) = p_\theta(x) \cdot \nabla_\theta \log p_\theta(x)$$

Proof:

Starting from the chain rule: $\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}$

Rearranging: $\nabla_\theta p_\theta(x) = p_\theta(x) \cdot \nabla_\theta \log p_\theta(x)$

Why this matters: The left side has $\nabla_\theta p_\theta(x)$, which is hard to work with in expectations. The right side has $p_\theta(x)$ times something, which can be converted to an expectation!

The log-derivative trick converts a gradient of a probability into an expectation. This is crucial because:

  • We can’t easily compute $\nabla_\theta P(\tau|\theta)$ (the gradient of the trajectory probability)
  • But we CAN sample trajectories from $P(\tau|\theta)$
  • And we CAN compute $\nabla_\theta \log \pi_\theta(a|s)$ for our policy

The trick lets us estimate gradients by sampling.
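The identity can be sanity-checked numerically. Below is a small sketch (not from the original text) using finite differences on a two-action softmax distribution; the parameter values are arbitrary:

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

theta = np.array([0.3, -0.7])
a = 0          # the action whose probability we differentiate
eps = 1e-6

# Left side: finite-difference estimate of grad p_theta(a)
grad_p = np.zeros_like(theta)
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = eps
    grad_p[i] = (softmax(theta + d)[a] - softmax(theta - d)[a]) / (2 * eps)

# Right side: p_theta(a) * finite-difference estimate of grad log p_theta(a)
grad_logp = np.zeros_like(theta)
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = eps
    grad_logp[i] = (np.log(softmax(theta + d)[a])
                    - np.log(softmax(theta - d)[a])) / (2 * eps)

lhs = grad_p
rhs = softmax(theta)[a] * grad_logp
print(lhs, rhs)  # the two sides agree to numerical precision
```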

Deriving the Policy Gradient Theorem

Mathematical Details

Let’s derive the policy gradient step by step.

Step 1: Write out the objective

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_\tau P(\tau|\theta)\, R(\tau)$$

Step 2: Take the gradient

$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta P(\tau|\theta) \cdot R(\tau)$$

Step 3: Apply the log-derivative trick

$$= \sum_\tau P(\tau|\theta) \cdot \nabla_\theta \log P(\tau|\theta) \cdot R(\tau)$$

$$= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log P(\tau|\theta) \cdot R(\tau) \right]$$

Step 4: Expand the trajectory probability

$$P(\tau|\theta) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) \cdot p(s_{t+1}|s_t, a_t)$$

Taking the log:

$$\log P(\tau|\theta) = \log p(s_0) + \sum_{t=0}^{T-1} \left[ \log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t) \right]$$

Step 5: Take gradient (key insight!)

$$\nabla_\theta \log P(\tau|\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)$$

The initial state distribution $p(s_0)$ and transition dynamics $p(s_{t+1}|s_t, a_t)$ don’t depend on $\theta$, so their gradients are zero!

Step 6: Final result

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau) \right]$$

📖Policy Gradient Theorem (Simple Form)

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau) \right]$$

The gradient of expected return equals the expected sum of log-policy gradients, weighted by trajectory return.

Intuition: Why This Works

Let’s understand why multiplying by returns makes sense.

Consider a single trajectory where we took actions $a_0, a_1, \ldots, a_{T-1}$ and got return $R$.

If $R$ is high (good trajectory):

  • We want to increase $\pi_\theta(a_t|s_t)$ for all the actions we took
  • $\nabla_\theta \log \pi_\theta(a_t|s_t)$ points in the direction that increases the action’s probability
  • Multiplying by a large positive $R$ pushes $\theta$ to make these actions more likely

If $R$ is low (bad trajectory):

  • We want to decrease $\pi_\theta(a_t|s_t)$ for these actions
  • A negative $R$ actively pushes $\theta$ away from these actions; a small positive $R$ gives only a weak push toward them

The gradient naturally reinforces good actions and suppresses bad ones!

📌Example

Bandit Example

Consider a 2-armed bandit with softmax policy:

  • $\pi_\theta(a_1) = \frac{e^{\theta_1}}{e^{\theta_1} + e^{\theta_2}}$
  • $\pi_\theta(a_2) = \frac{e^{\theta_2}}{e^{\theta_1} + e^{\theta_2}}$

Suppose we sample action $a_1$ and get reward $R = 10$.

The gradient $\nabla_\theta \log \pi_\theta(a_1)$ points toward increasing $\theta_1$ relative to $\theta_2$.

Multiplied by $R = 10$, this creates a strong push toward action $a_1$.

If instead we got $R = 0.1$, the push would be weak.

This is exactly what we want: good outcomes create strong updates, poor outcomes create weak updates.
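This bandit update can be checked with PyTorch autograd. A minimal sketch, assuming both arms start at $\theta_1 = \theta_2 = 0$ (so $\pi = (0.5, 0.5)$):

```python
import torch

theta = torch.zeros(2, requires_grad=True)  # both arms start equally likely

def log_prob(theta, a):
    return torch.log_softmax(theta, dim=0)[a]

# Sample a_1 (index 0), observe reward R, and look at the ascent direction
grads = []
for R in (10.0, 0.1):
    if theta.grad is not None:
        theta.grad.zero_()
    # Gradient of R * log pi(a_1); ascending on it pushes toward a_1
    (R * log_prob(theta, 0)).backward()
    grads.append(theta.grad.clone())

# For a softmax, grad log pi(a_1) = (1 - pi(a_1), -pi(a_2)) = (0.5, -0.5) here,
# so R = 10 gives (5, -5) while R = 0.1 gives only (0.05, -0.05)
print(grads)
```

Both rewards push in the same direction; only the magnitude differs, which is exactly the weak-versus-strong update described above.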

The Reward-to-Go Refinement

The basic theorem uses the full trajectory return $R(\tau)$ for every timestep. We can do better.

Mathematical Details

Key observation: Action $a_t$ can only affect rewards at times $t+1, t+2, \ldots$ It cannot affect rewards at earlier times.

Therefore, we can use the reward-to-go (also called the return from time $t$):

$$G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$$

The refined policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$

This is equivalent in expectation but has lower variance, because we remove the irrelevant past rewards.

Imagine you’re at timestep $t=5$. The rewards from steps 1 through 5 have already happened; nothing you do now can change them. So why should they influence your gradient?

Using reward-to-go means each action is only credited for future rewards it could have influenced. This makes the gradient estimate less noisy.

📖Policy Gradient Theorem (Reward-to-Go Form)

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$

where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$ is the return from time $t$.

Estimating the Gradient from Samples

Mathematical Details

Since the gradient is an expectation, we can estimate it with samples:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i-1} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \cdot G_t^{(i)}$$

where:

  • $N$ is the number of sampled trajectories
  • $T_i$ is the length of trajectory $i$
  • $(s_t^{(i)}, a_t^{(i)})$ is the state-action pair at time $t$ in trajectory $i$
  • $G_t^{(i)}$ is the reward-to-go from time $t$ in trajectory $i$

This is an unbiased estimate: its expected value equals the true gradient.

</>Implementation
import torch
import torch.nn as nn
import numpy as np

def compute_returns(rewards, gamma=0.99):
    """
    Compute reward-to-go for each timestep.

    Args:
        rewards: List of rewards [r_1, r_2, ..., r_T]
        gamma: Discount factor

    Returns:
        List of returns [G_0, G_1, ..., G_{T-1}]
    """
    returns = []
    G = 0
    # Work backwards from end of episode
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns


def estimate_policy_gradient(policy, trajectories):
    """
    Estimate policy gradient from sampled trajectories.

    Args:
        policy: Policy network with .log_prob(state, action) method
        trajectories: List of (states, actions, rewards) tuples

    Returns:
        Scalar surrogate loss; calling .backward() on it accumulates
        the negative of the policy-gradient estimate in policy.parameters()
    """
    policy_loss = 0

    for states, actions, rewards in trajectories:
        # Compute returns for this trajectory
        returns = compute_returns(rewards)
        returns = torch.tensor(returns, dtype=torch.float32)

        # Compute log probabilities
        states_t = torch.stack(states)
        actions_t = torch.tensor(actions)
        log_probs = policy.log_prob(states_t, actions_t)

        # Policy gradient loss: -log_prob * return
        # Negative because we minimize loss (equivalent to maximizing J)
        trajectory_loss = -(log_probs * returns).sum()
        policy_loss = policy_loss + trajectory_loss

    # Average over trajectories
    policy_loss = policy_loss / len(trajectories)

    return policy_loss


# Example usage
class SimplePolicy(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.Tanh(),
            nn.Linear(64, n_actions)
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)

    def log_prob(self, state, action):
        # log_softmax is numerically safer than log(softmax(...) + eps)
        log_probs = torch.log_softmax(self.net(state), dim=-1)
        return log_probs.gather(1, action.unsqueeze(1)).squeeze(1)


# After collecting trajectories:
# optimizer.zero_grad()
# loss = estimate_policy_gradient(policy, trajectories)
# loss.backward()
# optimizer.step()

Connection to Maximum Likelihood

The policy gradient has a nice interpretation as weighted maximum likelihood.

In supervised learning, maximum likelihood says: “increase the probability of the actions in the training data.”

$$\max_\theta \sum_t \log \pi_\theta(a_t|s_t)$$

Policy gradient says the same thing, but weighted by returns:

$$\max_\theta \sum_t G_t \cdot \log \pi_\theta(a_t|s_t)$$

“Increase the probability of actions, but weight high-return actions more heavily.”

This is like supervised learning where the labels (actions) come from our own experience, and we trust high-return experiences more.

Mathematical Details

The gradient update can be written as:

$$\theta \leftarrow \theta + \alpha \sum_t G_t \nabla_\theta \log \pi_\theta(a_t|s_t)$$

Compare to maximum likelihood estimation:

$$\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)$$

The only difference is the weighting by $G_t$. This weighting turns undifferentiated imitation into selective reinforcement of good behaviors.
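The contrast can be made concrete with toy numbers. In this sketch the per-step log-probabilities and returns are hypothetical values chosen for illustration:

```python
import torch

# Hypothetical data from one trajectory: log pi(a_t|s_t) and reward-to-go G_t
log_probs = torch.tensor([-0.5, -1.2, -0.3])
returns   = torch.tensor([ 2.0,  0.1,  1.0])

# Maximum likelihood: every taken action is weighted equally
ml_loss = -log_probs.sum()

# Policy gradient: each action is weighted by its return
pg_loss = -(returns * log_probs).sum()

print(ml_loss.item(), pg_loss.item())
```

The second timestep had a near-zero return, so its log-probability barely contributes to `pg_loss`; maximum likelihood would have imitated it just as strongly as the others.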

The Score Function

📖Score Function

The score function (or score) is defined as:

$$\nabla_\theta \log \pi_\theta(a|s)$$

It represents “the direction in parameter space that increases the log-probability of action $a$ in state $s$.”

Key properties:

  • Its expected value is zero: $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$
  • This is why baselines work (covered in the next sections)

Mathematical Details

Proof that the expected score is zero:

$$\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)$$

Using the log-derivative trick backwards:

$$= \sum_a \nabla_\theta \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$$

This property is crucial for understanding baselines.
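The zero-expected-score property is easy to verify numerically for a small softmax policy. A sketch with arbitrary parameter values, summing $\pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)$ over all actions:

```python
import torch

# Softmax policy over 3 actions in a single state; theta values are arbitrary
theta = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
probs = torch.softmax(theta, dim=0).detach()

expected_score = torch.zeros(3)
for a in range(3):
    # Score for action a: gradient of log pi(a) with respect to theta
    score = torch.autograd.grad(torch.log_softmax(theta, dim=0)[a], theta)[0]
    expected_score += probs[a] * score

print(expected_score)  # numerically ~ [0, 0, 0]
```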

Summary

The Policy Gradient Theorem provides a practical way to compute policy gradients:

  • The log-derivative trick converts gradients of probabilities into expectations
  • Environment dynamics cancel out - we only need to differentiate through our policy
  • Sample-based estimation - we can estimate the gradient by sampling trajectories
  • Reward-to-go reduces variance by only crediting actions for future rewards
  • Intuition: High-return trajectories reinforce the actions taken; low-return trajectories suppress them

The next section puts this into a complete algorithm: REINFORCE.