Policy Gradient Methods • Part 3 of 3

The Policy Objective

What we're trying to maximize

We have a parameterized stochastic policy $\pi_\theta(a|s)$. Now we need to define what “good” means - what objective should we maximize by adjusting $\theta$?

The Goal: Maximize Expected Return

The answer is straightforward: we want the policy that gets the most reward on average. If we follow the policy many times, we want the average total reward to be as high as possible.

This is the expected return - the reward we expect to accumulate when following the policy.

📖Policy Objective

The policy objective $J(\theta)$ is the expected return when starting from the initial state distribution and following policy $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

where:

  • $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$ is a trajectory
  • $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$ is the discounted return
  • The expectation is over trajectories sampled by following $\pi_\theta$

Our goal is to find the optimal parameters: $\theta^* = \arg\max_\theta J(\theta)$
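To make the objective concrete, here is a minimal sketch of computing the return $R(\tau)$ for one trajectory's reward sequence (the `discounted_return` helper is a hypothetical name, purely for illustration):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_{t+1} for one trajectory's rewards."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

$J(\theta)$ is then the average of this quantity over many trajectories sampled by following $\pi_\theta$.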

Understanding the Expectation

Mathematical Details

The expected return can be written explicitly as:

$$J(\theta) = \sum_\tau P(\tau | \theta) R(\tau)$$

where $P(\tau | \theta)$ is the probability of trajectory $\tau$ under policy $\pi_\theta$.

This probability factors as:

$$P(\tau | \theta) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t | s_t) \cdot p(s_{t+1} | s_t, a_t)$$

Notice that $P(\tau | \theta)$ depends on:

  • The policy $\pi_\theta$ (which we control)
  • The environment dynamics $p(s_{t+1} | s_t, a_t)$ (which we don’t control and often don’t know)

The trajectory probability combines two factors at each step:

  1. How likely is the agent to choose this action? - determined by the policy
  2. How likely is the environment to transition to this state? - determined by physics

We can change the first factor by changing $\theta$. The second factor is fixed by the environment.
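The factorization can be checked numerically on a toy MDP. All transition and policy probabilities below are made-up numbers, just to show the two factors at work:

```python
import numpy as np

# A tiny 2-state, 2-action MDP (illustrative numbers only).
# policy[s, a]       = pi_theta(a | s)   -- the factor we control
# dynamics[s, a, s'] = p(s' | s, a)      -- fixed by the environment
policy = np.array([[0.7, 0.3],
                   [0.4, 0.6]])
dynamics = np.array([[[0.9, 0.1], [0.2, 0.8]],
                     [[0.5, 0.5], [0.1, 0.9]]])
p0 = np.array([1.0, 0.0])  # always start in state 0

def trajectory_prob(states, actions):
    """P(tau | theta) = p(s0) * prod_t pi(a_t|s_t) * p(s_{t+1}|s_t,a_t)."""
    prob = p0[states[0]]
    for t, a in enumerate(actions):
        prob *= policy[states[t], a] * dynamics[states[t], a, states[t + 1]]
    return prob

# Trajectory s0=0, a0=1, s1=1, a1=0, s2=0:
# 1.0 * pi(1|0)*p(1|0,1) * pi(0|1)*p(0|1,0) = 0.3*0.8 * 0.4*0.5 = 0.048
print(trajectory_prob([0, 1, 0], [1, 0]))
```

Changing `policy` changes this probability; changing `dynamics` is not an option we have.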

Alternative Objective Formulations

Depending on the setting, there are several equivalent ways to express the objective.

Mathematical Details

Episodic tasks (finite horizon): $J(\theta) = \mathbb{E}_{s_0, a_0, \ldots} \left[ \sum_{t=0}^{T-1} \gamma^t r_{t+1} \right]$

where episodes terminate after at most $T$ steps.

Continuing tasks with discounting: $J(\theta) = \mathbb{E}_{s_0 \sim d_0} \left[ V^{\pi_\theta}(s_0) \right]$

where $d_0$ is the initial state distribution and $V^{\pi_\theta}$ is the value function under $\pi_\theta$.

Average reward (undiscounted continuing tasks): $J(\theta) = \lim_{T \to \infty} \frac{1}{T} \mathbb{E} \left[ \sum_{t=0}^{T-1} r_{t+1} \right]$

For most practical purposes, the episodic formulation with discounting works well.
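As a quick sanity check on how these formulations relate: for a constant reward stream, the discounted objective approaches $1/(1-\gamma)$ (so $1/(1-\gamma)$ acts as an effective horizon), while the average-reward objective is just the per-step reward. A numerical sketch:

```python
import numpy as np

T = 10_000
rewards = np.ones(T)  # constant reward of 1 per step
gamma = 0.99

# Discounted objective: sum_t gamma^t r_{t+1} -> 1/(1 - gamma) as T grows
discounted = float(np.sum(gamma ** np.arange(T) * rewards))

# Average-reward objective: (1/T) sum_t r_{t+1}
average = float(np.mean(rewards))

print(discounted)  # ~100.0, i.e. 1/(1 - 0.99)
print(average)     # 1.0
```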

Gradient Ascent on $J(\theta)$

We want to find $\theta$ that maximizes $J(\theta)$. The natural approach is gradient ascent:

  1. Compute the gradient $\nabla_\theta J(\theta)$ - how does expected return change as we tweak $\theta$?
  2. Update $\theta$ in the direction that increases $J$: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
  3. Repeat until convergence

This is just like gradient descent in supervised learning, but we’re maximizing (ascending) instead of minimizing.

Mathematical Details

The gradient ascent update rule:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$$

where $\alpha > 0$ is the learning rate.

If we can compute $\nabla_\theta J(\theta)$, we can iteratively improve the policy. The challenge is that $J(\theta)$ involves:

  • An expectation over trajectories
  • Environment dynamics that we don’t know
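The mechanics of the update itself are straightforward. Here is a sketch of gradient ascent on a toy objective whose gradient we can compute analytically - standing in for $J(\theta)$, which we cannot differentiate directly:

```python
import numpy as np

def J(theta):
    """A toy, analytically known objective standing in for expected return."""
    return -np.sum((theta - 2.0) ** 2)  # maximized at theta = [2, 2, ...]

def grad_J(theta):
    """Exact gradient of the toy objective."""
    return -2.0 * (theta - 2.0)

theta = np.zeros(3)
alpha = 0.1  # learning rate

for _ in range(100):
    theta = theta + alpha * grad_J(theta)  # ascent: move *up* the gradient

print(theta)  # converges close to [2, 2, 2]
```

For the real objective, everything except the update rule changes: the gradient must be estimated from sampled trajectories rather than computed in closed form.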

The Challenge: Computing the Gradient

Here’s the problem. We want:

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

But the expectation is over trajectories that depend on $\theta$. When we change $\theta$, we change which trajectories are likely to occur.

It’s like trying to compute “if I change my driving route, how does my average commute time change?” - but when you change your route, you encounter different traffic patterns and the whole distribution shifts.

Mathematical Details

Writing out the gradient:

$$\nabla_\theta J(\theta) = \nabla_\theta \sum_\tau P(\tau | \theta) R(\tau)$$

If we try to pull the gradient inside:

$$= \sum_\tau \nabla_\theta P(\tau | \theta) \cdot R(\tau)$$

The problem: $P(\tau | \theta) = p(s_0) \prod_t \pi_\theta(a_t|s_t) \cdot p(s_{t+1}|s_t, a_t)$

We can differentiate $\pi_\theta(a_t|s_t)$, but $p(s_{t+1}|s_t, a_t)$ is the unknown environment dynamics. How do we differentiate through something we don’t know?

Preview: The Policy Gradient Theorem

The remarkable insight of the policy gradient theorem is that we can compute the gradient without knowing the environment dynamics. The gradient turns out to be:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$

where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$ is the return from time step $t$.

This is an expectation over trajectories from our current policy, involving only:

  • The gradient of the log-policy (which we know)
  • The returns (which we can observe)

We don’t need to know environment dynamics - we just sample trajectories and compute this estimate!
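As a preview of what this looks like in practice, here is a minimal sketch of the estimator on a toy two-armed bandit (single-step trajectories, softmax policy). All names and numbers are illustrative, not the full algorithm of the next chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits of a softmax policy over two actions
alpha = 0.1          # learning rate

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One-step "trajectories": the environment pays 1.0 for action 0, 0.0 for action 1.
for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)          # sample an action from the policy
    G = 1.0 if a == 0 else 0.0          # observed return

    # Gradient of log softmax at the sampled action: one_hot(a) - probs
    grad_log_pi = np.eye(2)[a] - probs

    # Sample estimate of grad log pi(a|s) * G, used as the ascent direction
    theta += alpha * grad_log_pi * G

print(softmax(theta))  # probability mass shifts toward the rewarding action 0
```

Note that the update never touches environment dynamics: it only uses the policy's own log-probability gradient and the observed return.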

Estimating the Objective

Even if we can’t compute the true gradient, we can estimate the objective from samples.

</>Implementation
import numpy as np
import torch

def estimate_objective(policy, env, n_episodes=100, gamma=0.99):
    """
    Estimate J(theta) by sampling episodes.

    Args:
        policy: A policy with .sample(state) method
        env: Gym-like environment
        n_episodes: Number of episodes to sample
        gamma: Discount factor

    Returns:
        Tuple of (mean return, standard error) - the mean return
        is a Monte Carlo estimate of J(theta)
    """
    returns = []

    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        episode_return = 0
        discount = 1.0

        while not done:
            # Sample action from policy
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            action = policy.sample(state_tensor)

            # Take step in environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Accumulate discounted return
            episode_return += discount * reward
            discount *= gamma

            state = next_state

        returns.append(episode_return)

    return np.mean(returns), np.std(returns) / np.sqrt(n_episodes)


# Note: This estimates J(theta), not its gradient
# The gradient estimation is more involved (next chapter)

Finite Differences (A Naive Approach)

One way to estimate the gradient is finite differences: perturb each parameter slightly, measure how JJ changes, and use that as a gradient estimate.

$$\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon}$$

where $e_i$ is a unit vector in direction $i$.

This works but is extremely inefficient - you need at least two objective evaluations per parameter, and each evaluation requires many episodes.

</>Implementation
def finite_difference_gradient(policy, env, theta, epsilon=0.01, n_episodes=50):
    """
    Estimate gradient via finite differences.

    WARNING: Very slow! O(|theta| * n_episodes) environment interactions.
    """
    gradient = np.zeros_like(theta)

    for i in range(len(theta)):
        # Perturb positively
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        policy.set_params(theta_plus)
        J_plus = estimate_objective(policy, env, n_episodes)[0]

        # Perturb negatively
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        policy.set_params(theta_minus)
        J_minus = estimate_objective(policy, env, n_episodes)[0]

        # Central difference estimate
        gradient[i] = (J_plus - J_minus) / (2 * epsilon)

    # Restore original parameters
    policy.set_params(theta)

    return gradient

The Landscape of $J(\theta)$

The objective $J(\theta)$ is a function of potentially millions of parameters. Its landscape can be:

  • Non-convex: Multiple local optima exist
  • Plateaus: Regions where gradients are near-zero
  • Cliffs: Regions where small changes cause large performance drops

Policy gradient methods navigate this landscape using stochastic gradient ascent. They don’t guarantee finding the global optimum, but they often find good local optima.

📌Example

Consider a simple policy with two parameters $(\theta_1, \theta_2)$. The objective landscape might look like a bumpy surface with multiple peaks:

  • Global maximum: The best possible policy
  • Local maxima: Good policies that aren’t optimal
  • Saddle points: Points where the gradient vanishes but that are neither maxima nor minima

Gradient ascent will converge to a local maximum (or saddle point). Different initializations may find different optima.
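This behavior is easy to reproduce on a made-up two-parameter landscape with one global and one local peak. Everything below is invented for illustration, with a numerical gradient standing in for the analytic one:

```python
import numpy as np

def J(theta):
    """A toy two-parameter landscape: global peak near (1, 0), local peak near (-1, 0)."""
    x, y = theta
    return np.exp(-((x - 1) ** 2 + y ** 2)) + 0.6 * np.exp(-((x + 1) ** 2 + y ** 2))

def grad_J(theta, eps=1e-5):
    """Central-difference gradient; fine for a 2-D illustration."""
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2)
        e[i] = eps
        g[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return g

def ascend(theta, alpha=0.1, steps=500):
    for _ in range(steps):
        theta = theta + alpha * grad_J(theta)
    return theta

# Different initializations converge to different peaks:
print(ascend(np.array([0.5, 0.2])))   # ends near the global maximum around (1, 0)
print(ascend(np.array([-0.5, 0.2])))  # ends near the local maximum around (-1, 0)
```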

Summary

The policy objective captures what we want to optimize:

  • Objective: Expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$
  • Goal: Find $\theta^* = \arg\max_\theta J(\theta)$
  • Method: Gradient ascent $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
  • Challenge: Computing $\nabla_\theta J(\theta)$ when the expectation depends on $\theta$

The next chapter reveals how the policy gradient theorem solves this challenge, enabling us to estimate gradients from samples alone.