The Policy Objective
We have a parameterized stochastic policy $\pi_\theta(a \mid s)$ with parameters $\theta$. Now we need to define what “good” means - what objective should we maximize by adjusting $\theta$?
The Goal: Maximize Expected Return
The answer is straightforward: we want the policy that gets the most reward on average. If we follow the policy many times, we want the average total reward to be as high as possible.
This is the expected return - the reward we expect to accumulate when following the policy.
The policy objective $J(\theta)$ is the expected return when starting from the initial state distribution and following policy $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$$
where:
- $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory
- $R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ is the discounted return
- The expectation is over trajectories sampled by following $\pi_\theta$
Our goal is to find the optimal parameters:

$$\theta^* = \arg\max_\theta J(\theta)$$
Understanding the Expectation
The expected return can be written explicitly as:

$$J(\theta) = \int p_\theta(\tau)\, R(\tau)\, d\tau$$

where $p_\theta(\tau)$ is the probability of trajectory $\tau$ under policy $\pi_\theta$.
This probability factors as:

$$p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Notice that $p_\theta(\tau)$ depends on:
- The policy $\pi_\theta(a_t \mid s_t)$ (which we control)
- The environment dynamics $p(s_{t+1} \mid s_t, a_t)$ (which we don’t control and often don’t know)
The trajectory probability combines two factors at each step:
- How likely is the agent to choose this action? $\pi_\theta(a_t \mid s_t)$ - determined by the policy
- How likely is the environment to transition to this state? $p(s_{t+1} \mid s_t, a_t)$ - determined by physics

We can change the first factor by changing $\theta$. The second factor is fixed by the environment.
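The split between controllable and fixed factors can be made concrete in code. The sketch below uses a hypothetical 2-state, 2-action MDP (all names and numbers are invented for illustration): the trajectory log-probability separates into a policy term that depends on $\theta$ and a dynamics term that does not.

```python
import numpy as np

def policy_probs(theta, s):
    # Softmax policy over 2 actions for state s; theta has shape (2, 2)
    logits = theta[s]
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Fixed environment dynamics P[s, a, s'] - the agent cannot change these
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])

def trajectory_log_prob(theta, trajectory):
    """log p_theta(tau) splits into a policy term and a dynamics term."""
    log_pi = 0.0   # depends on theta - we can differentiate this
    log_dyn = 0.0  # fixed by the environment - no theta dependence
    for (s, a, s_next) in trajectory:
        log_pi += np.log(policy_probs(theta, s)[a])
        log_dyn += np.log(P[s, a, s_next])
    return log_pi, log_dyn

theta = np.zeros((2, 2))  # uniform policy
log_pi, log_dyn = trajectory_log_prob(theta, [(0, 1, 1), (1, 0, 0)])
```

Changing $\theta$ moves only `log_pi`; `log_dyn` is a constant determined by the environment.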
Alternative Objective Formulations
Depending on the setting, there are several equivalent ways to express the objective.
Episodic tasks (finite horizon):

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$$

where episodes terminate after at most $T$ steps.
Continuing tasks with discounting:

$$J(\theta) = \mathbb{E}_{s_0 \sim \rho_0}\big[V^{\pi_\theta}(s_0)\big]$$

where $\rho_0$ is the initial state distribution and $V^{\pi_\theta}$ is the value function under $\pi_\theta$.
Average reward (undiscounted continuing tasks):

$$J_{\text{avg}}(\theta) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=0}^{T-1} r_t\right]$$
For most practical purposes, the episodic formulation with discounting works well.
Gradient Ascent on $J(\theta)$
We want to find $\theta$ that maximizes $J(\theta)$. The natural approach is gradient ascent:
- Compute the gradient $\nabla_\theta J(\theta)$ - how does expected return change as we tweak $\theta$?
- Update $\theta$ in the direction that increases $J(\theta)$
- Repeat until convergence
This is just like gradient descent in supervised learning, but we’re maximizing (ascending) instead of minimizing.
The gradient ascent update rule:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$

where $\alpha$ is the learning rate.
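The update loop itself is simple; the hard part is obtaining the gradient. As a sketch, here is gradient ascent on a toy objective $J(\theta) = -(\theta - 3)^2$, where the gradient is known in closed form (unlike in RL, where it must be estimated from trajectories):

```python
# Gradient ascent on a toy objective J(theta) = -(theta - 3)^2,
# whose exact gradient is dJ/dtheta = -2 * (theta - 3). In RL the
# gradient must instead be estimated from sampled trajectories.

def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta = 0.0
alpha = 0.1  # learning rate
for _ in range(100):
    theta = theta + alpha * grad_J(theta)  # ascend: step *with* the gradient

# theta approaches the maximizer theta* = 3
```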
If we can compute $\nabla_\theta J(\theta)$, we can iteratively improve the policy. The challenge is that $\nabla_\theta J(\theta)$ involves:
- An expectation over trajectories
- Environment dynamics that we don’t know
The Challenge: Computing the Gradient
Here’s the problem. We want:

$$\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$$

But the expectation is over trajectories that depend on $\theta$. When we change $\theta$, we change which trajectories are likely to occur.
It’s like trying to compute “if I change my driving route, how does my average commute time change?” - but when you change your route, you encounter different traffic patterns and the whole distribution shifts.
Writing out the gradient:

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau$$

If we try to pull the gradient inside:

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau$$
The problem:
We can differentiate $\pi_\theta(a_t \mid s_t)$, but $p(s_{t+1} \mid s_t, a_t)$ is the unknown environment dynamics. How do we differentiate through something we don’t know?
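To see the difficulty concretely, apply the product rule to the factored trajectory probability. Only the policy factors carry a $\theta$-gradient, but every resulting term still multiplies the unknown dynamics probabilities:

$$\nabla_\theta p_\theta(\tau) = \sum_{t=0}^{T-1} \rho_0(s_0)\, \nabla_\theta \pi_\theta(a_t \mid s_t) \prod_{t' \neq t} \pi_\theta(a_{t'} \mid s_{t'}) \prod_{t'=0}^{T-1} p(s_{t'+1} \mid s_{t'}, a_{t'})$$

We cannot evaluate the dynamics products, so this expression is not directly computable.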
This is the fundamental challenge of policy gradient methods: the gradient of an expectation over a distribution that depends on the parameters we’re differentiating.
The policy gradient theorem (next chapter) provides an elegant solution.
Preview: The Policy Gradient Theorem
The remarkable insight of the policy gradient theorem is that we can compute the gradient without knowing the environment dynamics. The gradient turns out to be:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$
This is an expectation over trajectories from our current policy, involving only:
- The gradient of the log-policy $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ (which we know)
- The returns (which we can observe)
We don’t need to know environment dynamics - we just sample trajectories and compute this estimate!
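As a preview of what such a sample-based estimate looks like in code, here is a minimal sketch for a hypothetical state-independent softmax policy over two actions; the `(action, return)` samples are invented for illustration.

```python
import numpy as np

def grad_log_softmax(theta, a):
    # grad_theta log pi(a) for pi = softmax(theta): one_hot(a) - pi
    e = np.exp(theta - theta.max())
    pi = e / e.sum()
    g = -pi
    g[a] += 1.0
    return g

theta = np.zeros(2)  # uniform policy
# Pretend these (action, return) pairs came from sampled trajectories
samples = [(0, 1.0), (1, 0.5), (0, 2.0), (1, -1.0)]

# Average of grad log pi(a) * R over samples approximates grad_theta J(theta)
grad_estimate = np.mean(
    [grad_log_softmax(theta, a) * R for (a, R) in samples], axis=0
)
```

The estimate pushes probability toward actions that coincided with high returns - no dynamics model appears anywhere.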
Estimating the Objective
Even if we can’t compute the true gradient, we can estimate the objective from samples.
```python
import numpy as np
import torch

def estimate_objective(policy, env, n_episodes=100, gamma=0.99):
    """
    Estimate J(theta) by sampling episodes.

    Args:
        policy: A policy with .sample(state) method
        env: Gym-like environment
        n_episodes: Number of episodes to sample
        gamma: Discount factor

    Returns:
        Mean return (estimate of J(theta)) and its standard error
    """
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        episode_return = 0.0
        discount = 1.0
        while not done:
            # Sample action from policy
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            action = policy.sample(state_tensor)

            # Take step in environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Accumulate discounted return
            episode_return += discount * reward
            discount *= gamma
            state = next_state

        returns.append(episode_return)

    return np.mean(returns), np.std(returns) / np.sqrt(n_episodes)

# Note: This estimates J(theta), not its gradient.
# The gradient estimation is more involved (next chapter).
```

Finite Differences (A Naive Approach)
One way to estimate the gradient is finite differences: perturb each parameter slightly, measure how $J(\theta)$ changes, and use that as a gradient estimate.

$$\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon}$$

where $e_i$ is a unit vector in direction $i$.
This works but is extremely inefficient - you need at least two objective evaluations per parameter, and each evaluation requires many episodes.
```python
def finite_difference_gradient(policy, env, theta, epsilon=0.01, n_episodes=50):
    """
    Estimate gradient via finite differences.

    WARNING: Very slow! O(|theta| * n_episodes) environment interactions.
    """
    gradient = np.zeros_like(theta)
    for i in range(len(theta)):
        # Perturb positively
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        policy.set_params(theta_plus)
        J_plus = estimate_objective(policy, env, n_episodes)[0]

        # Perturb negatively
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        policy.set_params(theta_minus)
        J_minus = estimate_objective(policy, env, n_episodes)[0]

        # Central difference estimate
        gradient[i] = (J_plus - J_minus) / (2 * epsilon)

    # Restore original parameters
    policy.set_params(theta)
    return gradient
```

Finite differences are:
- Slow: $O(|\theta|)$ times slower than gradient-based methods
- High variance: Each estimate is noisy
- Not scalable: Neural networks have millions of parameters
The policy gradient theorem gives us a much better approach.
The Landscape of $J(\theta)$
The objective $J(\theta)$ is a function of potentially millions of parameters. Its landscape can be:
- Non-convex: Multiple local optima exist
- Plateaus: Regions where gradients are near-zero
- Cliffs: Regions where small changes cause large performance drops
Policy gradient methods navigate this landscape using stochastic gradient ascent. They don’t guarantee finding the global optimum, but they often find good local optima.
Consider a simple policy with two parameters $\theta = (\theta_1, \theta_2)$. The objective landscape might look like a bumpy surface with multiple peaks:
- Global maximum: The best possible policy
- Local maxima: Good policies that aren’t optimal
- Saddle points: Flat regions that aren’t optima
Gradient ascent will converge to a local maximum (or saddle point). Different initializations may find different optima.
Summary
The policy objective captures what we want to optimize:
- Objective: Expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$
- Goal: Find $\theta^* = \arg\max_\theta J(\theta)$
- Method: Gradient ascent $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$
- Challenge: Computing $\nabla_\theta J(\theta)$ when the expectation depends on $\theta$
The next chapter reveals how the policy gradient theorem solves this challenge, enabling us to estimate gradients from samples alone.