Trust Regions
Policy gradient methods are powerful but fragile. A single bad update can destroy a good policy, undoing thousands of steps of learning. Trust regions address this by constraining how much the policy can change in each update.
The Problem: Policy Collapse
Imagine you’ve trained an agent for hours and it’s finally performing well. Then you do one more update with a particularly unusual batch of data. Suddenly, performance crashes to zero.
This happens because the policy gradient tells you the direction to move, but not how far. A large step in the “right” direction can overshoot and land in a terrible region of policy space.
Unlike supervised learning where data is fixed, in RL the data distribution depends on the policy. A bad policy generates bad data, which leads to worse updates, creating a death spiral.
The Cliff Example
Consider a cliff-walking problem where the agent has learned to walk along a narrow safe path:
- A rare batch happens to contain mostly “go left” actions that led to rewards
- The policy update strongly reinforces “go left”
- Now the policy favors “go left” too much
- Agent walks off the cliff repeatedly
- New experience is all failures
- Policy never recovers
A single aggressive update destroyed a carefully learned policy.
Why Gradient Descent Isn’t Enough
In supervised learning, the loss landscape is fixed. A sufficiently small step against the gradient is guaranteed to decrease the loss.
In policy gradient RL, the objective is:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]$$
When we change $\theta$, we change $\pi_\theta$, which changes which trajectories are likely. The expectation is over a distribution that moves with us.
The gradient is only valid locally. If we step too far, we’re optimizing a stale estimate that no longer reflects reality.
Think of optimizing while hiking in fog:
- Supervised learning: The terrain is fixed. You can see a few feet ahead, take a step, look again. Gradual progress.
- Policy gradient: The terrain shifts when you move. A step that looks good based on your current view might drop you into a chasm that wasn’t there before.
Trust regions say: “Only take small steps, so the terrain can’t change too much between updates.”
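To make "direction, but not how far" concrete, here is a minimal sketch with a three-action softmax policy. The logits `theta` and the gradient estimate `grad` are made-up illustrative values, not taken from any real training run. The same direction applied with different step sizes yields very different policies.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical starting point: uniform policy over 3 actions.
theta = np.zeros(3)
# Hypothetical gradient estimate from one batch that happens to favor action 0.
grad = np.array([1.0, -0.5, -0.5])

def policy_after_step(step_size):
    """Action probabilities after one gradient-ascent step of the given size."""
    return softmax(theta + step_size * grad)

for step_size in (0.1, 1.0, 10.0):
    print(step_size, np.round(policy_after_step(step_size), 4))
```

With the small step the policy stays close to uniform. With the large step it becomes almost deterministic on action 0, so later batches contain almost no evidence about the other actions.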
The Surrogate Objective
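Policy gradient methods often use a surrogate objective for optimization:
$$L(\theta) = \mathbb{E}_{s, a \sim \pi_{\theta_\text{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)} \, A^{\pi_{\theta_\text{old}}}(s, a) \right]$$
This is the expected advantage under the old policy, weighted by the probability ratio $r(\theta) = \pi_\theta(a \mid s) / \pi_{\theta_\text{old}}(a \mid s)$.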
Key insight: We can compute this using data from the old policy (importance sampling).
Problem: This surrogate is only accurate when $\pi_\theta \approx \pi_{\theta_\text{old}}$. Large policy changes make the importance weights unreliable.
The surrogate objective lets us reuse old experience. But it’s based on the assumption that the old and new policies are similar.
When the ratio $r = 1$, both policies assign the sampled action the same probability. When $r = 2$, the new policy is twice as likely to take this action. When $r$ is far from 1, we’re extrapolating wildly and the surrogate becomes unreliable.
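A small numerical sketch of the importance-sampled surrogate may help here. The log-probabilities and advantages below are invented for illustration, and `surrogate_objective` is a hypothetical helper name. Each entry is the probability of one sampled action in its own state, under the old and new policies respectively.

```python
import numpy as np

def surrogate_objective(new_log_probs, old_log_probs, advantages):
    """Mean of ratio * advantage over (state, action) samples from the old policy."""
    ratios = np.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    return np.mean(ratios * advantages), ratios

# Hypothetical batch of 4 sampled actions collected under the old policy.
old_log_probs = np.log(np.array([0.5, 0.25, 0.2, 0.05]))
advantages = np.array([1.0, -0.5, 0.3, 2.0])

# Evaluating the old policy against itself: every ratio is 1.
value_same, ratios_same = surrogate_objective(old_log_probs, old_log_probs, advantages)

# A hypothetical new policy that makes the last (rare) action 10x more likely.
new_log_probs = np.log(np.array([0.3, 0.1, 0.1, 0.5]))
value_new, ratios_new = surrogate_objective(new_log_probs, old_log_probs, advantages)

print("ratios (same policy):", ratios_same, "surrogate:", value_same)
print("ratios (new policy): ", ratios_new, "surrogate:", value_new)
```

In the second case a single sample with ratio 10 dominates the estimate. The surrogate looks much better, but that improvement rests almost entirely on one rarely observed action.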
TRPO: The Principled Solution
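TRPO maximizes the surrogate objective subject to a KL divergence constraint:
$$\max_\theta \; L(\theta) \quad \text{subject to} \quad \mathbb{E}_{s}\left[ D_\text{KL}\big(\pi_{\theta_\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \right] \le \delta$$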
This ensures the new policy stays close to the old one, where our surrogate objective is accurate.
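The KL divergence measures how different two distributions are:
$$D_\text{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
Properties:
- $D_\text{KL}(P \,\|\, Q) = 0$ when the policies are identical
- $D_\text{KL}(P \,\|\, Q) > 0$ otherwise
- Not symmetric: $D_\text{KL}(P \,\|\, Q) \neq D_\text{KL}(Q \,\|\, P)$ in general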
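The three properties can be checked numerically. This is a minimal sketch for categorical distributions with strictly positive probabilities; the two distributions `p` and `q` are arbitrary illustrative values.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for two categorical distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two hypothetical action distributions over 3 actions.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

kl_pp = kl_divergence(p, p)  # identical distributions
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)  # arguments swapped

print("KL(p||p) =", kl_pp)
print("KL(p||q) =", kl_pq)
print("KL(q||p) =", kl_qp)
```

The first value is zero, the other two are positive, and they differ from each other, which is why the order of the arguments in the constraint matters.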
The constraint creates a “trust region” around the current policy.
TRPO says: “Improve the policy as much as possible, but don’t let it get more than $\delta$ away from where it started.”
The KL constraint is a hard boundary. Inside the trust region, we optimize freely. At the boundary, we stop.
Why TRPO Is Complex
TRPO solves the constrained optimization problem using:
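- Conjugate gradients: Approximate the natural gradient direction without forming or inverting the full Fisher Information Matrix
- Line search: Backtrack from the largest proposed step until the KL constraint is satisfied
- Fisher Information Matrix: Second-order (curvature) information about how the policy distribution changes with its parameters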
This is mathematically elegant but:
- Requires second-order derivatives (expensive)
- Needs careful implementation of conjugate gradients
- Line search adds computational overhead
- Hard to parallelize with standard deep learning libraries
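To give a feel for the conjugate gradient step listed above, here is a minimal NumPy sketch. It solves $Fx = g$ using only matrix-vector products, then scales the result so that the quadratic approximation of the KL, $\tfrac{1}{2} \Delta\theta^\top F \Delta\theta$, equals the limit. The 3x3 matrix `F` is a made-up stand-in for a Fisher matrix, and `g` and `max_kl` are illustrative values. A real implementation would compute Fisher-vector products by automatic differentiation instead of storing `F`.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b for symmetric positive-definite A,
    using only products matvec(v) = A @ v."""
    x = np.zeros_like(b)
    r = b.copy()   # residual b - A x (x starts at zero)
    p = r.copy()   # current search direction
    rs_old = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Hypothetical small symmetric positive-definite stand-in for the Fisher matrix.
F = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])
g = np.array([1.0, -0.5, 0.3])  # hypothetical policy gradient
max_kl = 0.01

# Natural gradient direction: x such that F x = g.
direction = conjugate_gradient(lambda v: F @ v, g)

# Scale so that 0.5 * step^T F step == max_kl (quadratic KL approximation).
step_size = np.sqrt(2 * max_kl / (direction @ F @ direction))
step = step_size * direction
print("direction:", direction)
print("approx KL of step:", 0.5 * step @ F @ step)
```

Because the quadratic form is only an approximation of the true KL, the scaled step can still violate the constraint, which is why a line search follows.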
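# TRPO pseudocode (simplified); the helper functions are placeholders
import copy

def trpo_update(policy, states, actions, advantages, max_kl=0.01):
    """
    TRPO update - conceptual outline.
    Real implementations are much more complex.
    """
    # Snapshot the current policy: the ratio and the KL are measured against it
    old_policy = copy.deepcopy(policy)
    old_params = policy.get_parameters()

    # 1. Compute policy gradient of the surrogate objective.
    # compute_surrogate_loss returns the negated surrogate, so flip the sign
    # to get the quantity we want to maximize.
    surrogate = -compute_surrogate_loss(policy, states, actions, advantages)
    g = compute_gradient(surrogate, policy.parameters())

    # 2. Compute natural gradient using conjugate gradients
    # This needs only Fisher-vector products, never the full Fisher matrix
    Hv = lambda v: fisher_vector_product(policy, states, v)
    natural_gradient = conjugate_gradient(Hv, g)

    # 3. Compute the largest step size allowed by the KL constraint
    step_size = compute_step_size(natural_gradient, Hv, max_kl)

    # 4. Backtracking line search to ensure KL constraint is satisfied
    for fraction in [1.0, 0.5, 0.25, 0.125]:
        new_params = old_params + fraction * step_size * natural_gradient
        policy.set_parameters(new_params)
        if compute_kl(old_policy, policy, states) < max_kl:
            return
    # No candidate step satisfied the constraint: keep the old policy
    policy.set_parameters(old_params)

The Need for Simplicity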
TRPO is theoretically sound but practically cumbersome:
- 100+ lines of specialized optimization code
- Hard to integrate with standard PyTorch/TensorFlow optimizers
- Doesn’t parallelize well on GPUs
- Many hyperparameters for the inner optimization
The question became: Is there a simpler way to get similar results?
PPO answers this with a resounding “yes.”
Preview: PPO’s Approach
PPO replaces TRPO’s complex constrained optimization with a simple idea:
Instead of constraining the KL divergence, clip the probability ratio.
When the ratio gets too far from 1, stop the gradient.
- Ratio near 1: policies are similar, optimize normally
- Ratio far from 1: policies are too different, stop updating
This achieves similar trust-region behavior with:
- Standard gradient descent
- No second-order derivatives
- No line search
- Much simpler code
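As a preview, the clipping idea can be sketched in a few lines. The form below, the mean of `min(r * A, clip(r) * A)`, is the commonly used PPO clipped objective. The clipping range `clip_eps=0.2` and the sample ratios and advantages are assumed illustrative values, not something this section specifies.

```python
import numpy as np

def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """PPO-style clipped objective: mean of min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.mean(np.minimum(ratios * advantages, clipped * advantages))

# Hypothetical batch: two good actions (A > 0) and two bad ones (A < 0).
advantages = np.array([1.0, 1.0, -1.0, -1.0])
ratios = np.array([1.1, 3.0, 0.9, 0.2])

value = clipped_surrogate(ratios, advantages)

# Pushing the already-clipped samples even further changes nothing,
# so those samples contribute no gradient.
pushed = clipped_surrogate(np.array([1.1, 5.0, 0.9, 0.1]), advantages)
print(value, pushed)
```

The objective is flat once a ratio has moved past the clipping range in the direction the advantage favors, so plain gradient ascent has no incentive to push it further.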
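TRPO constraint: $\mathbb{E}_{s}\left[ D_\text{KL}(\pi_{\theta_\text{old}} \,\|\, \pi_\theta) \right] \le \delta$
PPO approximation: $1 - \epsilon \le \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)} \le 1 + \epsilon$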
The ratio constraint is a pointwise version of the KL constraint. If the ratio is bounded for all actions, the KL divergence is also bounded (though the converse isn’t true).
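The first half of that claim can be worked out in one line. This is a sketch under the assumption that the ratio bound holds for every action in a given state, with $0 < \epsilon < 1$; the symbols $\epsilon$ and $\pi_{\theta_\text{old}}$ follow the usual PPO notation.

```latex
\text{If } 1 - \epsilon \le \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)} \le 1 + \epsilon \text{ for all } a, \text{ then}

D_\text{KL}\big(\pi_{\theta_\text{old}} \,\|\, \pi_\theta\big)
  = \sum_a \pi_{\theta_\text{old}}(a \mid s) \log \frac{\pi_{\theta_\text{old}}(a \mid s)}{\pi_\theta(a \mid s)}
  \le \sum_a \pi_{\theta_\text{old}}(a \mid s) \log \frac{1}{1 - \epsilon}
  = -\log(1 - \epsilon).
```

For example, $\epsilon = 0.2$ gives a bound of $-\log 0.8 \approx 0.223$. The bound is loose, but it shows that a pointwise ratio limit caps the KL divergence.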
Summary
Trust regions address the instability of policy gradient methods:
- Problem: Large policy updates can be catastrophic
- Cause: Gradient is only locally accurate; data distribution shifts with policy
- Solution: Constrain how much the policy can change
- TRPO: KL constraint with second-order optimization
- PPO: Simpler clipping approach (next section)
The next section shows exactly how PPO implements trust-region-like behavior with nothing more than clipping and standard gradient descent.