Why Learn Policies Directly?
In Q-learning and other value-based methods, we learned a value function and then derived a policy from it by picking the action with highest value. This works brilliantly for many problems. But what if we could skip the middleman and learn the policy directly?
The Value-Based Paradigm
Value-based methods follow a two-step process:
- Learn values: Compute $Q(s, a)$ for every state-action pair
- Derive policy: Pick the action with highest Q-value: $\pi(s) = \arg\max_a Q(s, a)$
This approach has a fundamental asymmetry: we care about the policy (what to do), but we learn something else (how good things are). Sometimes it makes more sense to learn what we actually want.
Think about how you learned to catch a ball. You didn’t compute the expected future “catching value” of every possible hand position. You developed a direct intuition - a policy - that maps what you see to how you move your hands.
In value-based RL, the policy is implicitly defined by the value function:

$$\pi(s) = \arg\max_a Q(s, a)$$
This requires:
- Computing or storing $Q(s, a)$ for all actions
- Finding the maximum over all actions at decision time
- A discrete action space (or expensive continuous optimization)
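For a small discrete action space, this derived greedy policy is only a few lines. A minimal sketch, assuming the Q-values for one state are stored in a NumPy array indexed by action (the values here are made up for illustration):

```python
import numpy as np

# Hypothetical Q-values for one state with 4 discrete actions
q_values = np.array([1.2, 0.7, 2.5, 2.5])

# Greedy policy: pick the action with the highest Q-value,
# breaking ties randomly among the maximizers
best = np.flatnonzero(q_values == q_values.max())
action = np.random.choice(best)
```

Note that even this simple case needs an explicit tie-breaking rule; the argmax alone does not define a unique action.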
The Continuous Action Problem
The most compelling reason for policy-based methods is handling continuous actions.
Imagine you’re controlling a robot arm. Each joint can be at any angle - not just “up” or “down,” but any value between 0 and 360 degrees. With 6 joints, you have a 6-dimensional continuous action space.
With Q-learning, to select an action you’d need to solve:

$$a^* = \arg\max_{a \in \mathbb{R}^6} Q(s, a)$$
This is a 6-dimensional optimization problem that must be solved at every single timestep. Not practical!
Policy-based methods sidestep this entirely. Instead of learning $Q(s, a)$ and searching for the best action, we learn a policy $\pi_\theta(a \mid s)$ that directly outputs actions. For continuous control, this might be a Gaussian distribution:

$$\pi_\theta(a \mid s) = \mathcal{N}\big(\mu_\theta(s), \sigma_\theta^2\big)$$
To act, we just sample from this distribution. No optimization required.
For continuous action spaces $\mathcal{A} \subseteq \mathbb{R}^n$, the argmax operation becomes an optimization problem:

$$a^* = \arg\max_{a \in \mathcal{A}} Q(s, a)$$
Common workarounds for value-based methods:
- Discretization: Convert continuous space to discrete bins (exponential in dimensions)
- Sampling: Sample random actions and pick the best (approximate, slow)
- Learned optimization: Train a separate network to output $\arg\max_a Q(s, a)$ (this is essentially actor-critic)
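To make the cost of the first two workarounds concrete, here is a rough sketch for a hypothetical 6-joint arm. The `q_function` below is a stand-in for a learned Q-network, not a real one:

```python
import numpy as np

def q_function(state, actions):
    # Stand-in for a learned Q-network: a smooth function of the action
    return -np.sum((actions - 0.5) ** 2, axis=-1)

state = np.zeros(4)

# Workaround 1: discretization. 10 bins per joint, 6 joints
# means 10^6 = 1,000,000 Q-evaluations per decision.
bins_per_joint, n_joints = 10, 6
n_discrete_actions = bins_per_joint ** n_joints

# Workaround 2: random sampling. Evaluate N random actions, keep the best.
# Only an approximate argmax, and still slow at every timestep.
candidates = np.random.uniform(0.0, 1.0, size=(1000, n_joints))
values = q_function(state, candidates)
best_action = candidates[np.argmax(values)]
```

The exponential blowup in workaround 1 and the per-step search in workaround 2 are exactly what a direct policy avoids.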
Policy-based methods avoid all of this by parameterizing the policy directly. A Gaussian policy is:

$$\pi_\theta(a \mid s) = \mathcal{N}\big(\mu_\theta(s), \sigma_\theta^2\big)$$

Sampling is trivial: $a = \mu_\theta(s) + \sigma_\theta \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
```python
import torch
import torch.nn as nn
import numpy as np


class GaussianPolicy(nn.Module):
    """
    Gaussian policy for continuous actions.

    Outputs mean and log_std of a Gaussian distribution.
    Sampling is done via the reparameterization trick.
    """

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        # Shared feature layers
        self.features = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        # Mean head
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # Log std (learnable parameter, shared across states)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        """Return mean and std of action distribution."""
        features = self.features(state)
        mean = self.mean_head(features)
        std = torch.exp(self.log_std)
        return mean, std

    def sample(self, state):
        """Sample action using reparameterization trick."""
        mean, std = self.forward(state)
        # Reparameterization: a = mean + std * noise
        noise = torch.randn_like(mean)
        action = mean + std * noise
        return action

    def log_prob(self, state, action):
        """Compute log probability of action."""
        mean, std = self.forward(state)
        # Gaussian log probability
        var = std ** 2
        log_prob = -0.5 * (((action - mean) ** 2) / var + torch.log(var) + np.log(2 * np.pi))
        return log_prob.sum(dim=-1)  # Sum over action dimensions


# Example usage
policy = GaussianPolicy(state_dim=4, action_dim=2)
state = torch.randn(1, 4)  # Batch of 1 state

# Get distribution parameters
mean, std = policy(state)
print(f"Mean action: {mean.detach().numpy()}")
print(f"Std: {std.detach().numpy()}")

# Sample an action
action = policy.sample(state)
print(f"Sampled action: {action.detach().numpy()}")
```

Beyond Continuous Actions
Even for discrete action spaces, policy-based methods offer advantages.
Stochastic Policies Can Be Optimal
In some problems, the best strategy is inherently random. Consider rock-paper-scissors:
- If you always play rock, your opponent exploits you by always playing paper
- If you play each option with probability 1/3, no opponent can exploit you
This is a mixed strategy - a stochastic policy. Value-based methods with argmax can only represent deterministic policies. Policy gradient methods naturally represent probability distributions.
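One way to see the exploitability gap is to compute each strategy's worst-case expected payoff against a best-responding opponent, using the standard win/lose/draw payoffs:

```python
import numpy as np

# Payoff matrix for us (the row player); rows/cols are rock, paper, scissors.
# Entry [i, j] = our payoff when we play i and the opponent plays j.
payoffs = np.array([
    [ 0, -1,  1],   # rock:     ties rock, loses to paper, beats scissors
    [ 1,  0, -1],   # paper:    beats rock, ties paper, loses to scissors
    [-1,  1,  0],   # scissors: loses to rock, beats paper, ties scissors
])

def worst_case(policy):
    """Expected payoff against the opponent's best response."""
    expected_vs_each = policy @ payoffs  # payoff vs. each pure opponent action
    return expected_vs_each.min()

always_rock = np.array([1.0, 0.0, 0.0])
uniform = np.array([1/3, 1/3, 1/3])
```

`worst_case(always_rock)` is -1 (the opponent just plays paper), while `worst_case(uniform)` is 0: the mixed strategy cannot be exploited.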
The Aliased GridWorld
Imagine a GridWorld where two different states look identical to the agent (same observation). In one state, going left is optimal; in the other, going right is optimal.
A deterministic policy must choose one direction - and will be wrong half the time. A stochastic policy that goes left with 50% probability does better on average.
This is called perceptual aliasing, and it arises whenever the agent doesn’t have full state information.
Smoother Optimization
Q-learning’s argmax creates sharp discontinuities. Imagine two actions have Q-values of 10.0 and 10.1:
- Q-learning: Always picks action 2 (probability 1.0)
- Small change in Q-values: If they flip to 10.1 and 10.0, behavior completely reverses
Policy gradients change probabilities smoothly:
- Instead of 0% vs 100%, you might have 45% vs 55%
- Small parameter changes cause small behavior changes
This smooth optimization landscape often makes learning more stable.
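The contrast is easy to see with a softmax over the same two Q-values. This is purely illustrative: policy gradient methods learn action logits directly rather than softmaxing Q-values.

```python
import numpy as np

def softmax(x):
    z = x - x.max()        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

q = np.array([10.0, 10.1])

# Argmax policy: all-or-nothing
greedy = np.zeros_like(q)
greedy[np.argmax(q)] = 1.0   # [0.0, 1.0]

# Softmax policy: probabilities shift smoothly with the values
probs = softmax(q)           # roughly [0.475, 0.525]
```

A 0.1 change in the values flips the greedy policy completely, but only nudges the softmax probabilities by a few percentage points.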
Simpler Decision-Making
With Q-learning, to act you must:
- Compute $Q(s, a)$ for every action
- Find the maximum
- Break ties somehow
With a policy network, you:
- Forward pass the state
- Sample from the output distribution
For large action spaces, the policy approach is more efficient. You don’t need to evaluate all actions to know which one to take.
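For discrete actions, the two-step procedure is just a forward pass plus a categorical sample. A minimal sketch in the same style as the Gaussian policy above (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Policy network for discrete actions: outputs logits over actions."""

    def __init__(self, state_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, n_actions),
        )

    def act(self, state):
        """Forward pass, then sample -- no max over actions needed."""
        logits = self.net(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        return action, dist.log_prob(action)

policy = CategoricalPolicy(state_dim=4, n_actions=3)
state = torch.randn(1, 4)
action, log_prob = policy.act(state)
```

The sampled `log_prob` is kept around because, as later sections show, it is exactly what the policy gradient needs.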
Comparing the Approaches
| Aspect | Value-Based | Policy-Based |
|---|---|---|
| What we learn | $Q(s, a)$ values | $\pi_\theta(a \mid s)$ directly |
| How we act | $\arg\max_a Q(s, a)$ | Sample from $\pi_\theta(a \mid s)$ |
| Continuous actions | Requires optimization | Natural |
| Stochastic policies | Via exploration only | First-class |
| Sample efficiency | Often better | Often worse |
| Stability | Can be unstable | Smoother gradients |
Policy-based methods aren’t always better. They typically have:
- Higher variance: Gradient estimates are noisy
- Lower sample efficiency: On-policy learning wastes data
- Local optima risk: May converge to suboptimal policies
The best modern algorithms often combine both approaches - we’ll see this with actor-critic methods.
When to Use Policy-Based Methods
Choose policy-based methods when:
- Actions are continuous (robotics, control)
- Stochastic policies are valuable (partial observability, games)
- You want stable, smooth optimization
- The action space is large
Choose value-based methods when:
- Actions are discrete and few
- Sample efficiency matters (off-policy methods like DQN)
- You need good exploration (optimistic initialization, UCB)
Choose both (actor-critic) when:
- You want the best of both worlds
- Stability and efficiency both matter
- You’re working on a modern deep RL problem
Summary
Policy-based methods represent a fundamental shift in how we think about RL:
- Instead of: Learn values, derive policy
- We: Learn the policy directly
This shift opens up continuous action spaces, enables stochastic policies, and often provides smoother optimization. The key challenge - which we’ll address in coming sections - is computing the gradient of expected return.