Introduction to Policy-Based Methods
What You'll Learn
- Explain the fundamental difference between value-based and policy-based methods
- Understand why parameterized policies are essential for continuous action spaces
- Describe the intuition behind policy gradient optimization
- Recognize the advantages and trade-offs of policy-based approaches
- Implement a simple softmax policy with linear features
A Different Way to Think About Learning
In the Q-learning section, we learned to evaluate actions—computing $Q(s, a)$ to determine how good each action is. To act, we simply picked the action with the highest Q-value: $a^* = \arg\max_a Q(s, a)$.
This works brilliantly for many problems. But what if instead of asking “how good is this action?” we asked directly “what action should I take?”
That’s the essence of policy-based methods—learning to act, not just to evaluate.
Think about how humans learn skills. When learning to ride a bike, you don’t consciously compute the expected value of every possible handlebar angle. You develop an intuition—a policy—that directly maps what you feel to what you do.
Policy-based methods work the same way. Instead of learning a value function and deriving actions from it, we learn the policy itself.
Value-based approach:
- Learn how good each action is: $Q(s, a)$
- Pick the best one: $a^* = \arg\max_a Q(s, a)$
Policy-based approach:
- Learn what to do directly: $\pi_\theta(a \mid s)$
- Sample (or pick) an action from this policy: $a \sim \pi_\theta(\cdot \mid s)$
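To make the contrast concrete, here is a minimal sketch of how acting differs in the two approaches. The Q-values and action probabilities below are made up for illustration, not learned:

```python
import numpy as np

# Value-based acting: evaluate Q(s, a) for every action, then take the argmax.
q_values = np.array([1.2, 0.7, 2.1, 0.3])      # hypothetical Q(s, a) for 4 actions
greedy_action = int(np.argmax(q_values))        # -> 2

# Policy-based acting: the policy already gives action probabilities; just sample.
action_probs = np.array([0.1, 0.2, 0.6, 0.1])   # hypothetical pi(a | s)
sampled_action = np.random.choice(len(action_probs), p=action_probs)

print("Greedy action:", greedy_action)
print("Sampled action:", sampled_action)
```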
Why Do We Need Policy-Based Methods?
If Q-learning works, why learn a whole new approach? Let’s revisit a limitation we touched on in the frontiers chapter.
The Continuous Action Problem
Q-learning’s action selection requires finding $\arg\max_a Q(s, a)$—the action with the highest value. This is easy when you have 4 actions: just check all 4 and pick the winner.
But what if your actions are continuous? Imagine controlling a robot arm where each joint angle can be any real number. Now $\arg\max_a Q(s, a)$ requires finding the maximum of a function over an infinite set—a costly optimization problem at every single step.
Policy-based methods sidestep this entirely. Instead of learning $Q(s, a)$ and optimizing over $a$, we learn a policy $\pi_\theta(a \mid s)$ that directly outputs actions. For continuous control, this might be a Gaussian distribution: $\pi_\theta(a \mid s) = \mathcal{N}\big(a;\, \mu_\theta(s), \sigma_\theta(s)^2\big)$. We just sample from it—no optimization required!
Mathematical Details
The fundamental challenge with Q-learning for continuous actions:

$$a^* = \arg\max_{a \in \mathcal{A}} Q(s, a)$$

When $\mathcal{A}$ is continuous, this optimization must be solved at every time step. Common approaches include:
- Discretization (loses precision, exponential in dimensions)
- Sampling-based optimization (expensive, approximate)
- Learning a separate “actor” network (this leads to actor-critic!)
Policy-based methods avoid this by parameterizing the policy directly:

$$a \sim \pi_\theta(\cdot \mid s)$$

For continuous actions, we typically use a Gaussian:

$$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\, \mu_\theta(s), \sigma_\theta(s)^2\big)$$

Sampling is trivial: $a = \mu_\theta(s) + \sigma_\theta(s)\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$.
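As a rough illustration of the difference, the sketch below compares brute-force maximization of a stand-in Q-function over a discretized 1D action space with simply sampling from a Gaussian policy. The quadratic Q-function and the Gaussian parameters are made up for this example:

```python
import numpy as np

def q_function(state, action):
    """Stand-in for a learned Q(s, a): a quadratic peaking at a = 0.37."""
    return -(action - 0.37) ** 2

state = 0.0  # placeholder state

# Value-based route: approximate argmax_a Q(s, a) by checking many candidates.
# In d action dimensions, a grid like this grows exponentially.
candidates = np.linspace(-2.0, 2.0, 10_000)        # 10,000 Q evaluations per step
best_action = candidates[np.argmax(q_function(state, candidates))]

# Policy-based route: the policy outputs distribution parameters; acting is one sample.
mu, sigma = 0.35, 0.1                               # pretend these came from pi_theta(s)
sampled_action = np.random.normal(mu, sigma)

print("Argmax over grid:", best_action)
print("Sampled from policy:", sampled_action)
```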
Beyond Continuous Actions
Even for discrete actions, policy-based methods offer advantages:
1. Stochastic policies can be optimal. In some problems—especially games with adversaries—the best strategy is inherently random. Think of rock-paper-scissors: a deterministic strategy can be exploited, but randomly choosing each option with probability 1/3 is unexploitable.
2. Smoother optimization. Q-learning’s $\arg\max$ creates sharp discontinuities. A tiny change in Q-values can flip which action is selected, causing large changes in behavior. Policy gradients change action probabilities smoothly (see the sketch after this list).
3. Simpler architecture. To act with Q-learning, you need to evaluate $Q(s, a)$ for all actions and find the maximum. With a policy network, you just forward-pass the state and get action probabilities directly.
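Here is a small sketch of point 2, using made-up Q-values: a tiny perturbation to one Q-value flips the greedy action entirely, while the corresponding softmax probabilities barely move.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

q_before = np.array([1.00, 1.01, 0.50])               # made-up Q-values
q_after = q_before + np.array([0.02, 0.00, 0.00])     # nudge one value slightly

# Greedy (argmax) behavior flips from action 1 to action 0.
print("Greedy before/after:", np.argmax(q_before), np.argmax(q_after))

# Softmax probabilities shift by less than 0.01 per action.
print("Softmax before:", softmax(q_before))
print("Softmax after: ", softmax(q_after))
```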
Policy-based methods aren’t strictly better than value-based methods. They tend to have higher variance (we’ll see why soon) and can be less sample-efficient. The best modern algorithms often combine both approaches—we’ll see this with actor-critic methods.
Parameterized Policies
The key to policy-based learning is representing the policy with learnable parameters $\theta$. We write this as $\pi_\theta(a \mid s)$: the probability of taking action $a$ in state $s$, given parameters $\theta$.
A parameterized policy is simply a function that:
- Takes a state as input
- Outputs a probability distribution over actions
- Has adjustable parameters that we can tune
The simplest example is a softmax policy for discrete actions. Given preference scores $h_\theta(s, a)$ for each action, we convert them to probabilities:

$$\pi_\theta(a \mid s) = \frac{e^{h_\theta(s, a)}}{\sum_{a'} e^{h_\theta(s, a')}}$$
Higher preference → higher probability, but every action keeps some probability.
Softmax Policy Example
Implementation
```python
import numpy as np

class SoftmaxPolicy:
    """
    Softmax policy for discrete actions.
    Uses linear preferences: h(s, a) = theta[a] @ features(s)
    """

    def __init__(self, n_features, n_actions, temperature=1.0):
        """
        Args:
            n_features: Dimension of state feature vector
            n_actions: Number of discrete actions
            temperature: Higher = more exploration, lower = more greedy
        """
        self.n_actions = n_actions
        self.temperature = temperature
        # Parameters: one weight vector per action
        self.theta = np.zeros((n_actions, n_features))

    def preferences(self, state_features):
        """Compute preference scores for each action."""
        return self.theta @ state_features  # Shape: (n_actions,)

    def probabilities(self, state_features):
        """Convert preferences to action probabilities via softmax."""
        h = self.preferences(state_features) / self.temperature
        # Subtract max for numerical stability
        h = h - np.max(h)
        exp_h = np.exp(h)
        return exp_h / np.sum(exp_h)

    def sample_action(self, state_features):
        """Sample an action from the policy."""
        probs = self.probabilities(state_features)
        return np.random.choice(self.n_actions, p=probs)

    def log_probability(self, state_features, action):
        """Compute log π(a|s) - needed for policy gradients."""
        probs = self.probabilities(state_features)
        return np.log(probs[action] + 1e-10)  # Small constant for stability
```

Let's see this in action:
```python
# Example: 3-dimensional state features, 4 actions
policy = SoftmaxPolicy(n_features=3, n_actions=4)

# Some example state features
state = np.array([1.0, 0.5, -0.3])

# Initially, all actions have equal probability (theta is zero)
print("Initial probabilities:", policy.probabilities(state))
# Output: [0.25, 0.25, 0.25, 0.25]

# Let's manually set preferences to favor action 2
policy.theta[2] = np.array([1.0, 0.0, 0.0])  # High weight on first feature
print("After adjustment:", policy.probabilities(state))
# Output: [0.175, 0.175, 0.475, 0.175]  # Action 2 now most likely
```

Mathematical Details
For a softmax policy with linear preferences $h_\theta(s, a) = \theta^\top x(s, a)$, the gradient of the log-probability has a particularly nice form:

$$\nabla_\theta \log \pi_\theta(a \mid s) = x(s, a) - \sum_{a'} \pi_\theta(a' \mid s)\, x(s, a')$$
This is the feature vector for the chosen action minus the expected feature vector under the policy. Intuitively: we’re moving toward actions we take and away from what we’d “typically” do.
When we use neural networks, $h_\theta(s, a)$ is computed by the network, and we rely on automatic differentiation to compute gradients.
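As a sanity check, the sketch below compares the analytic gradient against a finite-difference estimate, using the SoftmaxPolicy class from above (assumed to be in scope, with temperature 1). For that per-action-weight parameterization, the formula works out to $(\mathbb{1}[b = a] - \pi_\theta(b \mid s))\,\phi(s)$ for the row $\theta_b$:

```python
import numpy as np

# Assumes the SoftmaxPolicy class defined above is in scope (temperature=1.0).
policy = SoftmaxPolicy(n_features=3, n_actions=4)
policy.theta = np.random.randn(4, 3) * 0.1

phi = np.array([1.0, 0.5, -0.3])
action = 2

# Analytic gradient: row b is (1[b == a] - pi(b|s)) * phi(s).
probs = policy.probabilities(phi)
indicator = np.zeros(4)
indicator[action] = 1.0
grad_analytic = np.outer(indicator - probs, phi)

# Central finite differences on log pi(a|s) with respect to each theta[i, j].
eps = 1e-6
grad_numeric = np.zeros_like(policy.theta)
for i in range(4):
    for j in range(3):
        policy.theta[i, j] += eps
        up = policy.log_probability(phi, action)
        policy.theta[i, j] -= 2 * eps
        down = policy.log_probability(phi, action)
        policy.theta[i, j] += eps   # restore original value
        grad_numeric[i, j] = (up - down) / (2 * eps)

print("Max difference:", np.max(np.abs(grad_analytic - grad_numeric)))  # should be tiny
```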
Gaussian Policies for Continuous Actions
For continuous actions, we parameterize a Gaussian (normal) distribution. The policy outputs:
- Mean $\mu_\theta(s)$: the “intended” action
- Standard deviation $\sigma_\theta(s)$: how much to explore around it
To act, we sample: $a = \mu_\theta(s) + \sigma_\theta(s)\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$.
This is the natural extension of softmax to continuous spaces—instead of probabilities over discrete choices, we have a probability density over a continuous range.
Implementation
```python
import numpy as np

class GaussianPolicy:
    """
    Gaussian policy for continuous 1D actions.
    Mean is a linear function of state features.
    Log std is a learnable parameter (shared across states).
    """

    def __init__(self, n_features):
        # Parameters for mean: theta @ features(s)
        self.theta_mean = np.zeros(n_features)
        # Log standard deviation (learnable, but shared across states)
        self.log_std = 0.0  # Initial std = 1.0

    def mean(self, state_features):
        """Compute mean of action distribution."""
        return np.dot(self.theta_mean, state_features)

    def std(self):
        """Get standard deviation."""
        return np.exp(self.log_std)

    def sample_action(self, state_features):
        """Sample action from Gaussian policy."""
        mu = self.mean(state_features)
        sigma = self.std()
        return np.random.normal(mu, sigma)

    def log_probability(self, state_features, action):
        """Compute log probability of action."""
        mu = self.mean(state_features)
        sigma = self.std()
        # Log of Gaussian PDF
        return -0.5 * ((action - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
```

In practice, we often use neural networks to compute $\mu_\theta(s)$ and $\sigma_\theta(s)$. The network architecture is similar to a Q-network, but the outputs are distribution parameters rather than action values.
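Before moving on, here is a quick usage sketch of the GaussianPolicy class above, mirroring the softmax example (the weights and feature values are arbitrary):

```python
import numpy as np

# Assumes the GaussianPolicy class defined above is in scope.
policy = GaussianPolicy(n_features=3)
policy.theta_mean = np.array([0.5, -0.2, 0.1])   # arbitrary weights for illustration

state = np.array([1.0, 0.5, -0.3])

print("Mean action:", policy.mean(state))   # 0.5*1.0 - 0.2*0.5 + 0.1*(-0.3) = 0.37
print("Std:", policy.std())                 # exp(0.0) = 1.0

# Sampling gives actions scattered around the mean.
print("Samples:", [round(policy.sample_action(state), 2) for _ in range(5)])

# Log-density of a particular action under the current policy.
print("log pi(0.37 | s):", policy.log_probability(state, 0.37))
```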
The Policy Gradient Idea
Now we come to the central question: how do we improve a parameterized policy?
The high-level idea is simple:
1. Define what “good” means: the expected total reward when following the policy
2. Compute the gradient: how does expected reward change as we adjust $\theta$?
3. Gradient ascent: update $\theta$ to increase expected reward
This is just optimization—find the parameters that maximize performance. The challenge is step 2: computing the gradient of an expectation.
The Objective Function
Mathematical Details
We want to maximize the expected return when following policy $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$$

where:
- $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$ is a trajectory
- $R(\tau) = \sum_{t \ge 0} \gamma^t r_{t+1}$ is the discounted return
- The expectation is over trajectories sampled by following $\pi_\theta$

Our goal is to find:

$$\theta^* = \arg\max_\theta J(\theta)$$

We'll use gradient ascent:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
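To make $J(\theta)$ concrete, here is a small runnable sketch that estimates it by Monte Carlo on a made-up one-step problem (only action 2 pays reward 1), reusing the SoftmaxPolicy class from earlier. Pushing up the preference for the rewarding action visibly increases the estimated $J(\theta)$:

```python
import numpy as np

# Assumes the SoftmaxPolicy class defined above is in scope.
# Toy one-step "environment" (made up): only action 2 yields reward 1.
def toy_reward(action):
    return 1.0 if action == 2 else 0.0

def estimate_J(policy, state_features, n_episodes=5000):
    """Monte Carlo estimate of J(theta): average return over sampled episodes."""
    returns = [toy_reward(policy.sample_action(state_features)) for _ in range(n_episodes)]
    return float(np.mean(returns))

state = np.array([1.0, 0.5, -0.3])
policy = SoftmaxPolicy(n_features=3, n_actions=4)

print("J(theta), uniform policy:", estimate_J(policy, state))       # ~0.25

policy.theta[2] = np.array([2.0, 0.0, 0.0])   # raise the preference for action 2
print("J(theta), favoring action 2:", estimate_J(policy, state))    # ~0.71
```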
Why Is This Hard?
The tricky part is computing $\nabla_\theta J(\theta)$. The expectation is over trajectories, and trajectories depend on $\theta$ (because the policy determines which actions are taken).
It’s like trying to take the derivative of an average, where the things you’re averaging over change when you change the parameters.
Naively, you might think we need to differentiate through the environment dynamics—which we don’t know! Remarkably, the policy gradient theorem shows we can estimate this gradient using only samples, without knowing the dynamics.
Why can't we just differentiate through trajectories?
The expected return can be written as a sum (or integral) over all possible trajectories:

$$J(\theta) = \sum_\tau P(\tau; \theta)\, R(\tau)$$

where $P(\tau; \theta)$ is the probability of trajectory $\tau$ under policy $\pi_\theta$.

If we naively differentiate:

$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta P(\tau; \theta)\, R(\tau)$$

The problem is that $P(\tau; \theta)$ involves the environment dynamics:

$$P(\tau; \theta) = p(s_0) \prod_{t} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

We can differentiate through $\pi_\theta(a_t \mid s_t)$, but not through $p(s_{t+1} \mid s_t, a_t)$—we don’t even know this function!
The policy gradient theorem cleverly rewrites this gradient in a form that doesn’t require knowing the dynamics. We’ll derive this in the next chapter.
Comparing Value-Based and Policy-Based
Let’s summarize the key differences:
| Aspect | Value-Based (Q-Learning) | Policy-Based |
|---|---|---|
| What we learn | $Q(s, a)$ values | $\pi_\theta(a \mid s)$ directly |
| How we act | $\arg\max_a Q(s, a)$ | Sample from $\pi_\theta(a \mid s)$ |
| Continuous actions | Requires optimization | Natural (sample from distribution) |
| Stochastic policies | Forced (ε-greedy for exploration) | Natural |
| Sample efficiency | Often better (off-policy possible) | Often worse (on-policy) |
| Variance | Lower (bootstrapping) | Higher (Monte Carlo) |
| Stability | Can be unstable (maximization bias) | Smoother optimization |
Neither approach is strictly better—they have different trade-offs. The best modern algorithms (like PPO, SAC) combine both ideas.
Summary
Key Takeaways
- Policy-based methods learn a parameterized policy directly, rather than learning value functions
- Continuous actions are natural: output a Gaussian distribution and sample from it
- Stochastic policies are first-class citizens, not just exploration tricks
- Parameterized policies can use linear functions, neural networks, or any differentiable architecture
- The objective is expected return $J(\theta)$; we improve by gradient ascent on $\theta$
- The challenge is computing $\nabla_\theta J(\theta)$ when the expectation itself depends on $\theta$
We’ve seen why policy-based methods are valuable and what we’re trying to learn. The big question remains: how do we compute the gradient of expected return?
That’s exactly what the policy gradient theorem tells us—and it’s the focus of our next chapter.
Exercises
Conceptual Questions
- Explain the continuous action problem. Why does $\arg\max_a Q(s, a)$ become difficult when actions are continuous? Describe a concrete example.
- What does it mean for a policy to be “parameterized”? Give an example of a parameterized policy with 4 actions and explain what the parameters control.
- When might a stochastic policy be genuinely optimal? Give an example problem where the best strategy is inherently random.
- Compare exploration in Q-learning vs. policy gradients. How does each approach handle the exploration-exploitation trade-off?
Coding Challenges
- Implement a softmax policy that takes a 4-dimensional state feature vector and outputs probabilities over 3 actions. Test it by:
  - Verifying probabilities sum to 1
  - Showing that changing parameters changes which action is most likely
  - Demonstrating how temperature affects the probability distribution
- Visualize a Gaussian policy. Create a plot showing:
  - The mean action as a function of a 1D state
  - The standard deviation as a shaded region around the mean
  - How changing the parameters changes the policy
Exploration
- Temperature in softmax policies. Experiment with different temperature values (0.1, 1.0, 10.0) in a softmax policy. What happens as temperature approaches 0? As it approaches infinity? When might you want each extreme?