Chapter 201

Introduction to Policy-Based Methods

Discover a fundamentally different approach to RL: learning policies directly instead of value functions


What You'll Learn

  • Explain the fundamental difference between value-based and policy-based methods
  • Understand why parameterized policies are essential for continuous action spaces
  • Describe the intuition behind policy gradient optimization
  • Recognize the advantages and trade-offs of policy-based approaches
  • Implement a simple softmax policy with linear features

A Different Way to Think About Learning

In the Q-learning section, we learned to evaluate actions—computing $Q(s, a)$ to determine how good each action is. To act, we simply picked the action with the highest Q-value: $\arg\max_a Q(s, a)$.

This works brilliantly for many problems. But what if instead of asking “how good is this action?” we asked directly “what action should I take?”

That’s the essence of policy-based methods—learning to act, not just to evaluate.

Think about how humans learn skills. When learning to ride a bike, you don’t consciously compute the expected value of every possible handlebar angle. You develop an intuition—a policy—that directly maps what you feel to what you do.

Policy-based methods work the same way. Instead of learning a value function and deriving actions from it, we learn the policy itself.

Value-based approach:

  1. Learn how good each action is: $Q(s, a)$
  2. Pick the best one: $\arg\max_a Q(s, a)$

Policy-based approach:

  1. Learn what to do directly: $\pi(a|s)$
  2. Sample or pick from this policy

Why Do We Need Policy-Based Methods?

If Q-learning works, why learn a whole new approach? Let’s revisit a limitation we touched on in the frontiers chapter.

The Continuous Action Problem

Q-learning’s action selection requires finding $\arg\max_a Q(s, a)$—the action with the highest value. This is easy when you have 4 actions: just check all 4 and pick the winner.

But what if your actions are continuous? Imagine controlling a robot arm where each joint angle can be any real number. Now $\arg\max$ requires finding the maximum of a function over an infinite set—a costly optimization problem at every single step.

Policy-based methods sidestep this entirely. Instead of learning $Q(s, a)$ and optimizing over $a$, we learn a policy $\pi_\theta(a|s)$ that directly outputs actions. For continuous control, this might be a Gaussian distribution: $\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)$. We just sample from it—no optimization required!

Mathematical Details

The fundamental challenge with Q-learning for continuous actions:

$$a^* = \arg\max_{a \in \mathcal{A}} Q(s, a)$$

When $\mathcal{A} \subset \mathbb{R}^n$ is continuous, this optimization must be solved at every time step. Common approaches include:

  • Discretization (loses precision, exponential in dimensions)
  • Sampling-based optimization (expensive, approximate)
  • Learning a separate “actor” network (this leads to actor-critic!)

Policy-based methods avoid this by parameterizing the policy directly:

$$\pi_\theta(a|s) = \text{probability of action } a \text{ in state } s$$

For continuous actions, we typically use a Gaussian:

$$\pi_\theta(a|s) = \frac{1}{\sigma_\theta(s)\sqrt{2\pi}} \exp\left(-\frac{(a - \mu_\theta(s))^2}{2\sigma_\theta(s)^2}\right)$$

Sampling is trivial: $a \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)$.
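
To make the contrast concrete, here is a minimal sketch comparing the two ways of acting in a continuous action space. The quadratic `q_function`, the sampling budget, and the linear policy mean are illustrative placeholders, not part of any real environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_function(state, action):
    """Made-up Q-function over a 1-D continuous action, maximized at action = state.sum()."""
    return -(action - state.sum()) ** 2

state = np.array([0.3, -0.1, 0.5])

# Value-based acting: approximate argmax_a Q(s, a) by brute-force sampling.
candidates = rng.uniform(-5.0, 5.0, size=10_000)
q_values = np.array([q_function(state, a) for a in candidates])
best_action = candidates[np.argmax(q_values)]          # expensive and only approximate

# Policy-based acting: sample directly from a Gaussian policy N(mu(s), sigma^2).
theta_mean = np.array([1.0, 1.0, 1.0])                 # illustrative parameters
mu = theta_mean @ state                                # linear mean
sigma = 0.5
policy_action = rng.normal(mu, sigma)                  # one cheap sample, no optimization

print(f"argmax by sampling: {best_action:.3f}")
print(f"policy sample:      {policy_action:.3f}")
```

The first approach repeats a 10,000-evaluation search at every time step; the second draws a single sample.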

Beyond Continuous Actions

Even for discrete actions, policy-based methods offer advantages:

1. Stochastic policies can be optimal. In some problems—especially games with adversaries—the best strategy is inherently random. Think of rock-paper-scissors: a deterministic strategy can be exploited, but randomly choosing each option with probability 1/3 is unexploitable (see the short simulation after point 3 below).

2. Smoother optimization. Q-learning’s $\arg\max$ creates sharp discontinuities. A tiny change in Q-values can flip which action is selected, causing large changes in behavior. Policy gradients change action probabilities smoothly.

3. Simpler architecture. To act with Q-learning, you need to evaluate $Q(s, a)$ for all actions and find the maximum. With a policy network, you just forward-pass the state and get action probabilities directly.
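
To back up point 1, here is a quick rock-paper-scissors simulation. The payoff matrix is the standard one; the policies and round counts are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Payoff for us in rock-paper-scissors: rows = our action, cols = opponent's.
# 0 = rock, 1 = paper, 2 = scissors; +1 win, 0 draw, -1 loss.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def average_payoff(our_probs, opponent_action, n_rounds=100_000):
    """Average payoff when we sample from our policy against a fixed opponent."""
    ours = rng.choice(3, size=n_rounds, p=our_probs)
    return PAYOFF[ours, opponent_action].mean()

# A deterministic policy ("always rock") is exploited by the best response ("always paper").
print("always rock vs. always paper:", average_payoff([1.0, 0.0, 0.0], opponent_action=1))

# The uniform random policy breaks even against any fixed opponent.
for opp in range(3):
    print(f"uniform vs. opponent {opp}:", round(average_payoff([1/3, 1/3, 1/3], opp), 3))
```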

Parameterized Policies

The key to policy-based learning is representing the policy with learnable parameters $\theta$. We write this as $\pi_\theta(a|s)$.

A parameterized policy is simply a function that:

  1. Takes a state $s$ as input
  2. Outputs a probability distribution over actions
  3. Has adjustable parameters $\theta$ that we can tune

The simplest example is a softmax policy for discrete actions. Given preference scores $h(s, a, \theta)$ for each action, we convert them to probabilities:

$$\pi_\theta(a|s) = \frac{\exp(h(s, a, \theta))}{\sum_{a'} \exp(h(s, a', \theta))}$$

Higher preference → higher probability, but every action keeps some probability.

Softmax Policy Example

Implementation
import numpy as np

class SoftmaxPolicy:
    """
    Softmax policy for discrete actions.

    Uses linear preferences: h(s,a) = theta[a] @ features(s)
    """

    def __init__(self, n_features, n_actions, temperature=1.0):
        """
        Args:
            n_features: Dimension of state feature vector
            n_actions: Number of discrete actions
            temperature: Higher = more exploration, lower = more greedy
        """
        self.n_actions = n_actions
        self.temperature = temperature
        # Parameters: one weight vector per action
        self.theta = np.zeros((n_actions, n_features))

    def preferences(self, state_features):
        """Compute preference scores for each action."""
        return self.theta @ state_features  # Shape: (n_actions,)

    def probabilities(self, state_features):
        """Convert preferences to action probabilities via softmax."""
        h = self.preferences(state_features) / self.temperature
        # Subtract max for numerical stability
        h = h - np.max(h)
        exp_h = np.exp(h)
        return exp_h / np.sum(exp_h)

    def sample_action(self, state_features):
        """Sample an action from the policy."""
        probs = self.probabilities(state_features)
        return np.random.choice(self.n_actions, p=probs)

    def log_probability(self, state_features, action):
        """Compute log π(a|s) - needed for policy gradients."""
        probs = self.probabilities(state_features)
        return np.log(probs[action] + 1e-10)  # Small constant for stability

Let’s see this in action:

# Example: 3-dimensional state features, 4 actions
policy = SoftmaxPolicy(n_features=3, n_actions=4)

# Some example state features
state = np.array([1.0, 0.5, -0.3])

# Initially, all actions have equal probability (theta is zero)
print("Initial probabilities:", policy.probabilities(state))
# Output: [0.25, 0.25, 0.25, 0.25]

# Let's manually set preferences to favor action 2
policy.theta[2] = np.array([1.0, 0.0, 0.0])  # High weight on first feature

print("After adjustment:", policy.probabilities(state))
# Output: [0.155, 0.155, 0.535, 0.155]  # Action 2 now most likely
Mathematical Details

For a softmax policy with linear preferences $h(s, a, \theta) = \theta_a^\top \mathbf{x}(s)$, the gradient of the log-probability has a particularly nice form:

$$\nabla_\theta \log \pi_\theta(a|s) = \mathbf{x}(s, a) - \sum_{a'} \pi_\theta(a'|s)\, \mathbf{x}(s, a')$$

This is the feature vector for the chosen action minus the expected feature vector under the policy. Here $\mathbf{x}(s, a)$ denotes the stacked state-action feature vector that equals $\mathbf{x}(s)$ in the block belonging to action $a$ and is zero elsewhere. Intuitively: we’re moving toward actions we take and away from what we’d “typically” do.
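
As a sanity check, the sketch below compares this analytic gradient against a finite-difference estimate for a small linear softmax policy. Written per action block, the same gradient reads $(\mathbb{1}[b = a] - \pi_\theta(b|s))\,\mathbf{x}(s)$ for the weights of action $b$; the parameters and features here are arbitrary.

```python
import numpy as np

def softmax(h):
    h = h - np.max(h)
    e = np.exp(h)
    return e / e.sum()

def log_prob(theta, x, a):
    """log pi(a|s) for a linear softmax policy with h(s, a') = theta[a'] @ x."""
    return np.log(softmax(theta @ x)[a])

rng = np.random.default_rng(1)
n_actions, n_features = 4, 3
theta = rng.normal(size=(n_actions, n_features))    # arbitrary parameters
x = np.array([1.0, 0.5, -0.3])                      # arbitrary state features
a = 2                                               # chosen action

# Analytic gradient: feature vector of the chosen action minus the expected one,
# written per action block: grad[theta[b]] = (1[b == a] - pi(b|s)) * x(s).
pi = softmax(theta @ x)
analytic = (np.eye(n_actions)[a] - pi)[:, None] * x[None, :]

# Central finite differences over every parameter.
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(n_actions):
    for j in range(n_features):
        plus, minus = theta.copy(), theta.copy()
        plus[i, j] += eps
        minus[i, j] -= eps
        numeric[i, j] = (log_prob(plus, x, a) - log_prob(minus, x, a)) / (2 * eps)

print("max abs difference:", np.abs(analytic - numeric).max())  # should be tiny (~1e-8 or less)
```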

When we use neural networks, $h(s, a, \theta)$ is computed by the network, and we rely on automatic differentiation to compute gradients.

Gaussian Policies for Continuous Actions

For continuous actions, we parameterize a Gaussian (normal) distribution. The policy outputs:

  • Mean $\mu_\theta(s)$: the “intended” action
  • Standard deviation $\sigma_\theta(s)$: how much to explore around it

To act, we sample: $a = \mu_\theta(s) + \sigma_\theta(s) \cdot \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$.

This is the natural extension of softmax to continuous spaces—instead of probabilities over discrete choices, we have a probability density over a continuous range.

Implementation
import numpy as np

class GaussianPolicy:
    """
    Gaussian policy for continuous 1D actions.

    Mean is a linear function of state features.
    Log std is a learnable parameter (shared across states).
    """

    def __init__(self, n_features):
        # Parameters for mean: theta @ features(s)
        self.theta_mean = np.zeros(n_features)
        # Log standard deviation (learnable, but shared across states)
        self.log_std = 0.0  # Initial std = 1.0

    def mean(self, state_features):
        """Compute mean of action distribution."""
        return np.dot(self.theta_mean, state_features)

    def std(self):
        """Get standard deviation."""
        return np.exp(self.log_std)

    def sample_action(self, state_features):
        """Sample action from Gaussian policy."""
        mu = self.mean(state_features)
        sigma = self.std()
        return np.random.normal(mu, sigma)

    def log_probability(self, state_features, action):
        """Compute log probability of action."""
        mu = self.mean(state_features)
        sigma = self.std()
        # Log of Gaussian PDF
        return -0.5 * ((action - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
Note

In practice, we often use neural networks to compute $\mu_\theta(s)$ and $\log \sigma_\theta(s)$. The network architecture is similar to a Q-network, but the outputs are distribution parameters rather than action values.
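
Here is a minimal NumPy sketch of that idea; the class name, layer sizes, and initialization are made up for illustration, and a real implementation would use a deep learning framework so automatic differentiation can handle the gradients.

```python
import numpy as np

class TinyGaussianPolicyNet:
    """
    Illustrative network-style Gaussian policy head: one tanh hidden layer,
    a mean per action dimension, and a state-independent learnable log-std.
    """

    def __init__(self, n_features, n_hidden, n_action_dims, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W1 = self.rng.normal(scale=0.1, size=(n_hidden, n_features))
        self.b1 = np.zeros(n_hidden)
        self.W_mu = self.rng.normal(scale=0.1, size=(n_action_dims, n_hidden))
        self.b_mu = np.zeros(n_action_dims)
        self.log_std = np.zeros(n_action_dims)      # std starts at 1.0

    def distribution_params(self, state):
        """Forward pass: state -> (mu(s), sigma)."""
        hidden = np.tanh(self.W1 @ state + self.b1)
        mu = self.W_mu @ hidden + self.b_mu
        return mu, np.exp(self.log_std)

    def sample_action(self, state):
        mu, sigma = self.distribution_params(state)
        return self.rng.normal(mu, sigma)

net = TinyGaussianPolicyNet(n_features=4, n_hidden=16, n_action_dims=2)
print(net.sample_action(np.array([0.1, -0.2, 0.5, 0.0])))   # a 2-D continuous action
```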

The Policy Gradient Idea

Now we come to the central question: how do we improve a parameterized policy?

The high-level idea is simple:

  1. Define what “good” means: The expected total reward when following the policy
  2. Compute the gradient: How does expected reward change as we adjust $\theta$?
  3. Gradient ascent: Update $\theta$ to increase expected reward

This is just optimization—find the parameters that maximize performance. The challenge is step 2: computing the gradient of an expectation.

The Objective Function

Mathematical Details

We want to maximize the expected return when following policy $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

where:

  • $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$ is a trajectory
  • $R(\tau) = \sum_{t=0}^{T} \gamma^t r_{t+1}$ is the discounted return
  • The expectation is over trajectories sampled by following $\pi_\theta$

Our goal is to find: $\theta^* = \arg\max_\theta J(\theta)$

We’ll use gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
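
To make the optimization view concrete, here is a tiny self-contained example on a made-up two-armed bandit (one state, one step, so the return is just the reward). It estimates $J(\theta)$ by Monte Carlo rollouts and climbs it with a crude finite-difference gradient; this is only to illustrate the loop, and the next chapter derives the proper gradient estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_MEANS = np.array([1.0, 2.0])           # illustrative expected rewards per arm

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def estimate_J(theta, n_episodes=20_000):
    """Monte Carlo estimate of J(theta): average return under the softmax policy."""
    actions = rng.choice(2, size=n_episodes, p=softmax(theta))
    rewards = rng.normal(TRUE_MEANS[actions], 1.0)
    return rewards.mean()

theta = np.zeros(2)
alpha, eps = 1.0, 0.5
for _ in range(30):
    grad = np.zeros_like(theta)
    for i in range(2):                      # crude finite-difference gradient estimate
        step = np.eye(2)[i] * eps
        grad[i] = (estimate_J(theta + step) - estimate_J(theta - step)) / (2 * eps)
    theta += alpha * grad                   # gradient ascent on J(theta)

print("learned action probabilities:", softmax(theta))   # should clearly favor arm 1
```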

Why Is This Hard?

The tricky part is computing $\nabla_\theta J(\theta)$. The expectation is over trajectories, and trajectories depend on $\theta$ (because the policy determines which actions are taken).

It’s like trying to take the derivative of an average, where the things you’re averaging over change when you change the parameters.

Naively, you might think we need to differentiate through the environment dynamics—which we don’t know! Remarkably, the policy gradient theorem shows we can estimate this gradient using only samples, without knowing the dynamics.

Why can't we just differentiate through trajectories?

The expected return can be written as:

$$J(\theta) = \sum_\tau P(\tau | \theta)\, R(\tau)$$

where $P(\tau | \theta)$ is the probability of trajectory $\tau$ under policy $\pi_\theta$.

If we naively differentiate:

$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta P(\tau | \theta) \cdot R(\tau)$$

The problem is that $P(\tau | \theta)$ involves the environment dynamics:

$$P(\tau | \theta) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$$

We can differentiate through $\pi_\theta$, but not through $p(s_{t+1}|s_t, a_t)$—we don’t even know this function!

The policy gradient theorem cleverly rewrites this gradient in a form that doesn’t require knowing the dynamics. We’ll derive this in the next chapter.

Comparing Value-Based and Policy-Based

Let’s summarize the key differences:

| Aspect | Value-Based (Q-Learning) | Policy-Based |
| --- | --- | --- |
| What we learn | $Q(s, a)$ values | $\pi_\theta(a\|s)$ directly |
| How we act | $\arg\max_a Q(s, a)$ | Sample from $\pi_\theta$ |
| Continuous actions | Requires optimization | Natural (sample from distribution) |
| Stochastic policies | Forced (ε-greedy for exploration) | Natural |
| Sample efficiency | Often better (off-policy possible) | Often worse (on-policy) |
| Variance | Lower (bootstrapping) | Higher (Monte Carlo) |
| Stability | Can be unstable (maximization bias) | Smoother optimization |

Neither approach is strictly better—they have different trade-offs. The best modern algorithms (like PPO, SAC) combine both ideas.

Summary

Key Takeaways

  • Policy-based methods learn a parameterized policy $\pi_\theta(a|s)$ directly, rather than learning value functions
  • Continuous actions are natural: output a Gaussian distribution and sample from it
  • Stochastic policies are first-class citizens, not just exploration tricks
  • Parameterized policies can use linear functions, neural networks, or any differentiable architecture
  • The objective is expected return $J(\theta)$; we improve by gradient ascent on $\theta$
  • The challenge is computing $\nabla_\theta J(\theta)$ when the expectation depends on $\theta$

We’ve seen why policy-based methods are valuable and what we’re trying to learn. The big question remains: how do we compute the gradient of expected return?

That’s exactly what the policy gradient theorem tells us—and it’s the focus of our next chapter.

Next Chapter: The Policy Gradient Theorem and REINFORCE

Exercises

Conceptual Questions

  1. Explain the continuous action problem. Why does $\arg\max_a Q(s, a)$ become difficult when actions are continuous? Describe a concrete example.

  2. What does it mean for a policy to be “parameterized”? Give an example of a parameterized policy with 4 actions and explain what the parameters control.

  3. When might a stochastic policy be genuinely optimal? Give an example problem where the best strategy is inherently random.

  4. Compare exploration in Q-learning vs. policy gradients. How does each approach handle the exploration-exploitation trade-off?

Coding Challenges

  1. Implement a softmax policy that takes a 4-dimensional state feature vector and outputs probabilities over 3 actions. Test it by:

    • Verifying probabilities sum to 1
    • Showing that changing parameters changes which action is most likely
    • Demonstrating how temperature affects the probability distribution
  2. Visualize a Gaussian policy. Create a plot showing:

    • The mean action as a function of a 1D state
    • The standard deviation as a shaded region around the mean
    • How changing $\theta$ changes the policy

Exploration

  1. Temperature in softmax policies. Experiment with different temperature values (0.1, 1.0, 10.0) in a softmax policy. What happens as temperature approaches 0? As it approaches infinity? When might you want each extreme?