Policy Gradient Methods • Part 2 of 3
📝Draft

Stochastic Policies

Probability distributions over actions

Stochastic Policies

A stochastic policy outputs a probability distribution over actions rather than a single action. Instead of “always do X,” the policy says “do X with probability 70%, Y with probability 30%.”

This is more than just an exploration trick - stochastic policies are fundamental to policy gradient methods and sometimes represent the truly optimal behavior.

What Is a Stochastic Policy?

📖Stochastic Policy

A stochastic policy $\pi(a|s)$ is a mapping from states to probability distributions over actions:

$$\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$$

For each state $s$, $\pi(\cdot|s)$ is a probability distribution satisfying:

  • $\pi(a|s) \geq 0$ for all actions $a$
  • $\sum_{a \in \mathcal{A}} \pi(a|s) = 1$ (discrete) or $\int_{\mathcal{A}} \pi(a|s)\, da = 1$ (continuous)

Think of a stochastic policy as a “soft” decision-maker. Rather than committing absolutely to one action, it hedges its bets:

“Based on what I see, I’m pretty confident action A is best (60%), but B might work (30%), and there’s a small chance C is actually right (10%).”

This uncertainty can come from:

  • Exploration: Trying new things to learn
  • Genuine uncertainty: When you’re not sure what’s best
  • Optimality: When randomness is actually the best strategy
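Before any learning machinery, the idea can be made concrete with a hand-written table of probabilities. A minimal sketch (the 60/30/10 numbers are illustrative, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stochastic policy for a single state: a distribution over actions A, B, C.
# The numbers mirror the 60/30/10 example above and are purely illustrative.
policy = {"A": 0.6, "B": 0.3, "C": 0.1}

actions = list(policy.keys())
probs = list(policy.values())

# Sample many actions: empirical frequencies approach the stated probabilities.
samples = rng.choice(actions, size=10_000, p=probs)
counts = {a: int((samples == a).sum()) for a in actions}
print(counts)
```

Sampling many times recovers the probabilities, which is exactly how a learned stochastic policy behaves at rollout time.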

Softmax Policies for Discrete Actions

The most common way to represent a stochastic policy over discrete actions is the softmax function.

Mathematical Details

Given preference values $h(s, a, \theta)$ for each action (also called logits), the softmax policy is:

$$\pi_\theta(a|s) = \frac{\exp(h(s, a, \theta))}{\sum_{a' \in \mathcal{A}} \exp(h(s, a', \theta))}$$

Properties:

  • All probabilities are positive: $\pi_\theta(a|s) > 0$
  • Probabilities sum to 1
  • Higher preference means higher probability
  • Differentiable with respect to $\theta$

Common choices for $h(s, a, \theta)$:

  • Linear: $h(s, a, \theta) = \theta_a^\top \phi(s)$ where $\phi(s)$ is a feature vector
  • Neural network: $h(s, a, \theta) = f_\theta(s)[a]$ where $f_\theta$ outputs logits for all actions
</>Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class SoftmaxPolicy(nn.Module):
    """
    Softmax policy for discrete actions using a neural network.

    Takes states as input, outputs action probabilities.
    """

    def __init__(self, state_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.n_actions = n_actions

        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions)  # Outputs logits (preferences)
        )

    def forward(self, state):
        """Compute action probabilities."""
        logits = self.network(state)
        probs = F.softmax(logits, dim=-1)
        return probs

    def get_distribution(self, state):
        """Return a Categorical distribution for sampling."""
        probs = self.forward(state)
        return torch.distributions.Categorical(probs)

    def sample(self, state):
        """Sample an action from the policy."""
        dist = self.get_distribution(state)
        action = dist.sample()
        return action.item()

    def log_prob(self, state, action):
        """Compute log probability of an action."""
        dist = self.get_distribution(state)
        return dist.log_prob(action)


# Example: 4-dimensional state, 3 possible actions
policy = SoftmaxPolicy(state_dim=4, n_actions=3)

# Get probabilities for a state
state = torch.tensor([[1.0, 0.5, -0.3, 0.8]])
probs = policy(state)
print(f"Action probabilities: {probs.detach().numpy()}")

# Sample multiple actions to see the distribution
actions = [policy.sample(state) for _ in range(1000)]
print(f"Action counts: {np.bincount(actions, minlength=3)}")

The Temperature Parameter

The softmax function has an implicit temperature parameter (fixed at 1 in the form above) that controls how “peaked” the distribution is.

Temperature works like this:

  • High temperature: More uniform distribution (more exploration)
  • Low temperature: More peaked distribution (more exploitation)
  • Temperature approaches 0: Becomes deterministic (always picks highest preference)
  • Temperature approaches infinity: Becomes uniform random

It’s like adjusting how “confident” the policy is in its preferences.

Mathematical Details

The softmax with explicit temperature $\tau$:

$$\pi_\theta(a|s) = \frac{\exp(h(s, a, \theta) / \tau)}{\sum_{a'} \exp(h(s, a', \theta) / \tau)}$$

As $\tau \to 0$: $\pi_\theta \to \mathbf{1}_{a = \arg\max_{a'} h(s, a', \theta)}$ (deterministic)

As $\tau \to \infty$: $\pi_\theta \to \text{Uniform}(\mathcal{A})$ (uniform random)

</>Implementation
def softmax_with_temperature(logits, temperature=1.0):
    """
    Apply softmax with temperature scaling.

    Args:
        logits: Unnormalized log-probabilities (preferences)
        temperature: Higher = more uniform, lower = more peaked

    Returns:
        Probability distribution over actions
    """
    scaled_logits = logits / temperature
    # Subtract max for numerical stability
    scaled_logits = scaled_logits - scaled_logits.max(dim=-1, keepdim=True)[0]
    exp_logits = torch.exp(scaled_logits)
    return exp_logits / exp_logits.sum(dim=-1, keepdim=True)


# Demonstrate temperature effects
logits = torch.tensor([[2.0, 1.0, 0.5]])  # Prefer action 0

print("Temperature effects on action probabilities:")
for temp in [0.1, 0.5, 1.0, 2.0, 10.0]:
    probs = softmax_with_temperature(logits, temp)
    print(f"  T={temp:4.1f}: {probs.numpy().round(3)}")

# Output (approximately):
#   T= 0.1: [[1.    0.    0.   ]]
#   T= 0.5: [[0.844 0.114 0.042]]
#   T= 1.0: [[0.629 0.231 0.14 ]]
#   T= 2.0: [[0.481 0.292 0.227]]
#   T=10.0: [[0.362 0.327 0.311]]

Gaussian Policies for Continuous Actions

For continuous action spaces, we typically use Gaussian (Normal) policies.

Mathematical Details

A diagonal Gaussian policy outputs a mean and standard deviation for each action dimension:

$$\pi_\theta(a|s) = \mathcal{N}(a;\, \mu_\theta(s), \sigma_\theta(s)^2)$$

For a single action dimension:

$$\pi_\theta(a|s) = \frac{1}{\sigma_\theta(s)\sqrt{2\pi}} \exp\left(-\frac{(a - \mu_\theta(s))^2}{2\sigma_\theta(s)^2}\right)$$

The log-probability (used in policy gradients) is:

$$\log \pi_\theta(a|s) = -\frac{(a - \mu_\theta(s))^2}{2\sigma_\theta(s)^2} - \log \sigma_\theta(s) - \frac{1}{2}\log(2\pi)$$
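As a sanity check, the closed-form log-probability can be compared against PyTorch's built-in `Normal` distribution (the values for $\mu$, $\sigma$, and $a$ below are arbitrary):

```python
import math
import torch

# Arbitrary illustrative values for one action dimension.
mu = torch.tensor(0.5)
sigma = torch.tensor(0.8)
a = torch.tensor(1.2)

# Log-probability from the closed-form expression.
manual = (
    -((a - mu) ** 2) / (2 * sigma ** 2)
    - torch.log(sigma)
    - 0.5 * math.log(2 * math.pi)
)

# Log-probability from PyTorch's built-in Normal distribution.
builtin = torch.distributions.Normal(mu, sigma).log_prob(a)

print(manual.item(), builtin.item())  # the two values agree
```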

A Gaussian policy says: “I think the action should be around $\mu(s)$, but I’m not certain - the action could be anywhere within a few $\sigma(s)$ of that.”

  • Mean $\mu_\theta(s)$: The “intended” action
  • Standard deviation $\sigma_\theta(s)$: How much to explore around it

As training progresses, $\sigma$ often decreases - the policy becomes more confident and exploits more.

</>Implementation
import torch
import torch.nn as nn
import numpy as np

class DiagonalGaussianPolicy(nn.Module):
    """
    Gaussian policy with diagonal covariance for continuous actions.

    Common design choices:
    - Mean is a function of state (neural network)
    - Log-std can be state-dependent or a learned constant
    """

    def __init__(self, state_dim, action_dim, hidden_dim=64,
                 log_std_min=-20, log_std_max=2):
        super().__init__()
        self.action_dim = action_dim
        self.log_std_min = log_std_min
        self.log_std_max = log_std_max

        # Shared features
        self.features = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )

        # Mean head
        self.mean_head = nn.Linear(hidden_dim, action_dim)

        # Log std as a learnable parameter (state-independent)
        # Alternative: make it state-dependent with another head
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        """Return mean and std of action distribution."""
        features = self.features(state)
        mean = self.mean_head(features)

        # Clamp log_std for numerical stability
        log_std = torch.clamp(self.log_std, self.log_std_min, self.log_std_max)
        std = torch.exp(log_std)

        return mean, std

    def get_distribution(self, state):
        """Return a Normal distribution for sampling."""
        mean, std = self.forward(state)
        return torch.distributions.Normal(mean, std)

    def sample(self, state, deterministic=False):
        """
        Sample action from policy.

        Args:
            state: Current state
            deterministic: If True, return mean (no sampling)
        """
        mean, std = self.forward(state)
        if deterministic:
            return mean
        else:
            noise = torch.randn_like(mean)
            return mean + std * noise

    def log_prob(self, state, action):
        """Compute log probability of action."""
        dist = self.get_distribution(state)
        # Sum log probs across action dimensions
        return dist.log_prob(action).sum(dim=-1)


# Example: Continuous control with 3D state, 2D action
policy = DiagonalGaussianPolicy(state_dim=3, action_dim=2)

state = torch.tensor([[1.0, -0.5, 0.3]])
mean, std = policy(state)
print(f"Mean action: {mean.detach().numpy()}")
print(f"Standard deviation: {std.detach().numpy()}")

# Sample actions
actions = torch.stack([policy.sample(state) for _ in range(5)])
print(f"Sampled actions:\n{actions.detach().numpy()}")

Why Stochastic Policies?

1. Exploration

The most obvious reason: stochastic policies naturally explore. Instead of always taking the same action in a state, they occasionally try alternatives.

Unlike epsilon-greedy (which is an ad-hoc addition to deterministic policies), stochasticity is built into the policy itself. And importantly, the amount of exploration can be learned - the policy can become more deterministic as it becomes more confident.
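One common way to quantify how much a softmax policy is exploring is the entropy of its action distribution: near-uniform means high entropy, near-deterministic means low. A small sketch with made-up logits:

```python
import torch

def policy_entropy(logits):
    """Entropy (in nats) of the softmax distribution defined by the logits."""
    dist = torch.distributions.Categorical(logits=logits)
    return dist.entropy()

# Early in training: nearly flat preferences -> high entropy (much exploration).
early = torch.tensor([0.1, 0.0, -0.1])
# Later: one action strongly preferred -> low entropy (mostly exploitation).
late = torch.tensor([5.0, 0.0, -1.0])

print(policy_entropy(early).item())  # close to log(3) ≈ 1.099, the maximum
print(policy_entropy(late).item())   # much smaller
```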

2. Policy Gradients Require Them

Mathematical Details

The policy gradient theorem requires computing $\nabla_\theta \log \pi_\theta(a|s)$. For a deterministic policy where $\pi(a|s) \in \{0, 1\}$:

  • The gradient is zero almost everywhere (no learning signal)
  • The gradient is undefined at the discontinuity

Stochastic policies have smooth, well-defined gradients everywhere. This is essential for gradient-based optimization.
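This learning signal is easy to see with autograd. A minimal sketch, using toy logits as the parameters $\theta$ directly:

```python
import torch

# Toy preferences (logits) for 3 actions, treated as the policy parameters.
theta = torch.tensor([1.0, 0.5, -0.5], requires_grad=True)

# Softmax policy over actions; take the log-probability of action 0.
log_probs = torch.log_softmax(theta, dim=-1)
log_probs[0].backward()

# The gradient is well-defined and non-zero everywhere: it pushes up the
# logit of the chosen action and pushes down the others (components sum to 0).
print(theta.grad)
```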

3. Genuine Optimality

Sometimes, randomness is the best strategy:

Game theory: In rock-paper-scissors, the Nash equilibrium is playing each option with probability 1/3. Any deterministic strategy can be exploited.

Partial observability: When you can’t distinguish between states that require different actions, randomizing can be optimal.

Multi-modal rewards: If multiple actions are equally good, a stochastic policy can capture this uncertainty.
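The rock-paper-scissors claim is easy to verify numerically. A small sketch using the standard win/draw/loss payoff matrix:

```python
import numpy as np

# Payoff matrix for the row player: rows/cols are (rock, paper, scissors),
# entries are +1 win, 0 draw, -1 loss.
payoff = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
])

uniform = np.array([1/3, 1/3, 1/3])

# Against the uniform (Nash) policy, every pure strategy earns 0 on average:
# there is nothing to exploit.
print(payoff @ uniform)

# Against the deterministic policy "always rock", the best response
# ("always paper") earns +1 per round.
always_rock = np.array([1.0, 0.0, 0.0])
print((payoff @ always_rock).max())
```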

📌Example

The Stochastic Gridworld

Consider a gridworld where:

  • Two paths lead to the goal with equal expected reward
  • Path A is shorter but riskier (50% chance of penalty)
  • Path B is longer but safe

A deterministic policy must commit to one path. A stochastic policy can mix between them based on the actual risk-reward tradeoff, potentially achieving better expected performance under distribution shift.

Parameterized Policies

📖Parameterized Policy

A parameterized policy $\pi_\theta(a|s)$ is a policy whose behavior is determined by a vector of parameters $\theta$. By adjusting $\theta$, we change what actions the policy prefers.

Mathematical Details

The parameters $\theta$ can be:

  • Linear weights: $\theta = \{W, b\}$ where $h(s, a) = W_a^\top s + b_a$
  • Neural network weights: All weights and biases in a deep network
  • Any differentiable function parameters

The key requirement is that $\pi_\theta(a|s)$ must be differentiable with respect to $\theta$. This enables gradient-based learning.

</>Implementation
# Example: Simple linear softmax policy
import numpy as np

class LinearSoftmaxPolicy:
    """
    Linear softmax policy (no neural network).

    h(s, a) = theta[a] @ s
    pi(a|s) = softmax(h(s, :))
    """

    def __init__(self, state_dim, n_actions):
        self.n_actions = n_actions
        # One weight vector per action
        self.theta = np.zeros((n_actions, state_dim))

    def preferences(self, state):
        """Compute preference for each action: theta @ state."""
        return self.theta @ state

    def probabilities(self, state):
        """Convert preferences to probabilities via softmax."""
        h = self.preferences(state)
        h = h - np.max(h)  # Numerical stability
        exp_h = np.exp(h)
        return exp_h / np.sum(exp_h)

    def sample(self, state):
        """Sample action from policy."""
        probs = self.probabilities(state)
        return np.random.choice(self.n_actions, p=probs)

    def gradient_log_prob(self, state, action):
        """
        Compute gradient of log pi(a|s) w.r.t. theta.

        For softmax: grad_theta log pi(a|s) = phi(s,a) - E_pi[phi(s,a')]
        where phi(s,a) is the feature for (s,a) pair.
        """
        probs = self.probabilities(state)
        grad = np.zeros_like(self.theta)

        # For linear policy: phi(s,a) = s for the a-th row
        for a in range(self.n_actions):
            if a == action:
                grad[a] = state * (1 - probs[a])  # Chosen action
            else:
                grad[a] = -state * probs[a]  # Other actions

        return grad


# Demonstrate gradient direction
policy = LinearSoftmaxPolicy(state_dim=3, n_actions=4)
state = np.array([1.0, 0.5, -0.3])
action = 2  # Suppose we took action 2

print(f"Current probabilities: {policy.probabilities(state).round(3)}")
grad = policy.gradient_log_prob(state, action)
print(f"Gradient shape: {grad.shape}")
print(f"Gradient for action {action}: {grad[action]}")  # = state * (1 - pi(a|s))
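The analytic softmax gradient can be cross-checked against a finite-difference approximation. A self-contained sketch with random toy weights (`log_prob` below re-implements the linear softmax log-probability):

```python
import numpy as np

def log_prob(theta, state, action):
    """log pi(a|s) for a linear softmax policy with weight matrix theta."""
    h = theta @ state
    h = h - h.max()  # numerical stability
    return h[action] - np.log(np.exp(h).sum())

rng = np.random.default_rng(1)
theta = rng.normal(size=(4, 3))  # 4 actions, 3 state features
state = np.array([1.0, 0.5, -0.3])
action = 2

# Analytic gradient: phi(s, a) minus the expected feature under the policy.
h = theta @ state
probs = np.exp(h - h.max())
probs /= probs.sum()
analytic = -np.outer(probs, state)
analytic[action] += state

# Central finite differences, one parameter at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.shape[0]):
    for j in range(theta.shape[1]):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[i, j] += eps
        t_minus[i, j] -= eps
        numeric[i, j] = (
            log_prob(t_plus, state, action) - log_prob(t_minus, state, action)
        ) / (2 * eps)

print(np.abs(analytic - numeric).max())  # near zero (finite-difference noise)
```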

Summary

Stochastic policies are central to policy gradient methods:

  • Representation: Probability distributions over actions (softmax for discrete, Gaussian for continuous)
  • Temperature: Controls exploration vs exploitation
  • Parameters: Learnable weights that we optimize via gradients
  • Advantages: Natural exploration, smooth gradients, genuine optimality in some settings

The next question is: given a parameterized stochastic policy, what exactly are we trying to optimize? That’s the policy objective.