Stochastic Policies
A stochastic policy outputs a probability distribution over actions rather than a single action. Instead of “always do X,” the policy says “do X with probability 70%, Y with probability 30%.”
This is more than just an exploration trick - stochastic policies are fundamental to policy gradient methods and sometimes represent the truly optimal behavior.
What Is a Stochastic Policy?
A stochastic policy is a mapping from states to probability distributions over actions:

$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$$

For each state $s$, $\pi(\cdot \mid s)$ is a probability distribution satisfying:
- $\pi(a \mid s) \ge 0$ for all actions $a$
- $\sum_{a} \pi(a \mid s) = 1$ (discrete) or $\int \pi(a \mid s)\, da = 1$ (continuous)
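As a minimal illustration (all numbers made up), a tabular stochastic policy is just a table holding one distribution per state, and acting means sampling from the current state's distribution:

```python
import numpy as np

# A tabular stochastic policy: one distribution per state (illustrative numbers)
policy = {
    "s0": np.array([0.7, 0.3]),        # do a0 with prob 70%, a1 with prob 30%
    "s1": np.array([0.6, 0.3, 0.1]),
}

# Check the two defining properties for every state
for state, probs in policy.items():
    assert np.all(probs >= 0), "probabilities must be non-negative"
    assert np.isclose(probs.sum(), 1.0), "probabilities must sum to 1"

# Acting with the policy = sampling from the state's distribution
rng = np.random.default_rng(0)
action = rng.choice(len(policy["s0"]), p=policy["s0"])
print(f"Sampled action in s0: {action}")
```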
Think of a stochastic policy as a “soft” decision-maker. Rather than committing absolutely to one action, it hedges its bets:
“Based on what I see, I’m pretty confident action A is best (60%), but B might work (30%), and there’s a small chance C is actually right (10%).”
This uncertainty can come from:
- Exploration: Trying new things to learn
- Genuine uncertainty: When you’re not sure what’s best
- Optimality: When randomness is actually the best strategy
Softmax Policies for Discrete Actions
The most common way to represent a stochastic policy over discrete actions is the softmax function.
Given preference values $h(s, a)$ for each action (also called logits), the softmax policy is:

$$\pi(a \mid s) = \frac{e^{h(s, a)}}{\sum_{a'} e^{h(s, a')}}$$
Properties:
- All probabilities are positive: $\pi(a \mid s) > 0$
- Probabilities sum to 1: $\sum_a \pi(a \mid s) = 1$
- Higher preference $h(s, a)$ means higher probability
- Differentiable with respect to $h$

Common choices for $h(s, a)$:
- Linear: $h(s, a) = \theta_a^\top \phi(s)$ where $\phi(s)$ is a feature vector
- Neural network: $h(s, \cdot) = f_\theta(s)$ where $f_\theta$ outputs logits for all actions
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
class SoftmaxPolicy(nn.Module):
"""
Softmax policy for discrete actions using a neural network.
Takes states as input, outputs action probabilities.
"""
def __init__(self, state_dim, n_actions, hidden_dim=64):
super().__init__()
self.n_actions = n_actions
self.network = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, n_actions) # Outputs logits (preferences)
)
def forward(self, state):
"""Compute action probabilities."""
logits = self.network(state)
probs = F.softmax(logits, dim=-1)
return probs
def get_distribution(self, state):
"""Return a Categorical distribution for sampling."""
probs = self.forward(state)
return torch.distributions.Categorical(probs)
def sample(self, state):
"""Sample an action from the policy."""
dist = self.get_distribution(state)
action = dist.sample()
return action.item()
def log_prob(self, state, action):
"""Compute log probability of an action."""
dist = self.get_distribution(state)
return dist.log_prob(action)
# Example: 4-dimensional state, 3 possible actions
policy = SoftmaxPolicy(state_dim=4, n_actions=3)
# Get probabilities for a state
state = torch.tensor([[1.0, 0.5, -0.3, 0.8]])
probs = policy(state)
print(f"Action probabilities: {probs.detach().numpy()}")
# Sample multiple actions to see the distribution
actions = [policy.sample(state) for _ in range(1000)]
print(f"Action counts: {np.bincount(actions, minlength=3)}")

The Temperature Parameter
The softmax function has an implicit temperature parameter that controls how “peaked” the distribution is.
Temperature works like this:
- High temperature: More uniform distribution (more exploration)
- Low temperature: More peaked distribution (more exploitation)
- Temperature approaches 0: Becomes deterministic (always picks highest preference)
- Temperature approaches infinity: Becomes uniform random
It’s like adjusting how “confident” the policy is in its preferences.
The softmax with explicit temperature $T$:

$$\pi_T(a \mid s) = \frac{e^{h(s, a)/T}}{\sum_{a'} e^{h(s, a')/T}}$$

As $T \to 0$: $\pi_T(a \mid s) \to \mathbb{1}[a = \arg\max_{a'} h(s, a')]$ (deterministic)
As $T \to \infty$: $\pi_T(a \mid s) \to 1/|\mathcal{A}|$ (uniform random)
def softmax_with_temperature(logits, temperature=1.0):
"""
Apply softmax with temperature scaling.
Args:
logits: Unnormalized log-probabilities (preferences)
temperature: Higher = more uniform, lower = more peaked
Returns:
Probability distribution over actions
"""
scaled_logits = logits / temperature
# Subtract max for numerical stability
scaled_logits = scaled_logits - scaled_logits.max(dim=-1, keepdim=True)[0]
exp_logits = torch.exp(scaled_logits)
return exp_logits / exp_logits.sum(dim=-1, keepdim=True)
# Demonstrate temperature effects
logits = torch.tensor([[2.0, 1.0, 0.5]]) # Prefer action 0
print("Temperature effects on action probabilities:")
for temp in [0.1, 0.5, 1.0, 2.0, 10.0]:
probs = softmax_with_temperature(logits, temp)
print(f" T={temp:4.1f}: {probs.numpy().round(3)}")
# Output:
# T= 0.1: [[1.    0.    0.   ]]
# T= 0.5: [[0.844 0.114 0.042]]
# T= 1.0: [[0.629 0.231 0.14 ]]
# T= 2.0: [[0.481 0.292 0.227]]
# T=10.0: [[0.362 0.327 0.311]]

Gaussian Policies for Continuous Actions
For continuous action spaces, we typically use Gaussian (Normal) policies.
A diagonal Gaussian policy outputs a mean $\mu_\theta(s)$ and standard deviation $\sigma_\theta(s)$ for each action dimension:

$$\pi_\theta(a \mid s) = \mathcal{N}\big(a \mid \mu_\theta(s), \sigma_\theta(s)^2\big)$$

For a single action dimension:

$$\pi_\theta(a \mid s) = \frac{1}{\sigma_\theta(s)\sqrt{2\pi}} \exp\left(-\frac{(a - \mu_\theta(s))^2}{2\sigma_\theta(s)^2}\right)$$

The log-probability (used in policy gradients) is:

$$\log \pi_\theta(a \mid s) = -\frac{(a - \mu_\theta(s))^2}{2\sigma_\theta(s)^2} - \log \sigma_\theta(s) - \frac{1}{2}\log 2\pi$$
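As a sanity check (with arbitrary example numbers), the hand-computed log-probability should match `torch.distributions.Normal`:

```python
import math
import torch

mu, sigma = torch.tensor(0.5), torch.tensor(0.8)
a = torch.tensor(1.2)

# Hand-computed log pi(a|s) from the Gaussian log-density formula
manual = (-((a - mu) ** 2) / (2 * sigma ** 2)
          - torch.log(sigma)
          - 0.5 * math.log(2 * math.pi))

# Same quantity from torch.distributions
dist = torch.distributions.Normal(mu, sigma)
print(manual.item(), dist.log_prob(a).item())  # the two values agree
assert torch.isclose(manual, dist.log_prob(a))
```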
A Gaussian policy says: “I think the action should be around $\mu$, but I’m not certain - the action could be anywhere within a few $\sigma$ of that.”
- Mean $\mu$: The “intended” action
- Standard deviation $\sigma$: How much to explore around it

As training progresses, $\sigma$ often decreases - the policy becomes more confident and exploits more.
import torch
import torch.nn as nn
import numpy as np
class DiagonalGaussianPolicy(nn.Module):
"""
Gaussian policy with diagonal covariance for continuous actions.
Common design choices:
- Mean is a function of state (neural network)
- Log-std can be state-dependent or a learned constant
"""
def __init__(self, state_dim, action_dim, hidden_dim=64,
log_std_min=-20, log_std_max=2):
super().__init__()
self.action_dim = action_dim
self.log_std_min = log_std_min
self.log_std_max = log_std_max
# Shared features
self.features = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.Tanh(),
nn.Linear(hidden_dim, hidden_dim),
nn.Tanh(),
)
# Mean head
self.mean_head = nn.Linear(hidden_dim, action_dim)
# Log std as a learnable parameter (state-independent)
# Alternative: make it state-dependent with another head
self.log_std = nn.Parameter(torch.zeros(action_dim))
def forward(self, state):
"""Return mean and std of action distribution."""
features = self.features(state)
mean = self.mean_head(features)
# Clamp log_std for numerical stability
log_std = torch.clamp(self.log_std, self.log_std_min, self.log_std_max)
std = torch.exp(log_std)
return mean, std
def get_distribution(self, state):
"""Return a Normal distribution for sampling."""
mean, std = self.forward(state)
return torch.distributions.Normal(mean, std)
def sample(self, state, deterministic=False):
"""
Sample action from policy.
Args:
state: Current state
deterministic: If True, return mean (no sampling)
"""
mean, std = self.forward(state)
if deterministic:
return mean
else:
noise = torch.randn_like(mean)
return mean + std * noise
def log_prob(self, state, action):
"""Compute log probability of action."""
dist = self.get_distribution(state)
# Sum log probs across action dimensions
return dist.log_prob(action).sum(dim=-1)
# Example: Continuous control with 3D state, 2D action
policy = DiagonalGaussianPolicy(state_dim=3, action_dim=2)
state = torch.tensor([[1.0, -0.5, 0.3]])
mean, std = policy(state)
print(f"Mean action: {mean.detach().numpy()}")
print(f"Standard deviation: {std.detach().numpy()}")
# Sample actions
actions = torch.stack([policy.sample(state) for _ in range(5)])
print(f"Sampled actions:\n{actions.detach().numpy()}")

Why Stochastic Policies?
1. Exploration
The most obvious reason: stochastic policies naturally explore. Instead of always taking the same action in a state, they occasionally try alternatives.
Unlike epsilon-greedy (which is an ad-hoc addition to deterministic policies), stochasticity is built into the policy itself. And importantly, the amount of exploration can be learned - the policy can become more deterministic as it becomes more confident.
2. Policy Gradients Require Them
The policy gradient theorem requires computing $\nabla_\theta \log \pi_\theta(a \mid s)$. For a deterministic policy, where all probability mass sits on a single action:
- The gradient is zero almost everywhere (no learning signal)
- The gradient is undefined at the discontinuity
Stochastic policies have smooth, well-defined gradients everywhere. This is essential for gradient-based optimization.
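A quick autograd check (illustrative logits) confirms that $\nabla_\theta \log \pi_\theta(a \mid s)$ is finite and nonzero for a softmax policy:

```python
import torch

# Logits as the learnable parameters of a 3-action softmax policy
theta = torch.tensor([2.0, 1.0, 0.5], requires_grad=True)
dist = torch.distributions.Categorical(logits=theta)

log_prob = dist.log_prob(torch.tensor(1))  # log pi(a=1)
log_prob.backward()

# Gradient of log-softmax w.r.t. logits is one_hot(a) - probs:
# well-defined and nonzero, so gradient ascent has a signal to follow
print(theta.grad)
```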
3. Genuine Optimality
Sometimes, randomness is the best strategy:
Game theory: In rock-paper-scissors, the Nash equilibrium is playing each option with probability 1/3. Any deterministic strategy can be exploited.
Partial observability: When you can’t distinguish between states that require different actions, randomizing can be optimal.
Multi-modal rewards: If multiple actions are equally good, a stochastic policy can capture this uncertainty.
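The rock-paper-scissors point can be checked numerically (a small illustrative script): a best-responding opponent beats any deterministic strategy, but can only tie the uniform mix.

```python
import numpy as np

# Payoff for the row player: rows = our move, cols = opponent's move
# 0 = rock, 1 = paper, 2 = scissors; +1 win, -1 loss, 0 tie
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def value_vs_best_response(our_policy):
    """Expected payoff when the opponent best-responds to our mix."""
    # Opponent picks the column that minimizes our expected payoff
    return (our_policy @ PAYOFF).min()

deterministic = np.array([1.0, 0.0, 0.0])      # always rock
uniform = np.array([1 / 3, 1 / 3, 1 / 3])      # Nash equilibrium mix

print(value_vs_best_response(deterministic))   # -1.0: always-rock loses to paper
print(value_vs_best_response(uniform))         # 0.0: the mix cannot be exploited
```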
The Stochastic Gridworld
Consider a gridworld where:
- Two paths lead to the goal with equal expected reward
- Path A is shorter but riskier (50% chance of penalty)
- Path B is longer but safe
A deterministic policy must commit to one path. A stochastic policy can mix between them based on the actual risk-reward tradeoff, potentially achieving better expected performance under distribution shift.
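A back-of-the-envelope version of this tradeoff (all reward and penalty numbers assumed for illustration): both paths are equally good under the nominal penalty probability, but if that probability is misestimated, a 50/50 mix keeps the loss bounded between the two pure commitments.

```python
import numpy as np

def path_a_value(penalty_prob, reward=10.0, penalty=8.0):
    """Shorter, riskier path: full reward minus a chance of penalty."""
    return reward - penalty_prob * penalty

PATH_B_VALUE = 6.0  # longer but safe path (assumed constant reward)

# Under the nominal 50% penalty chance, both paths are equally good
assert np.isclose(path_a_value(0.5), PATH_B_VALUE)

# If the true penalty probability shifts to 80%, committing to A is worst;
# a 50/50 mix lands halfway between the pure strategies
shifted = 0.8
v_a = path_a_value(shifted)               # 10 - 0.8 * 8 = 3.6
v_mix = 0.5 * v_a + 0.5 * PATH_B_VALUE    # (3.6 + 6.0) / 2 = 4.8
print(v_a, PATH_B_VALUE, v_mix)
```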
Parameterized Policies
A parameterized policy $\pi_\theta$ is a policy whose behavior is determined by a vector of parameters $\theta$. By adjusting $\theta$, we change what actions the policy prefers.
The parameters can be:
- Linear weights: $h_\theta(s, a) = \theta_a^\top s$ where $\theta$ stacks one weight vector per action
- Neural network weights: All weights and biases in a deep network
- Any differentiable function parameters
The key requirement is that $\pi_\theta(a \mid s)$ must be differentiable with respect to $\theta$. This enables gradient-based learning.
# Example: Simple linear softmax policy
class LinearSoftmaxPolicy:
"""
Linear softmax policy (no neural network).
h(s, a) = theta[a] @ s
pi(a|s) = softmax(h(s, :))
"""
def __init__(self, state_dim, n_actions):
self.n_actions = n_actions
# One weight vector per action
self.theta = np.zeros((n_actions, state_dim))
def preferences(self, state):
"""Compute preference for each action: theta @ state."""
return self.theta @ state
def probabilities(self, state):
"""Convert preferences to probabilities via softmax."""
h = self.preferences(state)
h = h - np.max(h) # Numerical stability
exp_h = np.exp(h)
return exp_h / np.sum(exp_h)
def sample(self, state):
"""Sample action from policy."""
probs = self.probabilities(state)
return np.random.choice(self.n_actions, p=probs)
def gradient_log_prob(self, state, action):
"""
Compute gradient of log pi(a|s) w.r.t. theta.
For softmax: grad_theta log pi(a|s) = phi(s,a) - E_pi[phi(s,a')]
where phi(s,a) is the feature for (s,a) pair.
"""
probs = self.probabilities(state)
grad = np.zeros_like(self.theta)
# For linear policy: phi(s,a) = s for the a-th row
for a in range(self.n_actions):
if a == action:
grad[a] = state * (1 - probs[a]) # Chosen action
else:
grad[a] = -state * probs[a] # Other actions
return grad
# Demonstrate gradient direction
policy = LinearSoftmaxPolicy(state_dim=3, n_actions=4)
state = np.array([1.0, 0.5, -0.3])
action = 2 # Suppose we took action 2
print(f"Current probabilities: {policy.probabilities(state).round(3)}")
grad = policy.gradient_log_prob(state, action)
print(f"Gradient shape: {grad.shape}")
print(f"Gradient for chosen action {action} (state scaled by 1 - prob): {grad[action]}")

Summary
Stochastic policies are central to policy gradient methods:
- Representation: Probability distributions over actions (softmax for discrete, Gaussian for continuous)
- Temperature: Controls exploration vs exploitation
- Parameters: Learnable weights that we optimize via gradients
- Advantages: Natural exploration, smooth gradients, genuine optimality in some settings
The next question is: given a parameterized stochastic policy, what exactly are we trying to optimize? That’s the policy objective.