Actor-Critic Methods
What You'll Learn
- Explain the actor-critic architecture and the roles of actor and critic
- Understand how the critic reduces variance compared to REINFORCE
- Derive and implement the advantage function
- Implement A2C (Advantage Actor-Critic) from scratch
- Understand the bias-variance trade-off in actor-critic methods
- Compare on-policy and off-policy actor-critic variants
The Best of Both Worlds
REINFORCE taught us how to compute policy gradients, but it has a fundamental weakness: high variance from Monte Carlo returns. We have to wait for complete episodes, and the returns we observe are noisy.
Meanwhile, TD learning showed us a different approach: bootstrap from value estimates instead of waiting for actual returns. This reduces variance dramatically but introduces some bias.
What if we could combine these ideas?
Actor-critic methods use two components:
- Actor: The policy that decides what to do
- Critic: A value function that evaluates how good the current state is
The critic helps the actor learn more efficiently. Instead of waiting for actual returns (high variance), we use the critic’s estimates (lower variance). The actor and critic learn together—the critic gets better at evaluation, and the actor gets better at acting.
Think of it like a player (actor) and a coach (critic). The coach evaluates performance and provides feedback, helping the player improve faster than if they just played games and looked at final scores.
The Advantage Function
The key insight of actor-critic is using the advantage function instead of raw returns.
Mathematical Details
The advantage function measures how much better an action is compared to the average:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

Where:
- $Q^\pi(s, a)$ = expected return from taking action $a$ in state $s$, then following $\pi$
- $V^\pi(s)$ = expected return from state $s$ following $\pi$ = $\mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$
The advantage tells us: “How much better (or worse) is this specific action compared to what we’d typically get from this state?”
Properties of advantage:
- $A^\pi(s, a) > 0$: action is better than average
- $A^\pi(s, a) < 0$: action is worse than average
- $\mathbb{E}_{a \sim \pi}[A^\pi(s, a)] = 0$: advantages are centered (average advantage is zero)
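The third property follows in one line from the two definitions above:

$$\mathbb{E}_{a \sim \pi}\big[A^\pi(s, a)\big] = \mathbb{E}_{a \sim \pi}\big[Q^\pi(s, a)\big] - V^\pi(s) = V^\pi(s) - V^\pi(s) = 0$$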
Why is advantage better than raw returns for policy gradients?
Consider a game where all outcomes give positive rewards (100, 105, 110, etc.). With raw returns:
- Action leading to 105: positive update
- Action leading to 100: still positive update!
The policy can’t distinguish “good” from “just positive.” With advantages:
- Action leading to 105 (average is 105): zero update
- Action leading to 110 (above average): positive update
- Action leading to 100 (below average): negative update
Now we’re properly rewarding above-average actions and penalizing below-average ones.
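To make this concrete, here is a minimal sketch (plain NumPy, separate from the A2C code later in this chapter) that contrasts raw returns with baseline-subtracted advantages for the outcomes above, using the mean return as a stand-in for $V(s)$:

import numpy as np

# Hypothetical returns observed for three different actions from the same state
returns = np.array([105.0, 110.0, 100.0])

# Using raw returns: every action gets a positive "reinforcement" signal
print("raw signals:       ", returns)        # [105. 110. 100.] -- all positive

# Using advantages: subtract the state value (approximated here by the mean return)
baseline = returns.mean()                     # plays the role of V(s)
advantages = returns - baseline
print("advantage signals: ", advantages)      # [ 0.  5. -5.] -- centered at zero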
The TD Error as Advantage Estimate
Here’s the elegant connection to TD learning:
Mathematical Details
The TD error is:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Remarkably, this is an unbiased estimate of the advantage (when $V = V^\pi$):

$$\mathbb{E}\big[\delta_t \mid s_t = s, a_t = a\big] = A^\pi(s, a)$$

Proof sketch:

$$\mathbb{E}\big[r_t + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\big] - V^\pi(s) = Q^\pi(s, a) - V^\pi(s) = A^\pi(s, a)$$

The first term is exactly $Q^\pi(s, a)$ (immediate reward + discounted value of next state), and subtracting $V^\pi(s)$ gives us the advantage.
The TD error captures surprise:
- Expected outcome: Starting from $s_t$, we expected to get $V(s_t)$ in value
- Actual outcome: We got reward $r_t$ and ended up in state $s_{t+1}$, worth $\gamma V(s_{t+1})$
- Surprise: The difference $\delta_t$ is how much better (or worse) things went

This surprise is exactly what we want to reinforce! If $\delta_t > 0$, things went better than expected—make the action more likely. If $\delta_t < 0$, things went worse—make the action less likely.
The TD error gives us advantage estimation for free! We just need a value function estimate, which we get from the critic. No need to estimate $Q(s, a)$ separately.
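As a quick illustration (separate from the A2C implementation below), here is a tiny sketch that computes the one-step TD error from a reward and two hypothetical critic estimates:

def td_error(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD error, used directly as the advantage estimate."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

# The critic thought this state was worth 4.0; we got a reward of 1.0 and
# landed in a state the critic values at 5.0 -- a pleasant surprise.
delta = td_error(reward=1.0, value_s=4.0, value_next=5.0)
print(delta)  # 1.0 + 0.99 * 5.0 - 4.0 = 1.95 > 0: make this action more likely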
The A2C Algorithm
A2C (Advantage Actor-Critic) is a clean implementation of actor-critic ideas. Let’s build it step by step.
Mathematical Details
A2C Updates:

Actor update (policy gradient with advantage):

$$\theta \leftarrow \theta + \alpha_\theta \, \hat{A}_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Critic update (TD learning):

$$w \leftarrow w + \alpha_w \, \delta_t \, \nabla_w V_w(s_t)$$

Or equivalently, minimize the TD error squared:

$$L(w) = \big(r_t + \gamma V_w(s_{t+1}) - V_w(s_t)\big)^2$$

where $\hat{A}_t = \delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$ (using TD error as advantage estimate).
Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import numpy as np
class ActorCriticNetwork(nn.Module):
"""
Neural network with separate actor (policy) and critic (value) heads.
Shares early layers for efficiency.
"""
def __init__(self, state_dim, n_actions, hidden_dim=128):
super().__init__()
# Shared feature extractor
self.shared = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU()
)
# Actor head: outputs action logits
self.actor = nn.Linear(hidden_dim, n_actions)
# Critic head: outputs state value
self.critic = nn.Linear(hidden_dim, 1)
def forward(self, state):
"""Return action logits and state value."""
features = self.shared(state)
action_logits = self.actor(features)
value = self.critic(features)
return action_logits, value
def get_action_and_value(self, state):
"""Sample action and return action, log_prob, value."""
action_logits, value = self.forward(state)
dist = Categorical(logits=action_logits)
action = dist.sample()
log_prob = dist.log_prob(action)
entropy = dist.entropy()
return action.item(), log_prob, value.squeeze(), entropy
class A2C:
"""Advantage Actor-Critic algorithm."""
def __init__(self, state_dim, n_actions, lr=0.001, gamma=0.99,
value_coef=0.5, entropy_coef=0.01):
"""
Args:
state_dim: Dimension of state space
n_actions: Number of discrete actions
lr: Learning rate
gamma: Discount factor
value_coef: Coefficient for value loss
entropy_coef: Coefficient for entropy bonus (encourages exploration)
"""
self.gamma = gamma
self.value_coef = value_coef
self.entropy_coef = entropy_coef
self.network = ActorCriticNetwork(state_dim, n_actions)
self.optimizer = optim.Adam(self.network.parameters(), lr=lr)
# Episode storage
self.log_probs = []
self.values = []
self.rewards = []
self.entropies = []
self.dones = []
def select_action(self, state):
"""Select action using current policy."""
state = torch.FloatTensor(state).unsqueeze(0)
action, log_prob, value, entropy = self.network.get_action_and_value(state)
self.log_probs.append(log_prob)
self.values.append(value)
self.entropies.append(entropy)
return action
def store_transition(self, reward, done):
"""Store reward and done flag."""
self.rewards.append(reward)
self.dones.append(done)
def compute_returns_and_advantages(self, next_value):
"""
Compute returns and advantages using TD(0).
For each step: advantage = reward + gamma * next_value - value
"""
advantages = []
returns = []
# Bootstrap from next value if episode didn't end
R = next_value
for t in reversed(range(len(self.rewards))):
if self.dones[t]:
R = 0 # Terminal state has zero future value
R = self.rewards[t] + self.gamma * R
returns.insert(0, R)
# TD advantage: r + gamma * V(s') - V(s)
if self.dones[t]:
next_val = 0
elif t == len(self.rewards) - 1:
next_val = next_value
else:
next_val = self.values[t + 1].item()
advantage = self.rewards[t] + self.gamma * next_val - self.values[t].item()
advantages.insert(0, advantage)
returns = torch.FloatTensor(returns)
advantages = torch.FloatTensor(advantages)
        # Normalize advantages (guard against single-sample batches, whose std is NaN)
        if advantages.numel() > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
return returns, advantages
def update(self, next_state=None):
"""Perform A2C update."""
# Get next value for bootstrapping
if next_state is not None:
with torch.no_grad():
state = torch.FloatTensor(next_state).unsqueeze(0)
_, next_value = self.network(state)
next_value = next_value.item()
else:
next_value = 0
returns, advantages = self.compute_returns_and_advantages(next_value)
# Stack tensors
log_probs = torch.stack(self.log_probs)
values = torch.stack(self.values)
entropies = torch.stack(self.entropies)
# Policy loss: -log_prob * advantage
policy_loss = -(log_probs * advantages).mean()
# Value loss: MSE between predicted values and returns
value_loss = nn.functional.mse_loss(values, returns)
# Entropy bonus: encourage exploration
entropy_loss = -entropies.mean()
# Combined loss
loss = policy_loss + self.value_coef * value_loss + self.entropy_coef * entropy_loss
# Update network
self.optimizer.zero_grad()
loss.backward()
# Gradient clipping for stability
nn.utils.clip_grad_norm_(self.network.parameters(), max_norm=0.5)
self.optimizer.step()
# Clear storage
self.log_probs = []
self.values = []
self.rewards = []
self.entropies = []
self.dones = []
return {
'policy_loss': policy_loss.item(),
'value_loss': value_loss.item(),
'entropy': -entropy_loss.item()
        }

Training A2C
Implementation
import gymnasium as gym
def train_a2c(env_name='CartPole-v1', num_episodes=1000, update_freq=5):
"""
Train A2C agent.
Args:
env_name: Gymnasium environment name
num_episodes: Number of training episodes
update_freq: Update every N steps (or at episode end)
"""
env = gym.make(env_name)
state_dim = env.observation_space.shape[0]
n_actions = env.action_space.n
agent = A2C(state_dim, n_actions, lr=0.001, gamma=0.99)
episode_rewards = []
for episode in range(num_episodes):
state, _ = env.reset()
done = False
total_reward = 0
step = 0
while not done:
action = agent.select_action(state)
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
agent.store_transition(reward, done)
total_reward += reward
step += 1
# Update periodically or at episode end
if step % update_freq == 0 or done:
agent.update(next_state if not done else None)
state = next_state
episode_rewards.append(total_reward)
if episode % 100 == 0:
avg_reward = np.mean(episode_rewards[-100:])
print(f"Episode {episode}, Avg Reward: {avg_reward:.1f}")
return agent, episode_rewards
# Train the agent
agent, rewards = train_a2c(num_episodes=1000)

Bias-Variance Trade-off
Actor-critic methods introduce a trade-off that’s important to understand.
REINFORCE (Monte Carlo):
- Uses actual returns $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$
- Unbiased: $\mathbb{E}[G_t \mid s_t, a_t] = Q^\pi(s_t, a_t)$ exactly
- High variance: Returns vary a lot across episodes
Actor-Critic (TD):
- Uses bootstrapped estimate $r_t + \gamma V(s_{t+1})$
- Biased: If $V$ is wrong, the estimate is wrong
- Lower variance: $V$ is a learned, stable function
The bias comes from using a learned value function that might be inaccurate. The variance reduction comes from replacing random future outcomes with a stable estimate.
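The following synthetic sketch (made-up reward statistics, not tied to any particular environment) illustrates the trade-off numerically: Monte Carlo targets are centered on the true value but noisy, while TD targets built from a deliberately biased critic are far less noisy but shifted away from the truth:

import numpy as np

rng = np.random.default_rng(0)
gamma, horizon, n_samples = 0.99, 50, 10_000

# Per-step rewards are noisy: mean 1.0, std 1.0
true_value = sum(gamma ** k for k in range(horizon))            # value of s_0
true_next_value = sum(gamma ** k for k in range(horizon - 1))   # value of s_1

# Monte Carlo targets: full discounted sums of noisy rewards
mc_targets = [
    sum((gamma ** k) * rng.normal(1.0, 1.0) for k in range(horizon))
    for _ in range(n_samples)
]

# TD(0) targets: one noisy reward + bootstrap from a critic that is off by +2
biased_critic = true_next_value + 2.0
td_targets = [rng.normal(1.0, 1.0) + gamma * biased_critic for _ in range(n_samples)]

print(f"true value : {true_value:.2f}")
print(f"MC targets : mean {np.mean(mc_targets):.2f}, std {np.std(mc_targets):.2f}")  # unbiased, high std
print(f"TD targets : mean {np.mean(td_targets):.2f}, std {np.std(td_targets):.2f}")  # biased, low std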
N-Step Returns
We can interpolate between TD(0) and Monte Carlo using n-step returns:
Mathematical Details
n-step return:

$$G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})$$

- $n = 1$: TD(0), $G_t^{(1)} = r_t + \gamma V(s_{t+1})$
- $n \to \infty$ (or the full episode): Monte Carlo, $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$

n-step advantage:

$$\hat{A}_t^{(n)} = G_t^{(n)} - V(s_t)$$

Higher $n$:
- Less bias (more actual rewards, less bootstrapping)
- More variance (more randomness from actual trajectory)
Implementation
def compute_n_step_returns(rewards, values, gamma, n_steps):
"""
Compute n-step returns.
Args:
rewards: List of rewards [r_0, r_1, ..., r_T]
values: List of value estimates [V(s_0), V(s_1), ..., V(s_T)]
gamma: Discount factor
n_steps: Number of steps for n-step returns
"""
T = len(rewards)
returns = []
for t in range(T):
# Sum of discounted rewards for n steps
G = 0
for k in range(min(n_steps, T - t)):
G += (gamma ** k) * rewards[t + k]
        # Bootstrap from the critic's estimate at the end of the n-step window.
        # (values[T] = V(s_T) should be 0 if the episode terminated there.)
        steps = min(n_steps, T - t)
        G += (gamma ** steps) * values[t + steps]
returns.append(G)
    return returns

In practice, n-step returns with $n$ between 5 and 20 often work well. The optimal $n$ depends on the environment—longer horizons for problems where rewards are sparse or delayed.
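As a quick sanity check (assuming the compute_n_step_returns helper above is in scope; the trajectory numbers are made up), you might call it like this:

# Toy trajectory: five rewards and six value estimates V(s_0)..V(s_5)
rewards = [1.0, 0.0, 1.0, 0.0, 1.0]
values = [2.5, 2.0, 2.2, 1.5, 1.0, 0.0]   # V(s_5) = 0 if the episode terminated

for n in (1, 3, 5):
    returns = compute_n_step_returns(rewards, values, gamma=0.99, n_steps=n)
    print(f"{n}-step returns:", [round(g, 3) for g in returns])

# Smaller n leans more on the (possibly biased) value estimates;
# larger n uses more of the actual rewards and so carries more variance.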
Entropy Regularization
You may have noticed the entropy_coef in our A2C implementation. What’s that about?
Entropy measures how random a probability distribution is:
- High entropy = spread out, lots of uncertainty
- Low entropy = peaked, very certain
For a policy, high entropy means exploring many actions; low entropy means consistently picking the same action.
By adding an entropy bonus to our objective, we encourage the policy to maintain some randomness:

$$J(\theta) = \mathbb{E}\big[\log \pi_\theta(a_t \mid s_t) \, \hat{A}_t\big] + \beta \, H\big(\pi_\theta(\cdot \mid s_t)\big)$$

where $\beta > 0$ is the entropy coefficient.
This prevents premature convergence to a suboptimal deterministic policy. The policy keeps exploring even as it improves.
Mathematical Details
The entropy of a discrete policy is:

$$H\big(\pi_\theta(\cdot \mid s)\big) = -\sum_a \pi_\theta(a \mid s) \log \pi_\theta(a \mid s)$$

Adding this to the policy gradient objective:

$$\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t\big] + \beta \, \nabla_\theta H\big(\pi_\theta(\cdot \mid s_t)\big)$$
The gradient encourages:
- Increasing probability of advantageous actions (from advantage term)
- Keeping the policy stochastic (from entropy term)
As training progresses, we might decrease $\beta$ to allow the policy to become more deterministic.

Too much entropy regularization can prevent the policy from ever converging. Too little can cause it to collapse to a deterministic (possibly suboptimal) policy too quickly. Values of $\beta$ between 0.001 and 0.01 are common starting points.
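To see what the entropy term actually measures, here is a small standalone sketch using torch.distributions; the annealing schedule at the end is just one reasonable choice, not something A2C prescribes:

import torch
from torch.distributions import Categorical

# A nearly uniform policy has high entropy; a peaked policy has low entropy
uniform = Categorical(probs=torch.tensor([0.25, 0.25, 0.25, 0.25]))
peaked = Categorical(probs=torch.tensor([0.97, 0.01, 0.01, 0.01]))
print(f"uniform entropy: {uniform.entropy().item():.3f}")   # ~1.386 (= ln 4)
print(f"peaked entropy:  {peaked.entropy().item():.3f}")    # ~0.168

# One simple way to anneal the entropy coefficient over training
def entropy_coef(step, start=0.01, end=0.001, decay_steps=100_000):
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

print(entropy_coef(0), entropy_coef(50_000), entropy_coef(200_000))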
Shared vs. Separate Networks
There’s an architectural choice in actor-critic: should the actor and critic share parameters?
Shared networks (common layers, separate heads):
- More parameter-efficient
- Shared representations can help both actor and critic
- Risk: critic updates might hurt actor’s representations (and vice versa)
Separate networks (completely independent):
- No interference between actor and critic learning
- More parameters, more memory
- Can use different architectures/learning rates
In practice, shared networks work well for many problems, especially with careful loss weighting. The value coefficient (e.g., 0.5) balances how much the critic loss affects shared layers.
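For reference, a separate-network variant might look like the sketch below (the class name, layer sizes, and the two learning rates are illustrative choices of mine, not tuned values):

import torch.nn as nn
import torch.optim as optim

class SeparateActorCritic(nn.Module):
    """Actor and critic as fully independent networks (no shared layers)."""

    def __init__(self, state_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),   # action logits
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),           # state value
        )

    def forward(self, state):
        return self.actor(state), self.critic(state)

# Independent optimizers allow different learning rates for actor and critic
net = SeparateActorCritic(state_dim=4, n_actions=2)
actor_opt = optim.Adam(net.actor.parameters(), lr=3e-4)
critic_opt = optim.Adam(net.critic.parameters(), lr=1e-3)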
Comparison: REINFORCE vs. A2C
Let’s compare the two approaches:
Implementation
def compare_reinforce_a2c(env_name='CartPole-v1', num_episodes=500, num_seeds=3):
"""Compare REINFORCE and A2C on the same environment."""
import matplotlib.pyplot as plt
results = {'REINFORCE': [], 'A2C': []}
for seed in range(num_seeds):
torch.manual_seed(seed)
np.random.seed(seed)
# Train REINFORCE
env = gym.make(env_name)
state_dim = env.observation_space.shape[0]
n_actions = env.action_space.n
# Using REINFORCE with baseline from previous chapter
reinforce_agent = REINFORCEWithBaseline(state_dim, n_actions, lr=0.01)
reinforce_rewards = []
for _ in range(num_episodes):
state, _ = env.reset()
done = False
ep_reward = 0
while not done:
action = reinforce_agent.select_action(state)
state, reward, term, trunc, _ = env.step(action)
reinforce_agent.store_reward(reward)
ep_reward += reward
done = term or trunc
reinforce_agent.update()
reinforce_rewards.append(ep_reward)
results['REINFORCE'].append(reinforce_rewards)
# Train A2C
torch.manual_seed(seed)
np.random.seed(seed)
env = gym.make(env_name)
a2c_agent = A2C(state_dim, n_actions, lr=0.001)
a2c_rewards = []
for _ in range(num_episodes):
state, _ = env.reset()
done = False
ep_reward = 0
steps = 0
while not done:
action = a2c_agent.select_action(state)
next_state, reward, term, trunc, _ = env.step(action)
done = term or trunc
a2c_agent.store_transition(reward, done)
ep_reward += reward
steps += 1
if steps % 5 == 0 or done:
a2c_agent.update(next_state if not done else None)
state = next_state
a2c_rewards.append(ep_reward)
results['A2C'].append(a2c_rewards)
# Plot comparison
plt.figure(figsize=(10, 6))
colors = {'REINFORCE': 'blue', 'A2C': 'orange'}
for name, runs in results.items():
mean = np.mean(runs, axis=0)
std = np.std(runs, axis=0)
# Smooth for visualization
window = 20
smoothed = np.convolve(mean, np.ones(window)/window, mode='valid')
x = np.arange(len(smoothed))
plt.plot(x, smoothed, label=name, color=colors[name])
plt.fill_between(x,
np.convolve(mean - std, np.ones(window)/window, mode='valid'),
np.convolve(mean + std, np.ones(window)/window, mode='valid'),
alpha=0.2, color=colors[name])
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('REINFORCE vs A2C on CartPole')
plt.legend()
plt.grid(True, alpha=0.3)
    plt.show()

| Aspect | REINFORCE | A2C |
|---|---|---|
| Update frequency | End of episode | Every few steps |
| Variance | High (Monte Carlo) | Lower (bootstrapping) |
| Bias | None | Some (from value function) |
| Sample efficiency | Poor | Better |
| Stability | Less stable | More stable |
| Complexity | Simpler | Slightly more complex |
Summary
Key Takeaways
- Actor-critic combines policy learning (actor) with value function learning (critic)
- The advantage function measures how much better an action is than average
- The TD error is an unbiased estimate of advantage
- A2C updates both actor and critic, using TD error for advantages
- There’s a bias-variance trade-off: bootstrapping reduces variance but introduces bias if the value function is inaccurate
- N-step returns interpolate between TD (low variance, some bias) and Monte Carlo (no bias, high variance)
- Entropy regularization prevents premature convergence to deterministic policies
- Actor and critic can share network layers for efficiency, or use separate networks for stability
Actor-critic methods are the foundation of most modern RL algorithms. But there’s still a stability issue: if the actor changes too much, learning can collapse. One bad update can undo hours of progress.
How do we take the largest useful steps while staying safe? That’s the question PPO and trust region methods answer—and they’re the focus of our next chapter.
Exercises
Conceptual Questions
- Explain the actor-critic architecture. What role does each component play? Why is this combination powerful?
- Why is the TD error an estimate of advantage? Show that $\mathbb{E}[\delta_t \mid s_t, a_t] = A^\pi(s_t, a_t)$.
- What is the bias-variance trade-off in actor-critic? Why does bootstrapping reduce variance? Why does it introduce bias?
- Compare actor-critic to Q-learning. What are the similarities? The differences?
Coding Challenges
- Implement A2C from scratch and train on CartPole. Compare learning curves to REINFORCE with baseline.
- Experiment with n-step returns. Compare n=1, n=5, n=20, and Monte Carlo. Which learns fastest? Which is most stable?
- Implement entropy regularization. Compare learning across several values of the entropy coefficient $\beta$. How does entropy affect exploration?
Exploration
- Learning rate balance. Try different learning rates for actor and critic (use separate optimizers). What happens when the critic learns much faster than the actor? Much slower?
- Shared vs. separate networks. Implement A2C with completely separate actor and critic networks. Compare to shared networks. When might separate networks be preferable?