The Actor-Critic Idea
REINFORCE with a baseline learns a value function to reduce variance. But we still wait until the end of each episode to update. What if we could update at every step, using the value function more directly?
This is the actor-critic idea: combine a policy (actor) with a value function (critic) to get the best of both worlds.
The Two Components
Actor: The policy network that decides what action to take. It’s called the “actor” because it determines how the agent acts.
Critic: The value network that evaluates how good states are. It “criticizes” the actor by estimating expected returns, providing feedback for learning.
Together, they form a powerful learning system where the critic guides the actor’s improvement.
Think of learning to play chess:
- Actor (your strategy): “In this position, I should probably castle, or maybe push a pawn…”
- Critic (your evaluation): “This position looks good for me - I’m up material and have better piece activity.”
The critic doesn’t tell you what move to make, but it does tell you whether you’re in a good position. This feedback helps the actor learn which moves lead to good positions.
In REINFORCE, we had an actor but the critic was implicit (just the returns). Now we make the critic explicit and trainable.
Why Combine Them?
The REINFORCE Problem
REINFORCE has a fundamental limitation: it uses Monte Carlo returns - we must wait until the end of the episode to know how good our actions were.
This causes several problems:
- Can’t learn mid-episode: We accumulate experience but can’t use it until the episode ends
- High variance: Returns incorporate randomness from many future steps
- Slow credit assignment: A good early action gets the same noisy signal as a mediocre late action
- Long episodes are painful: A 1000-step episode means 1000 steps before any update
The Value Function Solution
The value function $V_\phi(s)$ estimates the expected return from state $s$. If our estimate is good, we can use it to provide immediate feedback:

Instead of waiting to see the actual return:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$

We can use a bootstrapped estimate:

$$G_t \approx r_t + \gamma V_\phi(s_{t+1})$$
This is the one-step TD target. We observe one reward and then ask the critic: “How good is the next state?” This gives us usable signal at every step.
The TD error provides an advantage estimate:

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

This approximates the advantage $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ because:

If we take the sampled reward and next state as our estimate of the action value:

$$Q(s_t, a_t) \approx r_t + \gamma V_\phi(s_{t+1})$$

Then:

$$A(s_t, a_t) \approx r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) = \delta_t$$
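To make this concrete, a quick numeric check (all values invented):

```python
# Invented numbers for one transition (s_t, a_t, r_t, s_{t+1}).
reward = 1.0      # r_t
gamma = 0.99      # discount factor
v_s = 2.0         # critic's estimate V(s_t)
v_next = 2.5      # critic's estimate V(s_{t+1})

td_target = reward + gamma * v_next   # 1.0 + 0.99 * 2.5 = 3.475
td_error = td_target - v_s            # 3.475 - 2.0 = 1.475

# Positive TD error: the outcome was better than the critic expected,
# so the action a_t should be reinforced.
```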
The Actor-Critic Update
At each step, actor-critic performs two updates:
Actor update (policy gradient with advantage):

$$\theta \leftarrow \theta + \alpha_{\text{actor}} \, \delta_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Critic update (TD learning):

$$\phi \leftarrow \phi + \alpha_{\text{critic}} \, \delta_t \, \nabla_\phi V_\phi(s_t)$$
The actor learns from the critic’s feedback. The critic learns from observed rewards. Both improve together.
Each step of actor-critic:
- Act: Sample action from current policy
- Observe: Get reward and next state
- Critique: Compute TD error (was reality better or worse than expected?)
- Update Actor: If TD error is positive, reinforce the action; if negative, suppress it
- Update Critic: Adjust value estimates toward observed reality
This happens at every timestep - no waiting for episode end!
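The five steps above can be sketched as a minimal self-contained loop. A toy network and a generic `env_reset`/`env_step` interface stand in for a real environment here; all names are illustrative, not from a specific library:

```python
import torch
import torch.nn as nn

class TinyActorCritic(nn.Module):
    """Toy shared-trunk actor-critic, just for this sketch."""
    def __init__(self, state_dim=4, n_actions=2, hidden=32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.actor_head = nn.Linear(hidden, n_actions)
        self.critic_head = nn.Linear(hidden, 1)

    def forward(self, s):
        h = self.shared(s)
        return self.actor_head(h), self.critic_head(h).squeeze(-1)

def actor_critic_episode(env_reset, env_step, model, optimizer, gamma=0.99):
    """Run one episode with a per-step (online) actor-critic update."""
    state = env_reset()
    done = False
    while not done:
        # 1. Act: sample an action from the current policy
        logits, value = model(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        # 2. Observe: reward and next state from the environment
        next_state, reward, done = env_step(action.item())
        # 3. Critique: TD error, bootstrapping from V(s') (0 at episode end)
        with torch.no_grad():
            next_value = torch.tensor(0.0) if done else model(next_state)[1]
        td_target = reward + gamma * next_value
        advantage = td_target - value
        # 4. Update actor: reinforce if advantage > 0, suppress if < 0
        actor_loss = -dist.log_prob(action) * advantage.detach()
        # 5. Update critic: move V(s) toward the TD target
        critic_loss = (value - td_target.detach()) ** 2
        loss = actor_loss + 0.5 * critic_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = next_state
```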
Shared vs. Separate Networks
Should actor and critic share parameters?
Shared networks (common in practice):
- Lower layers extract features useful for both policy and value
- Reduces total parameters
- Acts as regularization
- Can sometimes destabilize training (competing objectives)
Separate networks:
- Complete independence between actor and critic
- Easier to tune learning rates separately
- No interference between objectives
- More parameters to train
Most implementations (including PPO) use shared networks with separate “heads” for policy and value outputs.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActorCriticNetwork(nn.Module):
    """
    Combined actor-critic network with shared feature extraction.

    Architecture:
        State -> Shared Layers -> Actor Head (policy)
                               -> Critic Head (value)
    """
    def __init__(self, state_dim, n_actions, hidden_dim=128):
        super().__init__()
        # Shared feature extraction layers
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        # Actor head: outputs action logits
        self.actor_head = nn.Linear(hidden_dim, n_actions)
        # Critic head: outputs a single value
        self.critic_head = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        """Forward pass returning both policy logits and value."""
        features = self.shared(state)
        action_logits = self.actor_head(features)
        value = self.critic_head(features)
        return action_logits, value.squeeze(-1)

    def get_policy(self, state):
        """Get the action distribution."""
        action_logits, _ = self.forward(state)
        return torch.distributions.Categorical(logits=action_logits)

    def get_value(self, state):
        """Get the state value."""
        _, value = self.forward(state)
        return value

    def get_action_and_value(self, state):
        """Sample an action and return it with its log_prob and the value."""
        action_logits, value = self.forward(state)
        dist = torch.distributions.Categorical(logits=action_logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob, value


class SeparateActorCritic(nn.Module):
    """Separate actor and critic networks."""
    def __init__(self, state_dim, n_actions, hidden_dim=128):
        super().__init__()
        # Actor network
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, n_actions),
        )
        # Critic network
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state):
        action_logits = self.actor(state)
        value = self.critic(state).squeeze(-1)
        return action_logits, value
```

Online vs. Batch Updates
Actor-critic can work in two modes:
Online (step-by-step):
- Update after every single step
- Maximum responsiveness to new data
- Can be unstable due to high variance
Batch (n-step or trajectory):
- Collect n steps of experience, then update
- More stable gradients from averaging
- Standard in modern implementations (PPO uses batches)
Most practical implementations use batched updates because they provide more stable learning while still being much faster than full-episode Monte Carlo.
```python
def online_actor_critic_step(model, optimizer, state, action, reward,
                             next_state, done, gamma=0.99):
    """
    Online (single-step) actor-critic update.
    """
    # Forward pass
    action_logits, value = model(state.unsqueeze(0))

    # Bootstrapped value of the next state (zero if the episode ended)
    if done:
        next_value = torch.zeros(1)
    else:
        with torch.no_grad():
            _, next_value = model(next_state.unsqueeze(0))

    # TD target and TD error (advantage estimate)
    td_target = reward + gamma * next_value
    advantage = td_target - value

    # Actor loss: negative log prob weighted by the (detached) advantage
    dist = torch.distributions.Categorical(logits=action_logits)
    log_prob = dist.log_prob(torch.tensor([action]))
    actor_loss = -(log_prob * advantage.detach()).mean()

    # Critic loss: squared TD error (target detached so only V(s_t) moves)
    critic_loss = F.mse_loss(value, td_target.detach())

    # Combined loss
    loss = actor_loss + 0.5 * critic_loss

    # Update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), advantage.mean().item()
```

The Bias-Variance Tradeoff
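A batched variant, by contrast, collects a short rollout and computes n-step returns before one combined update. Here is a sketch with buffer handling simplified; `model` is any network whose forward returns `(action_logits, values)`, like the ones above, and the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def batched_actor_critic_update(model, optimizer, states, actions, rewards,
                                last_state, last_done, gamma=0.99):
    """
    Batched (n-step) actor-critic update over a short rollout.
    states: [n, state_dim] tensor; actions: [n] tensor; rewards: n floats.
    Bootstraps from V(last_state) unless the rollout ended the episode.
    """
    n = len(rewards)
    action_logits, values = model(states)   # [n, n_actions], [n]

    # Bootstrap value for the state after the rollout
    with torch.no_grad():
        if last_done:
            bootstrap = torch.tensor(0.0)
        else:
            bootstrap = model(last_state.unsqueeze(0))[1].squeeze(0)

    # n-step returns, accumulated backward from the bootstrap value
    returns = torch.empty(n)
    R = bootstrap
    for t in reversed(range(n)):
        R = rewards[t] + gamma * R
        returns[t] = R

    # Advantages from the frozen critic; average losses over the batch
    advantages = returns - values.detach()
    dist = torch.distributions.Categorical(logits=action_logits)
    actor_loss = -(dist.log_prob(actions) * advantages).mean()
    critic_loss = F.mse_loss(values, returns)
    loss = actor_loss + 0.5 * critic_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Averaging over the n steps is what gives batched updates their more stable gradients compared with the single-step version.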
Actor-critic introduces bias that REINFORCE doesn’t have.
REINFORCE uses true returns - unbiased but high variance.
Actor-Critic uses bootstrapped estimates - biased (because $V_\phi$ is approximate) but lower variance.
This is a fundamental tradeoff in RL. We’ll see how to navigate it with n-step returns and GAE in later sections.
The bias in actor-critic comes from using the learned $V_\phi(s_{t+1})$ instead of the true expected value $V^\pi(s_{t+1})$:

If $V_\phi \neq V^\pi$, then $\mathbb{E}[\delta_t] \neq A^\pi(s_t, a_t)$.

However, as $V_\phi$ improves, the bias decreases. The learning process bootstraps - actor and critic improve together.
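A toy calculation (all numbers invented) makes the effect of an inaccurate critic concrete:

```python
# Invented values for one transition.
reward, gamma = 1.0, 0.99

# True values under the current policy...
true_v_s, true_v_next = 3.0, 2.5
# ...and a critic that underestimates the next state.
est_v_s, est_v_next = 3.0, 1.5

true_advantage = reward + gamma * true_v_next - true_v_s   # 0.475 > 0
td_advantage = reward + gamma * est_v_next - est_v_s       # -0.515 < 0

# The inaccurate critic flips the sign of the learning signal: the actor
# would suppress an action that is actually better than average.
```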
Think of it like trusting a friend’s advice:
- Monte Carlo (REINFORCE): Wait to see the actual outcome. Unbiased but might take forever.
- Actor-Critic: Trust your friend’s estimate of how things will go. Faster, but your friend might be wrong.
As you both learn together, your friend’s estimates get better, and so does your decision-making.
Advantages Over REINFORCE
| Aspect | REINFORCE | Actor-Critic |
|---|---|---|
| Update frequency | End of episode | Every step |
| Return estimation | Monte Carlo (full) | Bootstrapped (TD) |
| Variance | High | Lower |
| Bias | None | Some (decreasing) |
| Long episodes | Painful | No problem |
| Sample efficiency | Lower | Higher |
Summary
The actor-critic architecture combines policy gradient with value learning:
- Actor ($\pi_\theta$): Learns what actions to take
- Critic ($V_\phi$): Learns to evaluate states
- TD error: Provides low-variance advantage estimates
- Online updates: No need to wait for episode end
- Tradeoff: Some bias for much lower variance
This is the foundation for modern algorithms like A2C, A3C, and PPO. The next sections explore the advantage function in more detail and the A2C algorithm specifically.