Actor-Critic Methods
What You'll Learn
- Explain the actor-critic architecture and why it helps
- Define the advantage function and understand its benefits
- Implement Advantage Actor-Critic (A2C) from scratch
- Understand how TD learning provides bootstrap targets
- Navigate the bias-variance tradeoff in actor-critic methods
REINFORCE taught us to weight log-probabilities by returns. But waiting until the episode ends is slow, and Monte Carlo returns are noisy. What if we had a critic - a value function that estimates how good a state is? We could get feedback at every step, not just at episode end.
Why Actor-Critic?
Actor-critic methods combine the strengths of policy-based and value-based learning:
- The Actor (policy) learns which actions to take
- The Critic (value function) learns how good states are
The critic provides lower-variance (though biased) estimates of action quality, letting the actor learn faster and more stably than pure REINFORCE.
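To make the division of labor concrete, here is a minimal sketch on a hypothetical one-state, two-action problem (an illustration, not an algorithm from this chapter): the critic maintains a single value estimate, the actor a softmax over action preferences, and the critic's TD error drives both updates.

```python
import math
import random

random.seed(0)

prefs = [0.0, 0.0]         # actor parameters: action preferences
V = 0.0                    # critic: value estimate for the single state
alpha_actor, alpha_critic = 0.1, 0.2
true_rewards = [0.0, 1.0]  # action 1 is better (assumed for illustration)

def softmax(p):
    e = [math.exp(x) for x in p]
    s = sum(e)
    return [x / s for x in e]

for step in range(2000):
    probs = softmax(prefs)
    a = random.choices([0, 1], weights=probs)[0]
    r = true_rewards[a] + random.gauss(0, 0.1)

    td_error = r - V                 # critic's feedback (no next state here)
    V += alpha_critic * td_error     # critic update

    # Actor update: policy gradient, with the TD error as the weight
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        prefs[i] += alpha_actor * td_error * grad_log

# After training, the actor should strongly prefer action 1
```

Note that the actor never sees the rewards directly; all of its learning signal flows through the critic's TD error, which is exactly the architecture the sections below develop.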
Chapter Overview
This chapter introduces actor-critic methods, the workhorse architecture of modern deep RL. We’ll cover the advantage function, A2C, and techniques for balancing bias and variance.
- The Actor-Critic Idea: two networks working together
- Advantage Functions: how much better is this action than average?
- Advantage Actor-Critic (A2C): synchronous actor-critic training
- Generalized Advantage Estimation: balancing bias and variance in advantage estimation
The Big Picture
Think of a student (actor) and a teacher (critic):
- The student tries actions and learns from feedback
- The teacher evaluates situations and provides guidance
The teacher doesn’t tell the student exactly what to do, but says “that situation looked promising” or “you were in trouble there.” This guidance helps the student learn faster than trial-and-error alone.
Actor-critic: a family of algorithms where the actor learns a policy and the critic learns a value function. The critic's estimates guide the actor's learning.
The advantage function captures "how much better than average" an action is: A(s, a) = Q(s, a) - V(s).
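In practice we rarely learn Q directly; with a learned critic V, the one-step TD error, delta = r + gamma * V(s') - V(s), serves as an estimate of the advantage. A hypothetical helper (names and values are illustrative):

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD error, used as an estimate of the advantage A(s, a).

    value_s and value_next are the critic's estimates V(s) and V(s');
    when the episode terminates, the next state's value is taken as 0.
    """
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

# Example: critic thinks V(s) = 1.0 and V(s') = 1.5; reward 0.5 observed.
# The positive TD error says the action worked out better than expected.
adv = td_advantage(0.5, 1.0, 1.5)
```

A positive TD error increases the probability of the action taken; a negative one decreases it.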
Prerequisites
This chapter assumes familiarity with:
- The Policy Gradient Theorem from REINFORCE
- Baselines and variance reduction
- TD learning concepts (helpful but not required)
Key Takeaways
- Actor-critic combines policy gradient (actor) with value learning (critic)
- The advantage function measures relative action quality
- The one-step TD error provides a simple, low-variance estimate of the advantage
- A2C enables online learning (no need to wait for episode end)
- n-step returns and GAE balance bias and variance
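The last takeaway can be sketched in code. Generalized Advantage Estimation blends TD errors over multiple horizons with a decay parameter lambda: setting lam=0 recovers the one-step TD error (low variance, more bias), while lam=1 recovers the Monte Carlo advantage (no bias, high variance). A minimal sketch, assuming a non-terminating trajectory segment:

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory segment.

    values[t] is the critic's V(s_t); last_value bootstraps past the
    segment end. Computed backward so each advantage accumulates the
    discounted sum of future TD errors, weighted by (gamma * lam)^l.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# Illustrative values: constant rewards and a flat critic
advs = gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], last_value=0.5)
```

A full treatment, including handling of episode boundaries, appears in the Generalized Advantage Estimation section.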