Chapter 203
📝Draft

Actor-Critic Methods

Combining policy and value learning for stability


What You'll Learn

  • Explain the actor-critic architecture and why it helps
  • Define the advantage function and understand its benefits
  • Implement Advantage Actor-Critic (A2C) from scratch
  • Understand how TD learning provides bootstrap targets
  • Navigate the bias-variance tradeoff in actor-critic methods

REINFORCE taught us to weight log-probabilities by returns. But waiting until the end of an episode to update is slow, and Monte Carlo returns are noisy. What if we had a critic, a value function that could estimate how good a state is? We could get feedback every step, not just at episode end.

Why Actor-Critic?

Actor-critic methods combine the best of both worlds:

  • The Actor (policy) learns which actions to take
  • The Critic (value function) learns how good states are

The critic provides low-variance estimates of action quality, enabling the actor to learn faster and more stably than pure REINFORCE.
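The two-component structure can be sketched in a few lines. This is a minimal illustration, not the chapter's full A2C implementation: both components are plain linear models in numpy, and all shapes and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions = 4, 2  # illustrative sizes

# Actor: linear policy producing action probabilities via softmax.
W_actor = rng.normal(scale=0.1, size=(obs_dim, n_actions))
# Critic: linear value function V(s).
w_critic = rng.normal(scale=0.1, size=obs_dim)

def policy(state):
    """Actor: map a state to a distribution over actions."""
    logits = state @ W_actor
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def value(state):
    """Critic: map a state to a scalar estimate of how good it is."""
    return float(state @ w_critic)

state = rng.normal(size=obs_dim)
print(policy(state))  # action probabilities, summing to 1
print(value(state))   # scalar state-value estimate
```

During training, the critic's value estimates would be regressed toward observed targets while the actor's weights move in the direction suggested by the critic, which is exactly the loop developed later in the chapter.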

Chapter Overview

This chapter introduces actor-critic methods, the workhorse architecture of modern deep RL. We’ll cover the advantage function, A2C, and techniques for balancing bias and variance.

The Big Picture

Think of a student (actor) and a teacher (critic):

  • The student tries actions and learns from feedback
  • The teacher evaluates situations and provides guidance

The teacher doesn’t tell the student exactly what to do, but says “that situation looked promising” or “you were in trouble there.” This guidance helps the student learn faster than trial-and-error alone.

📖Actor-Critic

A family of algorithms where the actor learns a policy $\pi_\theta(a|s)$ and the critic learns a value function $V_\phi(s)$. The critic’s estimates guide the actor’s learning.

The advantage function captures “how much better than average” an action is:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
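As the Key Takeaways note, the one-step TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ serves as a practical estimate of this advantage: it compares what actually happened (one reward plus the bootstrapped value of the next state) against what the critic expected. A tiny sketch, with made-up critic values for illustration:

```python
gamma = 0.99  # discount factor (illustrative choice)

def td_error(reward, v_s, v_next, done):
    """One-step advantage estimate: delta = r + gamma * V(s') - V(s).

    If the episode ended at this step, there is no next state to
    bootstrap from, so the gamma * V(s') term is dropped.
    """
    target = reward + (0.0 if done else gamma * v_next)
    return target - v_s

# Suppose the critic estimates V(s) = 1.0 and V(s') = 1.5,
# and the agent received reward 0.5 on this transition.
print(td_error(0.5, v_s=1.0, v_next=1.5, done=False))  # 0.5 + 0.99*1.5 - 1.0 = 0.985
```

A positive $\delta_t$ means the transition went better than the critic expected, so the action just taken should be made more likely; a negative $\delta_t$ means the opposite.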

Prerequisites

This chapter assumes familiarity with:

  • The Policy Gradient Theorem from REINFORCE
  • Baselines and variance reduction
  • TD learning concepts (helpful but not required)

Key Takeaways

  • Actor-critic combines policy gradient (actor) with value learning (critic)
  • The advantage function $A(s,a) = Q(s,a) - V(s)$ measures relative action quality
  • The TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ estimates the advantage
  • A2C enables online learning (no need to wait for episode end)
  • n-step returns and GAE balance bias and variance
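The last takeaway, trading bias against variance with Generalized Advantage Estimation (GAE), can be previewed with a short sketch. The $\gamma$ and $\lambda$ values and the toy trajectory below are illustrative; the chapter develops the method properly.

```python
import numpy as np

gamma, lam = 0.99, 0.95  # discount and GAE mixing parameter (illustrative)

def gae(rewards, values, last_value, dones):
    """Compute GAE advantages over a trajectory segment, backward in time.

    lam = 0 recovers one-step TD errors (low variance, more bias);
    lam = 1 recovers full returns minus the baseline (high variance, no bias).
    """
    adv = np.zeros(len(rewards))
    next_v, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_v * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
        next_v = values[t]
    return adv

# Toy 3-step segment with made-up rewards and critic values.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.6, 0.7]
print(gae(rewards, values, last_value=0.8, dones=[False, False, False]))
```

Each advantage is an exponentially weighted mix of TD errors at different horizons, which is exactly the bias-variance dial the chapter explores.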
Next Chapter: Proximal Policy Optimization