Chapter 203
📝Draft

Actor-Critic Methods

Combining policy and value learning for stability


What You'll Learn

  • Explain the actor-critic architecture and why it helps
  • Define the advantage function and understand its benefits
  • Implement Advantage Actor-Critic (A2C) from scratch
  • Understand how TD learning provides bootstrap targets
  • Navigate the bias-variance tradeoff in actor-critic methods

REINFORCE taught us to weight log-probabilities by returns. But waiting until the end of an episode to update is slow, and Monte Carlo returns are noisy. What if we had a critic, a value function that could estimate how good a state is? We could get feedback every step, not just at episode end.

Why Actor-Critic?

Actor-critic methods combine the best of both worlds:

  • The Actor (policy) learns which actions to take
  • The Critic (value function) learns how good states are

The critic provides low-variance estimates of action quality, enabling the actor to learn faster and more stably than pure REINFORCE.
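The two-component structure can be sketched in a few lines. This is a minimal illustration, not the chapter's full A2C implementation: both components are plain linear models in numpy, and all shapes and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions = 4, 2  # illustrative sizes

# Actor: linear policy producing action probabilities via softmax.
W_actor = rng.normal(scale=0.1, size=(obs_dim, n_actions))
# Critic: linear value function V(s).
w_critic = rng.normal(scale=0.1, size=obs_dim)

def policy(state):
    """Actor: map a state to a distribution over actions."""
    logits = state @ W_actor
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def value(state):
    """Critic: map a state to a scalar estimate of how good it is."""
    return float(state @ w_critic)

state = rng.normal(size=obs_dim)
print(policy(state))  # action probabilities, summing to 1
print(value(state))   # scalar state-value estimate
```

During training, the critic's value estimates would be regressed toward observed targets while the actor's weights move in the direction suggested by the critic, which is exactly the loop developed later in the chapter.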

Chapter Overview

This chapter introduces actor-critic methods, the workhorse architecture of modern deep RL. We’ll cover the advantage function, A2C, and techniques for balancing bias and variance.

The Big Picture

Think of a student (actor) and a teacher (critic):

  • The student tries actions and learns from feedback
  • The teacher evaluates situations and provides guidance

The teacher doesn’t tell the student exactly what to do, but says “that situation looked promising” or “you were in trouble there.” This guidance helps the student learn faster than trial-and-error alone.

📖Actor-Critic

A family of algorithms where the actor learns a policy $\pi_\theta(a|s)$ and the critic learns a value function $V_\phi(s)$. The critic’s estimates guide the actor’s learning.

The advantage function captures “how much better than average” an action is:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
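As the Key Takeaways note, the one-step TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ serves as a practical estimate of this advantage: it compares what actually happened (one reward plus the bootstrapped value of the next state) against what the critic expected. A tiny sketch, with made-up critic values for illustration:

```python
gamma = 0.99  # discount factor (illustrative choice)

def td_error(reward, v_s, v_next, done):
    """One-step advantage estimate: delta = r + gamma * V(s') - V(s).

    If the episode ended at this step, there is no next state to
    bootstrap from, so the gamma * V(s') term is dropped.
    """
    target = reward + (0.0 if done else gamma * v_next)
    return target - v_s

# Suppose the critic estimates V(s) = 1.0 and V(s') = 1.5,
# and the agent received reward 0.5 on this transition.
print(td_error(0.5, v_s=1.0, v_next=1.5, done=False))  # 0.5 + 0.99*1.5 - 1.0 = 0.985
```

A positive $\delta_t$ means the transition went better than the critic expected, so the action just taken should be made more likely; a negative $\delta_t$ means the opposite.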

Prerequisites

This chapter assumes familiarity with:

  • The Policy Gradient Theorem from REINFORCE
  • Baselines and variance reduction
  • TD learning concepts (helpful but not required)

Key Takeaways

  • Actor-critic combines policy gradient (actor) with value learning (critic)
  • The advantage function $A(s,a) = Q(s,a) - V(s)$ measures relative action quality
  • The TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ estimates the advantage
  • A2C enables online learning (no need to wait for episode end)
  • n-step returns and GAE balance bias and variance
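The last takeaway, trading bias against variance with Generalized Advantage Estimation (GAE), can be previewed with a short sketch. The $\gamma$ and $\lambda$ values and the toy trajectory below are illustrative; the chapter develops the method properly.

```python
import numpy as np

gamma, lam = 0.99, 0.95  # discount and GAE mixing parameter (illustrative)

def gae(rewards, values, last_value, dones):
    """Compute GAE advantages over a trajectory segment, backward in time.

    lam = 0 recovers one-step TD errors (low variance, more bias);
    lam = 1 recovers full returns minus the baseline (high variance, no bias).
    """
    adv = np.zeros(len(rewards))
    next_v, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_v * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
        next_v = values[t]
    return adv

# Toy 3-step segment with made-up rewards and critic values.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.6, 0.7]
print(gae(rewards, values, last_value=0.8, dones=[False, False, False]))
```

Each advantage is an exponentially weighted mix of TD errors at different horizons, which is exactly the bias-variance dial the chapter explores.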
Next Chapter: Proximal Policy Optimization