Q-Learning: Off-Policy TD Control
What You'll Learn
- Explain the key difference between SARSA and Q-learning
- Implement Q-learning for control
- Explain what “off-policy” means and why it matters
- Demonstrate Q-learning on Cliff Walking
- Identify the “deadly triad” and its implications
SARSA is safe but learns the value of being epsilon-greedy. What if we could learn the optimal policy while still exploring? Q-learning does exactly this—it learns regardless of what behavior policy we follow. It’s perhaps the most important algorithm in reinforcement learning.
Q-Learning vs SARSA: CliffWalking
Compare how Q-Learning (off-policy) and SARSA (on-policy) learn different paths.
Why Q-Learning?
Q-learning’s key innovation is using max over next actions instead of the actual next action. This means we’re always learning toward the optimal policy, even while behaving suboptimally. The behavior policy can be anything—as long as it tries all actions eventually.
An off-policy TD control algorithm that learns the optimal action-value function directly. Unlike SARSA, Q-learning uses the maximum Q-value for the next state, learning about the greedy policy while following an exploratory one.
The Q-Learning Update
The Q-learning update rule:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]

The difference from SARSA is just one word: max over the next actions instead of the actual next action A_{t+1}. But this one change is profound: it means we learn q_* (the optimal action-value function) instead of q_π (the value of the exploratory behavior policy).
import numpy as np

def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99, done=False):
    """Single Q-learning update."""
    if done:
        # No bootstrapping from a terminal state
        td_target = reward
    else:
        # Key difference from SARSA: use max, not the actual next action
        td_target = reward + gamma * np.max(Q[next_state])
    td_error = td_target - Q[state][action]
    Q[state][action] += alpha * td_error
    return td_error

Chapter Overview
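As a quick sanity check, the update can be exercised on a tiny hand-built Q-table. The states, actions, and values below are arbitrary, chosen only to make the arithmetic easy to follow:

```python
import numpy as np

# Tiny Q-table: 3 states x 2 actions, illustrative values only
Q = {s: np.zeros(2) for s in range(3)}
Q[1][:] = [0.5, 2.0]   # values in the next state; the max is 2.0

alpha, gamma = 0.1, 0.99
state, action, reward, next_state = 0, 0, 1.0, 1

# Q-learning target uses the max over next actions,
# regardless of which action the agent actually takes next
td_target = reward + gamma * np.max(Q[next_state])   # 1.0 + 0.99 * 2.0 = 2.98
td_error = td_target - Q[state][action]              # 2.98 - 0.0
Q[state][action] += alpha * td_error

print(round(Q[state][action], 4))  # 0.298
```

Note that the target would be the same even if the behavior policy's next action were the worse one (value 0.5); that indifference to the action actually taken is what makes the method off-policy.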
This chapter introduces Q-learning—the foundation of modern deep RL. We’ll cover:
The Q-Learning Idea
Learning optimal values regardless of behavior policy
The Q-Learning Algorithm
Complete algorithm with implementation and examples
SARSA vs Q-Learning
The CliffWalking experiment and when to use each
Convergence and the Deadly Triad
When Q-learning works and when it breaks
Prerequisites
This chapter assumes you’re comfortable with:
- SARSA — The on-policy TD control algorithm
- Introduction to TD Learning — TD updates and the TD error
Key Questions We’ll Answer
- What does “off-policy” mean and why is it powerful?
- How does a one-word change (max over next actions vs. the actual next action A_{t+1}) transform the algorithm?
- Why does Q-learning find the risky path in Cliff Walking?
- What is the “deadly triad” and why should we care?
The Big Picture
The relationship between SARSA and Q-learning parallels a fundamental question in RL:
Should we learn about what we do, or what we should do?
- SARSA learns about the policy being followed (including exploration)
- Q-learning learns about the optimal policy (ignoring exploration)
Both are valid choices with different trade-offs. Q-learning converges to the true optimal policy but ignores exploration risk; SARSA is safer during training but may settle for a more conservative policy.
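The trade-off shows up cleanly if the two TD targets are written side by side. This is a sketch, and the epsilon-greedy helper is an assumed stand-in for whatever behavior policy is actually in use:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon=0.1):
    # Behavior policy used by BOTH algorithms while training
    if rng.random() < epsilon:
        return int(rng.integers(len(Q[state])))
    return int(np.argmax(Q[state]))

def sarsa_target(Q, reward, next_state, gamma=0.99, epsilon=0.1):
    # On-policy: bootstrap from the action the behavior policy will take,
    # so exploratory (possibly bad) actions leak into the learned values
    next_action = epsilon_greedy(Q, next_state, epsilon)
    return reward + gamma * Q[next_state][next_action]

def q_learning_target(Q, reward, next_state, gamma=0.99):
    # Off-policy: bootstrap from the greedy (max) action,
    # regardless of what the behavior policy does next
    return reward + gamma * np.max(Q[next_state])
```

With epsilon = 0 the two targets coincide; the gap between them grows with the amount of exploration, which is exactly why SARSA's values reflect exploration risk and Q-learning's do not.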
Key Takeaways
- Q-learning uses max_a Q(S_{t+1}, a) instead of Q(S_{t+1}, A_{t+1}) in its TD target
- This makes it off-policy: it learns regardless of behavior
- Q-learning finds optimal policies but ignores exploration risk
- The deadly triad (off-policy + function approximation + bootstrapping) can cause instability
- Double Q-learning addresses maximization bias
- Q-learning is the foundation for DQN and modern deep RL
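The double Q-learning fix mentioned above can be sketched as two tables that decouple action selection from action evaluation. This is a minimal illustration of one update step, not a full implementation (the complete algorithm swaps the roles of the two tables at random on each step):

```python
import numpy as np

def double_q_update(Q1, Q2, state, action, reward, next_state,
                    alpha=0.1, gamma=0.99, done=False):
    """One double Q-learning step: Q1 selects the next action,
    Q2 evaluates it, which suppresses maximization bias."""
    if done:
        td_target = reward
    else:
        best_next = int(np.argmax(Q1[next_state]))              # select with Q1
        td_target = reward + gamma * Q2[next_state][best_next]  # evaluate with Q2
    Q1[state][action] += alpha * (td_target - Q1[state][action])
```

Because the table that picks the argmax is not the one that supplies the value, a state-action pair that looks good only through noise in Q1 is unlikely to also be overestimated by Q2, so the upward bias of the plain max largely cancels.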