Chapter 113

Q-Learning

Off-policy TD control: learning the optimal policy while exploring


Q-Learning: Off-Policy TD Control

What You'll Learn

  • Explain the key difference between SARSA and Q-learning
  • Implement Q-learning for control
  • Explain what “off-policy” means and why it matters
  • Demonstrate Q-learning on Cliff Walking
  • Identify the “deadly triad” and its implications

SARSA is safe but learns the value of being epsilon-greedy. What if we could learn the optimal policy while still exploring? Q-learning does exactly this: it learns Q^* regardless of what behavior policy we follow. It’s perhaps the most important algorithm in reinforcement learning.

Q-Learning vs SARSA: CliffWalking

Compare how Q-Learning (off-policy) and SARSA (on-policy) learn different paths.

[Interactive demo: a CliffWalking grid where the agent 🤖 walks from Start to the Goal 🎯 past cliff cells (-100), with live Episodes, Steps, and Episode Reward counters.]
Key Insight: Q-Learning learns the optimal policy (walking along the cliff edge) because it uses the maximum Q-value for updates, regardless of the exploration policy. SARSA learns a safer path (going up first) because its updates reflect the actual exploratory actions taken, which might accidentally fall off the cliff.
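The two targets can be compared directly on a toy set of Q-values for a next state (the numbers here are made up for illustration):

```python
import numpy as np

Q = {"s": np.array([0.0, 5.0, 2.0])}   # toy Q-values for the next state s'
reward, gamma = -1.0, 1.0
a_next = 2                              # exploratory action actually taken

sarsa_target = reward + gamma * Q["s"][a_next]       # uses the A' actually taken
q_learning_target = reward + gamma * np.max(Q["s"])  # uses the max over actions

print(sarsa_target, q_learning_target)  # → 1.0 4.0
```

SARSA's target is dragged down by the exploratory action, while Q-learning's target assumes the greedy action will be taken from s' onward.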

Why Q-Learning?

Q-learning’s key innovation is using max over next actions instead of the actual next action. This means we’re always learning toward the optimal policy, even while behaving suboptimally. The behavior policy can be anything—as long as it tries all actions eventually.
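A common choice of behavior policy that "tries all actions eventually" is epsilon-greedy over the current Q-table. A minimal sketch (the function name and table layout are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1, rng=None):
    """Pick a random action with probability epsilon, else the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: any action can be tried
    return int(np.argmax(Q[state]))          # exploit: current greedy action
```

With epsilon > 0, every action keeps a nonzero selection probability, which is exactly the coverage condition Q-learning needs from its behavior policy.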

📖 Q-Learning

An off-policy TD control algorithm that learns the optimal action-value function Q^* directly. Unlike SARSA, Q-learning uses the maximum Q-value for the next state, learning about the greedy policy while following an exploratory one.

The Q-Learning Update

Mathematical Details

The Q-learning update rule:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]

The difference from SARSA is just one word: max instead of the actual next action A_{t+1}. But this one change is profound: it means we learn Q^* instead of Q^\pi.

Implementation

import numpy as np

def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99, done=False):
    """Single Q-learning update toward the greedy (max) target."""
    if done:
        td_target = reward
    else:
        # Key difference from SARSA: use max, not the actual next action
        td_target = reward + gamma * np.max(Q[next_state])

    td_error = td_target - Q[state][action]
    Q[state][action] += alpha * td_error
    return td_error
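Putting the update inside a full training loop, here is a self-contained sketch on a toy 5-state corridor; the environment, hyperparameters, and function names are illustrative, not from the chapter:

```python
import numpy as np

N_STATES, GOAL = 5, 4          # states 0..4, terminal at state 4
ACTIONS = [-1, +1]             # 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor: -1 reward per step until the goal."""
    next_state = min(max(state + ACTIONS[action], 0), GOAL)
    return next_state, -1.0, next_state == GOAL

def train(episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, 2))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behavior policy
            a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s2, r, done = step(s, a)
            # Q-learning target: max over next actions, not the action taken
            target = r if done else r + gamma * np.max(Q[s2])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q

Q = train()
# Greedy policy after training: move right from every non-terminal state
print([int(np.argmax(Q[s])) for s in range(GOAL)])  # → [1, 1, 1, 1]
```

Despite the exploratory behavior policy, the learned greedy policy heads straight for the goal, illustrating the off-policy property.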

Chapter Overview

This chapter introduces Q-learning, the foundation of modern deep RL. We’ll cover the update rule, what makes it off-policy, its behavior on Cliff Walking, and the stability issues known as the deadly triad.

Prerequisites

This chapter assumes you’re comfortable with TD learning, the SARSA algorithm, and epsilon-greedy exploration.

Key Questions We’ll Answer

  • What does “off-policy” mean and why is it powerful?
  • How does a one-word change (max instead of A') transform the algorithm?
  • Why does Q-learning find the risky path in Cliff Walking?
  • What is the “deadly triad” and why should we care?

The Big Picture

The relationship between SARSA and Q-learning parallels a fundamental question in RL:

Should we learn about what we do, or what we should do?

  • SARSA learns about the policy being followed (including exploration)
  • Q-learning learns about the optimal policy (ignoring exploration)

Both are valid choices with different trade-offs. Q-learning finds the truly optimal policy but ignores exploration risk; SARSA is safer during training but may not find the best policy.

Property   SARSA                    Q-Learning
Target     Q(S', A')                max_a Q(S', a)
Learns     Q^π                      Q^*
Type       On-policy                Off-policy
Safety     Safer (accounts for ε)   Riskier during training
Optimal    Eventually (as ε→0)      Yes ✓

Key Takeaways

  • Q-learning uses \max_{a'} Q(S', a') instead of Q(S', A')
  • This makes it off-policy: it learns Q^* regardless of the behavior policy
  • Q-learning finds optimal policies but ignores exploration risk
  • The deadly triad (off-policy + function approximation + bootstrapping) can cause instability
  • Double Q-learning addresses maximization bias
  • Q-learning is the foundation for DQN and modern deep RL
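The maximization-bias fix mentioned above can be sketched with two Q-tables, where one table selects the argmax action and the other evaluates it (the function name and table layout are illustrative):

```python
import numpy as np

def double_q_update(QA, QB, state, action, reward, next_state,
                    alpha=0.1, gamma=0.99, done=False, rng=None):
    """Double Q-learning update: decouple action selection from evaluation."""
    rng = rng or np.random.default_rng()
    # Flip a coin to decide which table receives this update
    Q1, Q2 = (QA, QB) if rng.random() < 0.5 else (QB, QA)
    if done:
        target = reward
    else:
        best = int(np.argmax(Q1[next_state]))           # Q1 selects the action...
        target = reward + gamma * Q2[next_state][best]  # ...Q2 evaluates it
    Q1[state][action] += alpha * (target - Q1[state][action])
```

Because the evaluating table's noise is independent of the selecting table's argmax, the target no longer systematically overestimates; for acting, one typically uses the two tables combined (e.g. their sum).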
Next Chapter: Function Approximation