Temporal Difference Learning • Part 1 of 4
The Q-Learning Idea

Learning optimal values regardless of behavior

Q-learning is perhaps the most influential algorithm in reinforcement learning. It’s the foundation of DQN, which famously learned to play Atari games, and its descendants power many modern AI systems. The core idea is elegant: learn the optimal value function directly, regardless of how you’re actually behaving.

From SARSA to Q-Learning

Recall SARSA’s approach:

  • Take action A in state S
  • Observe reward R and next state S'
  • Choose next action A' from your policy
  • Update Q(S, A) using Q(S', A')

SARSA uses the action you’ll actually take next. This means if your policy sometimes explores randomly, the Q-values account for that exploration.

Q-learning changes one thing: instead of using the next action AA', use the best action:

\max_{a'} Q(S', a')

This single change transforms the algorithm from on-policy to off-policy.

The One-Word Difference

Mathematical Details

SARSA (on-policy): Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]

Q-Learning (off-policy): Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]

The only difference: \max_{a'} Q(S_{t+1}, a') replaces the actual next action A_{t+1}.

But this small change has profound implications:

  • SARSA learns Q^\pi — the value of the current policy
  • Q-learning learns Q^* — the value of the optimal policy
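The one-word difference is easiest to see side by side in code. The sketch below assumes a tabular Q stored as a NumPy array indexed by state and action; the function names and hyperparameters are illustrative, not from the original:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action actually taken next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best next action, regardless
    of which action the behavior policy will actually take."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

Note that `q_learning_update` does not even need to know the next action — that is the off-policy property in miniature.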

What Does Off-Policy Mean?

📖Off-Policy Learning

Learning about one policy (the target policy) while following a different policy (the behavior policy). In Q-learning, the target policy is greedy with respect to Q, while the behavior policy can be anything that ensures sufficient exploration.

Think of off-policy learning like this:

On-policy (SARSA): “I’m learning how good my actions are, given that I’ll keep behaving the same way—including my mistakes and random exploration.”

Off-policy (Q-learning): “I’m learning how good actions would be if I always acted optimally afterwards—even though I’m not actually acting optimally right now.”

Q-learning separates learning from behavior. You can follow a wildly exploratory policy, even a random one, and still learn the optimal Q-values.

Why Is This Powerful?

Off-policy learning gives us several advantages:

Learn the Optimal Policy

Q-learning converges to Q^* regardless of what policy you follow (given sufficient exploration).

Learn from Any Data

You can learn from data collected by other policies, human demonstrations, or even random exploration.

Experience Replay

You can store transitions and learn from them multiple times—key for sample efficiency in deep RL.

Flexible Exploration

The behavior policy can be anything: epsilon-greedy, Boltzmann, UCB, or even human control.
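Of these advantages, experience replay is worth a concrete sketch: because Q-learning is off-policy, transitions stored long ago (generated by an older, different policy) are still valid training data. A minimal buffer, with hypothetical names, might look like:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store off-policy transitions and sample them for repeated updates."""

    def __init__(self, capacity=10000):
        # deque with maxlen silently drops the oldest transitions
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random minibatch, sampled without replacement
        return random.sample(self.buffer, batch_size)
```

Each stored transition can be replayed many times with the Q-learning target; the same trick would bias SARSA, since the stored next action no longer matches the current policy.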

The Two Policies

In off-policy learning, we maintain two distinct policies:

Behavior Policy (\mu): The policy we actually follow. This is what generates our experience. It must explore to ensure we visit all important state-action pairs.

Target Policy (\pi): The policy we’re learning about. In Q-learning, this is the greedy policy: \pi(s) = \arg\max_a Q(s, a)

We follow \mu but learn about \pi. The Q-values we learn reflect what would happen if we followed \pi, not \mu.
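The two policies can share the same Q-table. A minimal sketch, assuming an epsilon-greedy behavior policy (the function names are illustrative):

```python
import numpy as np

def behavior_policy(Q, s, epsilon=0.1, rng=None):
    """mu: epsilon-greedy over Q — explores and generates experience."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[s]))                # exploit: greedy action

def target_policy(Q, s):
    """pi: greedy with respect to Q — the policy being learned about."""
    return int(np.argmax(Q[s]))
```

With epsilon = 0 the two coincide; with epsilon > 0 we act with \mu while the max in the update keeps evaluating \pi.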

📌Example

The Power of Off-Policy Learning

Imagine you’re training a robot to navigate a warehouse. You have:

  • Data from human operators (who sometimes make suboptimal choices)
  • Data from a previous, imperfect navigation system
  • Data from safe, random exploration

With SARSA (on-policy), you’d need to collect fresh data for every policy you evaluate.

With Q-learning (off-policy), you can:

  1. Pool all your data together
  2. Learn from all of it simultaneously
  3. Converge to the optimal policy even though none of your data came from that policy

This is why off-policy learning is so valuable for real-world applications.
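The pooling step above can be sketched directly: run Q-learning sweeps over one fixed dataset, with no further data collection. The toy 2-state MDP and all names here are hypothetical, purely for illustration:

```python
import numpy as np

def q_learning_from_batch(transitions, n_states, n_actions,
                          alpha=0.1, gamma=0.9, sweeps=500):
    """Sweep Q-learning updates over a fixed, pooled dataset.
    The rows may come from any mixture of policies."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (r + bootstrap - Q[s, a])
    return Q

# Hypothetical pooled data: human operators, an old controller, and
# random exploration could each have contributed rows like these.
pooled = [
    (0, 0, 0.0, 1, False),   # s=0, a=0: no reward, move to state 1
    (0, 1, 1.0, 0, True),    # s=0, a=1: reward 1, episode ends
    (1, 0, 2.0, 0, True),    # s=1, a=0: reward 2, episode ends
    (1, 1, 0.0, 0, True),    # s=1, a=1: nothing, episode ends
]
Q = q_learning_from_batch(pooled, n_states=2, n_actions=2)
```

In this toy MDP the recovered Q-values approach Q^*: taking a=0 in state 0 (worth \gamma \cdot 2 = 1.8) beats the immediate reward of 1, even though no single row of data demonstrates that two-step plan.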

The Intuition Behind Max

Why does using \max instead of A' lead to learning Q^*?

The key is what the TD target represents:

SARSA’s target: R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})

  • “The reward I got, plus what I expect if I keep following my current policy”

Q-learning’s target: R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')

  • “The reward I got, plus what I expect if I act optimally from now on”

By always asking “what if I were optimal?”, Q-learning learns the value of being optimal — Q^*.

The actual behavior doesn’t matter (as long as we explore enough) because we’re always updating toward optimal future behavior.

Mathematical Details

The Bellman optimality equation for Q^* is:

Q^*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \mid S_t = s, A_t = a\right]

Q-learning is essentially sampling this equation:

  • We observe a transition (s, a, r, s')
  • We use r + \gamma \max_{a'} Q(s', a') as our estimate of the right-hand side
  • We update Q(s, a) toward this target

Over time, given enough samples, repeated visits to every state-action pair, and appropriately decaying step sizes, Q converges to Q^*.
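This convergence claim can be checked on a toy problem: run Q-learning with a uniformly random behavior policy and see Q approach Q^*. The two-state deterministic MDP below is hypothetical, chosen so Q^* is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.2

# Hypothetical MDP: from either of two states, action 1 pays reward 1
# and ends the episode; action 0 pays nothing and switches states.
def step(s, a):
    return (1.0, None) if a == 1 else (0.0, 1 - s)

Q = np.zeros((2, 2))
for _ in range(2000):
    s = int(rng.integers(2))
    while s is not None:
        a = int(rng.integers(2))          # uniformly random behavior
        r, s_next = step(s, a)
        bootstrap = 0.0 if s_next is None else gamma * Q[s_next].max()
        Q[s, a] += alpha * (r + bootstrap - Q[s, a])
        s = s_next

# Known optimum: Q*(s, 1) = 1 and Q*(s, 0) = gamma * 1 = 0.9.
```

The behavior here never prefers good actions, yet the max in the target steers every update toward optimal future behavior.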

The Trade-Off: Optimality vs. Safety

We’ll explore this trade-off in detail when we compare SARSA and Q-learning on Cliff Walking.

Historical Note

ℹ️Note

Q-learning was developed by Chris Watkins in his 1989 PhD thesis. The name “Q” was chosen simply as a symbol for the action-value function, following the convention of using VV for state values. Watkins’ proof that Q-learning converges to optimal values was a landmark result that established the theoretical foundations of model-free reinforcement learning.

Summary

The Q-learning idea is beautifully simple:

  1. Use max instead of the actual next action
  2. This separates behavior from learning — you can explore freely while learning optimal values
  3. Q-learning converges to Q^* — the optimal action-value function
  4. Off-policy learning enables learning from any data, experience replay, and flexible exploration

In the next section, we’ll see the complete Q-learning algorithm and implement it step by step.