The Q-Learning Idea
Q-learning is perhaps the most influential algorithm in reinforcement learning. It’s the foundation of DQN, which famously learned to play Atari games, and its descendants power many modern AI systems. The core idea is elegant: learn the optimal value function directly, regardless of how you’re actually behaving.
From SARSA to Q-Learning
Recall SARSA’s approach:
- Take action a in state s
- Observe reward r and next state s'
- Choose next action a' from your policy
- Update using Q(s, a) ← Q(s, a) + α[r + γ Q(s', a') − Q(s, a)]
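In code, one SARSA step looks like this (a minimal tabular sketch; the dict-based Q-table, state encoding, and hyperparameter values are illustrative, not from the text):

```python
# One tabular SARSA update. Q is a dict mapping (state, action) -> value;
# missing entries default to 0.0. Hyperparameters are illustrative.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Move Q(s, a) toward the SARSA target r + gamma * Q(s', a')."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```

Note that the target uses Q(s', a') for the action a' that the policy actually chose, exploration and all.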
SARSA uses the action you’ll actually take next. This means if your policy sometimes explores randomly, the Q-values account for that exploration.
Q-learning changes one thing: instead of using the next action a', use the best action:
Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
This single change transforms the algorithm from on-policy to off-policy.
The One-Word Difference
SARSA (on-policy):
Q(s, a) ← Q(s, a) + α[r + γ Q(s', a') − Q(s, a)]
Q-Learning (off-policy):
Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
The only difference: max_a' Q(s', a') replaces the value of the actual next action, Q(s', a').
But this small change has profound implications:
- SARSA learns Q^π — the value of the current policy π
- Q-learning learns Q* — the value of the optimal policy
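The one-word difference is literally a one-line difference in code (a minimal tabular sketch with illustrative names and hyperparameters):

```python
# One tabular Q-learning update. The only change from SARSA is the TD
# target: max over next actions instead of the action actually taken next.
# Q is a dict mapping (state, action) -> value; defaults are illustrative.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Move Q(s, a) toward the Q-learning target r + gamma * max_a' Q(s', a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next    # SARSA would use Q(s', a') here
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
    return Q
```

Notice the update never needs to know which action the agent will actually take next; that is what makes it off-policy.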
What Does Off-Policy Mean?
Learning about one policy (the target policy) while following a different policy (the behavior policy). In Q-learning, the target policy is greedy with respect to Q, while the behavior policy can be anything that ensures sufficient exploration.
Think of off-policy learning like this:
On-policy (SARSA): “I’m learning how good my actions are, given that I’ll keep behaving the same way—including my mistakes and random exploration.”
Off-policy (Q-learning): “I’m learning how good actions would be if I always acted optimally afterwards—even though I’m not actually acting optimally right now.”
Q-learning separates learning from behavior. You can follow a wildly exploratory policy, even a random one, and still learn the optimal Q-values.
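A tiny demonstration of this separation (the 3-state chain environment here is invented for illustration): follow a purely random behavior policy, apply Q-learning updates, and the Q-values still approach the optimal ones.

```python
# Off-policy demo: act randomly, yet learn optimal Q-values.
# Environment (illustrative): states 0 and 1; state 2 is terminal.
# Action 1 moves right, action 0 moves left (floor at 0).
# Reward 1.0 for stepping into the terminal state, else 0.0.
import random

random.seed(0)
GAMMA, ALPHA, ACTIONS, TERMINAL = 0.9, 0.5, (0, 1), 2
Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}

def step(s, a):
    """Deterministic chain dynamics."""
    s_next = min(s + 1, TERMINAL) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == TERMINAL else 0.0)

for _ in range(500):
    s = 0
    while s != TERMINAL:
        a = random.choice(ACTIONS)          # behavior policy: uniformly random
        s_next, r = step(s, a)
        best_next = 0.0 if s_next == TERMINAL else max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])  # greedy target
        s = s_next

# Q approaches Q* even though we never acted greedily:
# Q*(1, right) = 1.0 and Q*(0, right) = gamma * 1.0 = 0.9
```

Running SARSA on the same random behavior would instead converge to the value of the random policy, which is much lower.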
Why Is This Powerful?
Off-policy learning gives us several advantages:
Learn the Optimal Policy
Q-learning converges to Q* regardless of what policy you follow (given sufficient exploration).
Learn from Any Data
You can learn from data collected by other policies, human demonstrations, or even random exploration.
Experience Replay
You can store transitions and learn from them multiple times—key for sample efficiency in deep RL.
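A replay buffer can be sketched in a few lines (the class name, capacity, and transition layout are illustrative):

```python
# Sketch of an experience replay buffer. Because Q-learning is off-policy,
# old transitions remain valid training data even after the policy that
# generated them has changed.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop off

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniformly sample a batch of stored transitions."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

An on-policy method could not reuse this data freely, since the stored next actions came from stale policies.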
Flexible Exploration
The behavior policy can be anything: epsilon-greedy, Boltzmann, UCB, or even human control.
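Two of these behavior policies, sketched for a tabular Q function (function names and default parameters are illustrative):

```python
# Two common behavior policies over a dict-based tabular Q function.
import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def boltzmann(Q, s, actions, temperature=1.0):
    """Sample actions with probability proportional to exp(Q / temperature)."""
    weights = [math.exp(Q.get((s, a), 0.0) / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

Either one works as the behavior policy; the Q-learning target stays greedy regardless of which is used to generate experience.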
The Two Policies
In off-policy learning, we maintain two distinct policies:
Behavior Policy (b): The policy we actually follow. This is what generates our experience. It must explore to ensure we visit all important state-action pairs.
Target Policy (π): The policy we’re learning about. In Q-learning, this is the greedy policy:
π(s) = argmax_a Q(s, a)
We follow b but learn about π. The Q-values we learn reflect what would happen if we followed π, not b.
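Extracting the target policy from a Q-table is a one-liner (sketch; names are illustrative):

```python
# The target policy in Q-learning is just "act greedily with respect to Q".

def greedy_policy(Q, actions):
    """Return a function mapping state -> argmax_a Q(s, a)."""
    def pi(s):
        return max(actions, key=lambda a: Q.get((s, a), 0.0))
    return pi
```

Nothing about this policy is ever executed during training; it exists implicitly inside the max of the update rule.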
The Power of Off-Policy Learning
Imagine you’re training a robot to navigate a warehouse. You have:
- Data from human operators (who sometimes make suboptimal choices)
- Data from a previous, imperfect navigation system
- Data from safe, random exploration
With SARSA (on-policy), you’d need to collect fresh data for every policy you evaluate.
With Q-learning (off-policy), you can:
- Pool all your data together
- Learn from all of it simultaneously
- Converge to the optimal policy even though none of your data came from that policy
This is why off-policy learning is so valuable for real-world applications.
The Intuition Behind Max
Why does using max_a' Q(s', a') instead of Q(s', a') lead to learning Q*?
The key is what the TD target represents:
SARSA’s target: r + γ Q(s', a')
- “The reward I got, plus what I expect if I keep following my current policy”
Q-learning’s target: r + γ max_a' Q(s', a')
- “The reward I got, plus what I expect if I act optimally from now on”
By always asking “what if I were optimal?”, Q-learning learns the value of being optimal—Q*.
The actual behavior doesn’t matter (as long as we explore enough) because we’re always updating toward optimal future behavior.
The Bellman optimality equation for Q* is:
Q*(s, a) = E[r + γ max_a' Q*(s', a') | s, a]
Q-learning is essentially sampling this equation:
- We observe a transition (s, a, r, s')
- We use r + γ max_a' Q(s', a') as our estimate of what the right-hand side should be
- We update Q(s, a) toward this target
Over time, with enough samples and the right conditions (every state-action pair visited infinitely often, appropriately decaying step sizes), Q(s, a) converges to Q*(s, a).
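A quick numeric sanity check of the sampled update (the numbers are made up for illustration): if the sampled target happens to be constant, repeated updates converge to exactly that target.

```python
# Suppose from (s, a) we always observe r = 1 and the best next value is 4,
# with gamma = 0.5, so the target is always 1 + 0.5 * 4 = 3.
alpha, gamma = 0.1, 0.5
q, r, best_next = 0.0, 1.0, 4.0
for _ in range(200):
    q += alpha * (r + gamma * best_next - q)   # repeated sampled updates

# q has converged to the fixed point of the target, 3.0
```

In the real algorithm the target moves as neighboring Q-values improve, but the same pull-toward-the-target dynamic drives the whole table toward Q*.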
The Trade-Off: Optimality vs. Safety
Q-learning learns optimal values, but there’s a catch: it ignores the exploration risk.
When we use max_a' Q(s', a'), we’re assuming we’ll take the best action next time. But if we’re actually exploring with epsilon-greedy, we might take a random (possibly bad) action.
This makes Q-learning learn riskier strategies than SARSA—strategies that assume perfect future behavior. In the Cliff Walking environment, this leads Q-learning right along the cliff edge, where random exploration can be fatal.
We’ll explore this trade-off in detail when we compare SARSA and Q-learning on Cliff Walking.
Historical Note
Q-learning was developed by Chris Watkins in his 1989 PhD thesis. The name “Q” was chosen simply as a symbol for the action-value function, following the convention of using V for state values. Watkins’ proof that Q-learning converges to optimal values was a landmark result that established the theoretical foundations of model-free reinforcement learning.
Summary
The Q-learning idea is beautifully simple:
- Use max_a' Q(s', a') in the TD target instead of the value of the actual next action
- This separates behavior from learning — you can explore freely while learning optimal values
- Q-learning converges to Q* — the optimal action-value function
- Off-policy learning enables — learning from any data, experience replay, and flexible exploration
In the next section, we’ll see the complete Q-learning algorithm and implement it step by step.