Temporal Difference Learning • Part 2 of 4

The Q-Learning Algorithm

The most famous RL algorithm


Q-learning is arguably the most important algorithm in reinforcement learning. It’s simple to implement, powerful in practice, and forms the foundation for modern deep RL methods like DQN. Let’s understand it completely.

The Q-Learning Update Rule

📖Q-Learning Update

The Q-learning algorithm updates action values using the maximum over next-state actions:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]

Let’s break down each component:

  1. Q(S_t, A_t): Our current estimate of how good it is to take action A_t in state S_t
  2. R_{t+1}: The reward we just received
  3. \max_{a'} Q(S_{t+1}, a'): The value of the best action in the next state
  4. R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'): The TD target — our new estimate of what Q(S_t, A_t) should be
  5. \alpha: Learning rate — how much to trust the new information

The update moves our estimate a fraction \alpha toward the TD target.
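In code, one application of this update is only a couple of lines. Here is a standalone sketch with a toy dictionary Q-table and made-up numbers (the states "s0"/"s1" and the values are purely illustrative):

```python
alpha, gamma = 0.1, 0.99                 # learning rate and discount

# Hypothetical Q-table: state -> list of action values (2 actions each)
Q = {"s0": [0.0, 0.0], "s1": [0.5, 1.0]}

# One transition: took action 0 in "s0", received reward -1, landed in "s1"
td_target = -1 + gamma * max(Q["s1"])    # R + γ max_a' Q(S', a')
td_error = td_target - Q["s0"][0]        # δ: target minus current estimate
Q["s0"][0] += alpha * td_error           # move a fraction α toward the target
```

With these numbers the target is -1 + 0.99 × 1.0 = -0.01, so Q("s0", 0) moves from 0 to -0.001.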

Mathematical Details

TD Error for Q-Learning:

\delta_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)

The TD error measures the “surprise”—the difference between our prediction and our revised estimate.

The update in terms of TD error:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \cdot \delta_t

Optimal Policy Extraction:

Once we have Q^*, the optimal policy is simply:

\pi^*(s) = \arg\max_a Q^*(s, a)
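Extracting the greedy policy from a learned table is a one-liner. A small sketch with hypothetical Q-values (the states and numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical learned Q-table: state -> array of 4 action values
Q = {(0, 0): np.array([0.1, 0.7, -0.2, 0.3]),
     (0, 1): np.array([0.0, 0.2, 0.9, 0.1])}

# Greedy policy extraction: pick the argmax action in every state
policy = {s: int(np.argmax(q)) for s, q in Q.items()}
# policy[(0, 0)] == 1 and policy[(0, 1)] == 2
```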

The Complete Algorithm

Here’s the full Q-learning algorithm:

Algorithm: Q-Learning (Off-Policy TD Control)
──────────────────────────────────────────────
Parameters: step size α ∈ (0, 1], discount γ, exploration ε

Initialize Q(s, a) arbitrarily for all s, a
Initialize Q(terminal, ·) = 0

Loop for each episode:
    Initialize S

    Loop for each step of episode:
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S'
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) - Q(S, A)]
        S ← S'
    until S is terminal
ℹ️Note

Notice the key difference from SARSA: we don’t need to choose the next action A' before updating. Q-learning uses \max_{a'} Q(S', a') — we consider all possible next actions, not just the one we’ll actually take.

Comparison with SARSA

| Aspect | SARSA | Q-Learning |
| --- | --- | --- |
| Next-state value | Q(S', A') | max_{a'} Q(S', a') |
| Needs next action? | Yes (before update) | No |
| Policy learned | Q^π (current policy) | Q^* (optimal) |
| Type | On-policy | Off-policy |
📌Example

Why Q-Learning Doesn’t Need A’

In SARSA, the episode loop looks like:

Choose A from S
Loop:
    Take action A, observe R, S'
    Choose A' from S'
    Update Q(S, A) using Q(S', A')
    S ← S', A ← A'

In Q-learning, it’s simpler:

Loop:
    Choose A from S
    Take action A, observe R, S'
    Update Q(S, A) using max Q(S', *)
    S ← S'

Q-learning doesn’t need to know what action we’ll take next—it considers all actions via max.
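The difference between the two loops comes down to one line: which next-state value goes into the target. A minimal sketch with hypothetical numbers (the reward, Q-values, and sampled action below are invented for illustration):

```python
import numpy as np

gamma = 0.99
reward = -1.0
Q_next = np.array([0.2, 0.8, 0.5])   # hypothetical Q(S', ·) for 3 actions
next_action = 2                       # the A' that SARSA actually samples

# SARSA bootstraps on the action it will really take in S'
sarsa_target = reward + gamma * Q_next[next_action]

# Q-learning bootstraps on the best action in S', whatever we end up doing
qlearning_target = reward + gamma * Q_next.max()
```

Here the SARSA target is -1 + 0.99 × 0.5 = -0.505, while the Q-learning target is -1 + 0.99 × 0.8 = -0.208: Q-learning is more optimistic because it always assumes best-case behavior from the next state onward.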

Complete Implementation

</>Implementation
import numpy as np
from collections import defaultdict

class QLearningAgent:
    """Q-learning agent for discrete state-action spaces."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        """
        Initialize Q-learning agent.

        Args:
            n_actions: Number of available actions
            alpha: Learning rate
            gamma: Discount factor
            epsilon: Exploration rate for epsilon-greedy
        """
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon

        # Q-table: maps state to array of action values
        self.Q = defaultdict(lambda: np.zeros(n_actions))

    def select_action(self, state):
        """Select action using epsilon-greedy policy."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(self.Q[state])

    def update(self, state, action, reward, next_state, done):
        """
        Perform Q-learning update.

        Args:
            state: Current state
            action: Action taken
            reward: Reward received
            next_state: Next state
            done: Whether episode ended

        Returns:
            td_error: The TD error for this transition
        """
        # Key difference from SARSA: use max over all next actions
        if done:
            td_target = reward
        else:
            td_target = reward + self.gamma * np.max(self.Q[next_state])

        td_error = td_target - self.Q[state][action]
        self.Q[state][action] += self.alpha * td_error

        return td_error

    def get_greedy_policy(self):
        """Return the current greedy policy."""
        policy = {}
        for state in self.Q:
            policy[state] = np.argmax(self.Q[state])
        return policy


def train_qlearning_episode(env, agent):
    """
    Train one episode with Q-learning.

    Returns:
        total_reward: Sum of rewards in the episode
        steps: Number of steps taken
    """
    state = env.reset()
    total_reward = 0
    steps = 0
    done = False

    while not done:
        # Choose action
        action = agent.select_action(state)

        # Take action, observe result
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
        steps += 1

        # Q-learning update (note: no next_action needed!)
        agent.update(state, action, reward, next_state, done)

        # Move to next state
        state = next_state

    return total_reward, steps


def train_qlearning(env, agent, num_episodes=1000, verbose=True):
    """Train Q-learning agent for multiple episodes."""
    rewards = []
    steps_list = []

    for episode in range(num_episodes):
        total_reward, steps = train_qlearning_episode(env, agent)
        rewards.append(total_reward)
        steps_list.append(steps)

        if verbose and episode % 100 == 0:
            avg_reward = np.mean(rewards[-100:]) if rewards else 0
            print(f"Episode {episode}, Avg Reward: {avg_reward:.2f}")

    return rewards, steps_list

A Worked Example

📌Example

Q-Learning in GridWorld

Consider a 3x3 grid with the goal at (2,2). Initial Q-values are zero, with α = 0.1 and γ = 0.99.

Step 1: State S = (0,0), choose A = RIGHT (epsilon-greedy, exploring)

  • Observe: R = -1, S' = (0,1)
  • TD target: -1 + 0.99 × max_a Q((0,1), a) = -1 + 0.99 × 0 = -1
  • TD error: -1 - 0 = -1
  • Update: Q((0,0), RIGHT) ← 0 + 0.1 × (-1) = -0.1

Step 2: State S = (0,1), choose A = DOWN (epsilon-greedy)

  • Observe: R = -1, S' = (1,1)
  • TD target: -1 + 0.99 × max_a Q((1,1), a) = -1 + 0.99 × 0 = -1
  • Update: Q((0,1), DOWN) ← 0 + 0.1 × (-1) = -0.1

Eventually reaching goal:

  • State S = (2,1), choose A = RIGHT
  • Observe: R = +10, S' = (2,2) (terminal)
  • TD target: 10 (no future value for terminal states)
  • Update: Q((2,1), RIGHT) ← 0 + 0.1 × 10 = 1.0

Now there’s a positive Q-value! In future episodes, this will propagate backward.

Key insight: Even though we explored randomly, the Q-values converge toward the optimal values, because we always use max in the target.
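The numbers above can be checked in a few lines using the same defaultdict Q-table as the implementation (the action index for RIGHT is an arbitrary choice here):

```python
import numpy as np
from collections import defaultdict

alpha, gamma = 0.1, 0.99
RIGHT = 3                                   # hypothetical action index
Q = defaultdict(lambda: np.zeros(4))        # all Q-values start at zero

# Step 1 from the example: (0,0) --RIGHT--> (0,1), reward -1
td_target = -1 + gamma * np.max(Q[(0, 1)])  # -1 + 0.99 * 0 = -1
Q[(0, 0)][RIGHT] += alpha * (td_target - Q[(0, 0)][RIGHT])   # becomes -0.1

# Goal transition: (2,1) --RIGHT--> (2,2) terminal, reward +10
td_target = 10.0                            # terminal: no bootstrap term
Q[(2, 1)][RIGHT] += alpha * (td_target - Q[(2, 1)][RIGHT])   # becomes 1.0
```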

Epsilon-Greedy Exploration

Like SARSA, Q-learning typically uses epsilon-greedy for the behavior policy:

</>Implementation
def epsilon_greedy(Q, state, epsilon, n_actions):
    """
    Epsilon-greedy action selection.

    With probability epsilon: random action
    With probability 1-epsilon: greedy action
    """
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)
    else:
        # Break ties randomly
        q_values = Q[state]
        max_q = np.max(q_values)
        best_actions = np.where(q_values == max_q)[0]
        return np.random.choice(best_actions)
💡Tip

Choosing Hyperparameters:

  • α (learning rate): Start with 0.1. Too high causes oscillation; too low causes slow learning.
  • γ (discount): 0.99 is standard. Lower values make the agent more short-sighted.
  • ε (exploration): 0.1 is common. Consider decaying it: high early (explore), low later (exploit).

Epsilon Decay

</>Implementation
class DecayingEpsilonQLearning(QLearningAgent):
    """Q-learning with decaying exploration."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99,
                 epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.995):
        super().__init__(n_actions, alpha, gamma, epsilon_start)
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay

    def decay_epsilon(self):
        """Call after each episode to decay epsilon."""
        self.epsilon = max(
            self.epsilon_end,
            self.epsilon * self.epsilon_decay
        )


# Usage
agent = DecayingEpsilonQLearning(n_actions=4)
for episode in range(1000):
    train_qlearning_episode(env, agent)
    agent.decay_epsilon()
    # epsilon goes: 1.0 → 0.995 → 0.990 → ... → 0.01

Why Does Q-Learning Work?

Q-learning works because of a clever property: the Bellman optimality backup (the update target with its max operator) is a contraction mapping, so it has a unique fixed point.

Every update moves Q toward the Bellman optimality equation:

Q^*(s, a) = \mathbb{E}\left[R + \gamma \max_{a'} Q^*(s', a')\right]

As long as we:

  1. Visit all state-action pairs enough times
  2. Decrease the learning rate appropriately (or keep it small)

…we’re guaranteed to converge to Q^*. The max operation ensures we’re always aiming for optimal, not just following our current behavior.
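The contraction property can be seen directly on a tiny hand-made MDP. The sketch below (all transitions and rewards are invented for illustration) repeatedly applies the Bellman optimality backup to a 2-state, 2-action table; because each backup shrinks the distance to the fixed point by a factor γ, the iterates converge to Q* regardless of initialization:

```python
import numpy as np

# Tiny deterministic 2-state, 2-action MDP (hypothetical numbers).
# P[s, a] = next state, R[s, a] = immediate reward.
P = np.array([[0, 1],
              [0, 1]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

# Repeated Bellman optimality backups: Q <- R + γ max_a' Q(s', a').
# The γ-contraction guarantees convergence; 500 sweeps is far more
# than enough here (error shrinks like γ^n).
Q = np.zeros((2, 2))
for _ in range(500):
    Q = R + gamma * Q[P].max(axis=-1)
```

For this MDP the fixed point can also be solved by hand: the optimal loop 0 → 1 → 0 gives Q*(0, 1) = (1 + 2γ) / (1 - γ²) ≈ 14.74, which the iteration reproduces.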

Visualizing Q-Values

</>Implementation
import matplotlib.pyplot as plt

def visualize_q_values(agent, grid_size, action_names=['UP', 'DOWN', 'LEFT', 'RIGHT']):
    """Visualize Q-values as a heatmap with best actions."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 12))

    for action_idx, (ax, action_name) in enumerate(zip(axes.flat, action_names)):
        # Create Q-value grid for this action
        q_grid = np.zeros((grid_size, grid_size))
        for row in range(grid_size):
            for col in range(grid_size):
                q_grid[row, col] = agent.Q[(row, col)][action_idx]

        im = ax.imshow(q_grid, cmap='RdYlGn', aspect='equal')
        ax.set_title(f'Q-values for {action_name}')
        ax.set_xlabel('Column')
        ax.set_ylabel('Row')
        plt.colorbar(im, ax=ax)

        # Add value annotations
        for row in range(grid_size):
            for col in range(grid_size):
                ax.text(col, row, f'{q_grid[row, col]:.1f}',
                       ha='center', va='center', fontsize=8)

    plt.tight_layout()
    plt.show()


def visualize_policy(agent, grid_size):
    """Visualize the greedy policy as arrows."""
    arrow_chars = {0: '^', 1: 'v', 2: '<', 3: '>'}

    print("Learned Policy:")
    for row in range(grid_size):
        row_str = ""
        for col in range(grid_size):
            best_action = np.argmax(agent.Q[(row, col)])
            row_str += arrow_chars[best_action] + " "
        print(row_str)


Summary

Q-learning is remarkably simple:

  1. Observe state S, choose action A (epsilon-greedy)
  2. Take action, observe reward R and next state S'
  3. Update: Q(S, A) ← Q(S, A) + α[R + γ max_{a'} Q(S', a') - Q(S, A)]
  4. Move to next state: S ← S'

The key insight: by using the max, we learn about optimal behavior regardless of how we actually behave.

In the next section, we’ll compare SARSA and Q-learning head-to-head on the Cliff Walking environment—the canonical example that shows their different behaviors.