Chapter 111
📝Draft

Introduction to TD Learning

Learning from experience without waiting for the episode to end


What You'll Learn

  • Explain the TD learning idea: bootstrapping from estimates
  • Implement TD(0) for value prediction
  • Identify the TD error and its role in learning
  • Compare TD learning with Monte Carlo and Dynamic Programming
  • Explain bias-variance tradeoffs in TD vs MC

Monte Carlo methods wait until the end of an episode to learn. Dynamic programming needs a model of the environment. What if we could learn step-by-step, from experience, without a model? That’s TD learning—the heart of modern reinforcement learning.

TD(0) vs Monte Carlo Learning

Compare how TD and MC learn value estimates on the Random Walk problem.

Random Walk: Agent starts at S3, moves left or right with 50% probability. Left terminal gives 0 reward, right terminal gives +1 reward.
[Interactive demo: two panels show value estimates for S1–S5 under TD(0) and Monte Carlo, all initialized to 0.50, with the left terminal worth 0 and the right terminal worth +1. Both panels start at Episodes: 0, RMSE: 0.236. True values (dashed lines): S1 = 0.17, S2 = 0.33, S3 = 0.50, S4 = 0.67, S5 = 0.83.]
Key Difference:
  • TD(0) updates after every step using bootstrap estimates (V(s) depends on V(s'))
  • Monte Carlo waits until episode end to update, using actual returns
Watch how TD values change immediately while MC values only update when an episode terminates. TD typically learns faster due to its online updates!
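The comparison above is easy to reproduce. Here is a minimal sketch of TD(0) prediction on the five-state Random Walk (the function name and defaults are our own, not part of the demo); with no discounting and terminal values fixed at 0, the estimates drift from 0.50 toward the true values:

```python
import random

def td0_random_walk(n_episodes=100, alpha=0.1, seed=0):
    """TD(0) prediction on the Random Walk (states 1..5; terminals 0 and 6)."""
    random.seed(seed)
    V = {s: 0.5 for s in range(1, 6)}  # non-terminal estimates start at 0.50
    V[0] = V[6] = 0.0                  # terminal states have value 0
    for _ in range(n_episodes):
        s = 3  # every episode starts at S3
        while s not in (0, 6):
            s_next = s + random.choice([-1, 1])      # left/right with p = 0.5
            r = 1.0 if s_next == 6 else 0.0          # only the right terminal pays
            V[s] += alpha * (r + V[s_next] - V[s])   # TD(0) update, gamma = 1
            s = s_next
    return V

values = td0_random_walk()
print({s: round(values[s], 2) for s in range(1, 6)})
```

Each estimate is updated the moment its transition happens, which is exactly the online behavior the demo highlights.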

Why TD Learning Matters

Temporal Difference (TD) learning combines the best of both worlds:

  • Like Monte Carlo, it learns from experience (no model needed)
  • Like Dynamic Programming, it updates estimates from estimates (bootstrapping), allowing learning before an episode ends

This is the key insight that enables practical reinforcement learning. Nearly every successful RL algorithm—from Q-learning to DQN to PPO—builds on TD ideas.

📖Temporal Difference Learning

A class of RL methods that learn by comparing predictions made at different points in time. The “temporal difference” is the gap between successive predictions—if our value estimates are correct, this difference should be zero on average.

The TD Error

The TD error—the difference between what we expected and what we got—becomes our learning signal:

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

Think of the TD error as a “surprise” signal:

  • If \delta_t > 0: Things went better than expected. Increase V(S_t).
  • If \delta_t < 0: Things went worse than expected. Decrease V(S_t).
  • If \delta_t = 0: Our prediction was perfect. No change needed.

This simple idea powers everything from tabular Q-learning to modern deep RL systems.
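To make the “surprise” interpretation concrete, here is a toy calculation (all numbers invented for illustration):

```python
V_s, V_s_next = 0.5, 0.7   # current estimates for S_t and S_{t+1} (made up)
reward, gamma = 0.0, 0.9   # no immediate reward, discount of 0.9

td_error = reward + gamma * V_s_next - V_s
print(round(td_error, 2))  # 0.0 + 0.9*0.7 - 0.5 = 0.13: better than expected
```

The positive error says the successor state looked more valuable than V(S_t) predicted, so V(S_t) should be nudged upward.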

Chapter Overview

This chapter introduces TD learning—the foundation for Q-learning, SARSA, and all of modern deep RL.

TD at a Glance

Mathematical Details

The core TD(0) update rule:

V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]

Where:

  • V(S_t) is our current estimate of the state’s value
  • \alpha is the learning rate
  • R_{t+1} + \gamma V(S_{t+1}) is the TD target
  • R_{t+1} + \gamma V(S_{t+1}) - V(S_t) is the TD error \delta_t
Implementation
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.99, done=False):
    """Single TD(0) update."""
    if done:
        td_target = reward
    else:
        td_target = reward + gamma * V[next_state]

    td_error = td_target - V[state]
    V[state] += alpha * td_error
    return td_error
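To see the update in action, here is a sketch that replays a single hand-made episode many times (the three-state chain and its +1 reward are invented for illustration). Each replay pulls value one step further back from the rewarding transition, which is how information propagates through TD updates:

```python
V = [0.0] * 4                 # values for states 0..3 (state 3 is terminal)
alpha, gamma = 0.1, 0.9

# One fixed episode: (state, reward, next_state, done)
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 3, True)]

for _ in range(500):          # replay until the values settle
    for s, r, s_next, done in episode:
        target = r if done else r + gamma * V[s_next]  # the TD target
        V[s] += alpha * (target - V[s])                # move toward it

print([round(v, 2) for v in V[:3]])  # approaches [0.81, 0.9, 1.0]
```

The fixed points are exactly the discounted returns: V[2] → 1, V[1] → γ·1 = 0.9, V[0] → γ² = 0.81, even though no state ever saw a full return directly.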

Prerequisites

This chapter assumes you’re comfortable with value functions, returns, and the Monte Carlo and dynamic programming prediction methods from earlier chapters.

Key Questions We’ll Answer

  • How can we learn from incomplete episodes?
  • What is bootstrapping and why does it introduce bias?
  • How does the TD error act as a “surprise” signal?
  • When should you use TD over Monte Carlo?
  • How does information propagate through TD updates?

The Big Picture

TD learning represents a fundamental shift in how we think about learning from experience:

Dynamic Programming
Needs model
No complete episodes
Updates after computing all states
Monte Carlo
Model-free
Needs complete episodes
Updates at end of episode
TD Learning
Model-free
No complete episodes needed
Updates every step!

TD’s ability to learn incrementally, without a model, from incomplete episodes makes it the practical workhorse of reinforcement learning.


Key Takeaways

  • TD learning updates estimates using other estimates (bootstrapping)
  • The TD error \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) measures “surprise”
  • TD has lower variance than MC but introduces bias
  • TD can learn online, during an episode, without waiting for termination
  • TD works for continuing (non-episodic) tasks where MC cannot
  • The bias-variance tradeoff often favors TD in practice
Next Chapter: SARSA