Temporal Difference Learning • Part 3 of 3

TD vs Monte Carlo

Bias, variance, and sample efficiency


We’ve now seen two fundamentally different approaches to learning value functions from experience:

  • Monte Carlo (MC): Wait until the episode ends, then update using the actual return
  • Temporal Difference (TD): Update after each step using an estimated return

Both converge to the same answer, the true value function $V^\pi$, but they take very different paths to get there. Understanding when to use which is crucial for practical RL.

The Core Difference

The key distinction is what we use as the “target” for learning:

Monte Carlo target: the actual return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$

TD target: the estimated return $R_{t+1} + \gamma V(S_{t+1})$

MC uses real experience all the way to the end. TD uses one step of real experience, then substitutes its own estimate.
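The two targets are each a couple of lines of Python. A minimal sketch, not code from this chapter: `mc_return` and `td_target` are hypothetical helper names, and `V` is any mapping from states to current value estimates.

```python
def mc_return(rewards, gamma=1.0):
    """Monte Carlo target: discounted sum of ALL rewards until the episode ends."""
    G = 0.0
    for r in reversed(rewards):  # accumulate from the end backward
        G = r + gamma * G
    return G

def td_target(reward, next_state, V, gamma=1.0):
    """TD(0) target: one real reward plus the current ESTIMATE of the next state."""
    return reward + gamma * V[next_state]

# An episode with rewards 0, 0, +10 and gamma = 1:
print(mc_return([0, 0, 10]))             # -> 10.0
print(td_target(0, 'B', {'B': 5.0}))     # -> 5.0 (uses the estimate, not the future)
```

Note the asymmetry: `mc_return` needs the whole reward sequence, while `td_target` needs only one transition plus the current table `V`.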

Monte Carlo

  • Waits for episode to end
  • Uses actual return $G_t$
  • Unbiased estimates
  • High variance (many random events)
  • Requires episodic tasks
  • Updates at episode end

TD Learning

  • Updates every step
  • Uses TD target $R_{t+1} + \gamma V(S_{t+1})$
  • Biased estimates (bootstrapping)
  • Low variance (single transition)
  • Works for continuing tasks
  • Updates immediately

The Bias-Variance Tradeoff

This is perhaps the most important concept in comparing TD and MC.

📖Bias

A learning method is biased if its expected estimate differs from the true value. TD is biased because it bootstraps from $V(S_{t+1})$, which is itself an estimate. If our estimates are wrong, we're learning from incorrect targets.

📖Variance

Variance measures how much an estimate fluctuates across different samples. MC has high variance because the return $G_t$ depends on many random transitions. TD has low variance because it depends on only a single transition.

Mathematical Details

Let's be more precise. Consider estimating $V(s)$:

MC Estimator: $V(s) \approx \frac{1}{n}\sum_{i=1}^{n} G_t^{(i)}$

The MC estimator is the average of actual returns observed from state $s$. This is an unbiased estimator of $V^\pi(s)$ because $\mathbb{E}[G_t \mid S_t = s] = V^\pi(s)$.

However, each $G_t$ is a sum of many random variables (all future rewards), so the variance can be very large, especially for long episodes.

TD Estimator: updates toward $R_{t+1} + \gamma V(S_{t+1})$

This is biased because we're using $V(S_{t+1})$ instead of the true value $V^\pi(S_{t+1})$. But the variance is lower because we're only sampling a single transition.

The key insight: bias is often fixable (estimates improve over time), but variance can severely slow learning. This is why TD often outperforms MC despite being biased.

A Tale of Two Updates

📌Example

Comparing Updates on the Same Experience

An agent starts in state A, transitions to B (reward 0), then to C (reward 0), then to terminal (reward +10). Assume $\gamma = 1.0$ and $\alpha = 0.1$.

Current estimates: $V(A) = 5$, $V(B) = 5$, $V(C) = 5$.

Monte Carlo update for state A:

  • Actual return: $G = 0 + 0 + 10 = 10$
  • Update: $V(A) \leftarrow 5 + 0.1 \times (10 - 5) = 5.5$

TD(0) updates (step by step):

  • At A→B: $V(A) \leftarrow 5 + 0.1 \times (0 + 5 - 5) = 5$ (no change)
  • At B→C: $V(B) \leftarrow 5 + 0.1 \times (0 + 5 - 5) = 5$ (no change)
  • At C→terminal: $V(C) \leftarrow 5 + 0.1 \times (10 + 0 - 5) = 5.5$

After one episode, MC updated A directly toward the correct value. TD only updated C, and needs more episodes for that information to propagate back to A and B.

But what if the episode had been 1000 steps long with lots of randomness? MC would use all that noise in its update. TD would break it into manageable pieces.
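The arithmetic in the example above can be checked mechanically. A minimal sketch (the three-transition episode and the update rules come from the example; the variable names and the explicit terminal state `T` are mine):

```python
gamma, alpha = 1.0, 0.1
V = {'A': 5.0, 'B': 5.0, 'C': 5.0, 'T': 0.0}  # terminal state T is fixed at 0
transitions = [('A', 0, 'B'), ('B', 0, 'C'), ('C', 10, 'T')]  # (state, reward, next)

# Monte Carlo update for state A: use the full return G = 0 + 0 + 10
G = sum(r * gamma**i for i, (_, r, _) in enumerate(transitions))
V_mc_A = V['A'] + alpha * (G - V['A'])
print(V_mc_A)  # -> 5.5

# TD(0): update each state from one reward plus the next state's current estimate
for s, r, s2 in transitions:
    V[s] += alpha * (r + gamma * V[s2] - V[s])
print(V['A'], V['B'], V['C'])  # -> 5.0 5.0 5.5
```

Only $V(C)$ moves under TD after one episode, exactly as computed by hand: the reward information has to propagate backward over subsequent episodes.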

When MC Beats TD

Monte Carlo has advantages in certain situations:

1. Very noisy bootstrapping targets

If your value estimates are very wrong, TD’s bias can hurt. MC doesn’t depend on value estimates, so it eventually gets the right answer.

2. Simple problems with short episodes

If episodes are short and returns are not too noisy, MC’s unbiasedness shines without too much variance.

3. When you need guaranteed unbiasedness

Some applications (like certain policy gradient methods) theoretically require unbiased estimates.

4. Batch settings with complete episodes

If you have a fixed dataset of complete episodes and want to extract as much signal as possible, MC can be preferable.

When TD Beats MC

TD has advantages in more situations:

1. Online learning

TD can update after every step. MC must wait for the episode to end. For learning during interaction, TD is essential.

2. Continuing tasks

Tasks that never end (like keeping a robot balanced forever) can’t use MC at all. TD handles them naturally.

3. Long episodes with high variance

If episodes are long and returns are noisy, MC’s variance becomes crippling. TD’s step-by-step updates are more stable.

4. Fast credit assignment

Information about rewards propagates backward through TD updates. MC only assigns credit at the end of an episode.

5. Limited data

TD is often more sample-efficient because it extracts more learning from each transition.
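The credit-assignment point is easy to see on a tiny deterministic chain: the final reward needs several TD episodes to reach the start state, but each episode's updates are cheap and stable. A hedged sketch (the chain environment and `td_sweep` are illustrative constructions of mine, not from this chapter):

```python
alpha, gamma = 0.5, 1.0
V = [0.0, 0.0, 0.0, 0.0]  # states 0, 1, 2; index 3 is terminal, fixed at 0

def td_sweep(V):
    """One episode on the deterministic chain 0 -> 1 -> 2 -> terminal, reward +1 at the end."""
    for s in range(3):
        r = 1.0 if s == 2 else 0.0
        V[s] += alpha * (r + gamma * V[s + 1] - V[s])

for ep in range(1, 4):
    td_sweep(V)
    print(ep, V[:3])
# Episode 1 moves only V[2]; the start state V[0] first changes on episode 3,
# as the reward information propagates one state backward per episode.
```

The same chain under MC would update all three states after the very first episode, but each update would carry the full return's noise in a stochastic environment.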

The Random Walk Experiment

The classic comparison is on the Random Walk problem we saw earlier.

📌Example

TD vs MC on Random Walk

In this experiment (from Sutton & Barto), both methods try to estimate the true value function for the 5-state random walk.

Key findings:

  • TD converges faster than MC for most learning rates
  • MC eventually catches up but takes more episodes
  • TD’s advantage grows when starting from poor initial estimates

This experiment demonstrates TD’s practical superiority in a controlled setting, even though MC is unbiased.

</>Implementation

Here’s code to compare TD and MC on the Random Walk:

import numpy as np
from collections import defaultdict

class RandomWalk:
    """5-state random walk environment."""

    def __init__(self):
        self.states = [1, 2, 3, 4, 5]  # Named by position
        self.state = 3  # Start in middle

    def reset(self):
        self.state = 3
        return self.state

    def step(self, action=None):
        """Move randomly left or right."""
        if np.random.random() < 0.5:
            self.state -= 1
        else:
            self.state += 1

        if self.state == 0:
            return 0, 0, True, {}  # Left terminal, reward 0
        elif self.state == 6:
            return 6, 1, True, {}  # Right terminal, reward 1
        else:
            return self.state, 0, False, {}


def td_0_episode(env, V, alpha=0.1, gamma=1.0):
    """Run one episode of TD(0)."""
    state = env.reset()
    done = False

    while not done:
        next_state, reward, done, _ = env.step()

        if done:
            td_target = reward
        else:
            td_target = reward + gamma * V[next_state]

        V[state] += alpha * (td_target - V[state])
        state = next_state


def mc_episode(env, V, alpha=0.1, gamma=1.0):
    """Run one episode of Monte Carlo."""
    state = env.reset()
    episode = [(state, None)]  # (state, reward) pairs
    done = False

    # Generate complete episode
    while not done:
        next_state, reward, done, _ = env.step()
        episode.append((next_state, reward))

    # Compute returns and update (backward)
    G = 0
    for t in range(len(episode) - 2, -1, -1):
        state, _ = episode[t]
        _, reward = episode[t + 1]
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])


def run_comparison(num_runs=100, num_episodes=100):
    """Compare TD and MC over multiple runs."""
    true_V = {1: 1/6, 2: 2/6, 3: 3/6, 4: 4/6, 5: 5/6}

    td_errors = np.zeros((num_runs, num_episodes))
    mc_errors = np.zeros((num_runs, num_episodes))

    for run in range(num_runs):
        # Initialize value estimates
        V_td = defaultdict(lambda: 0.5)
        V_mc = defaultdict(lambda: 0.5)

        env_td = RandomWalk()
        env_mc = RandomWalk()

        for ep in range(num_episodes):
            td_0_episode(env_td, V_td, alpha=0.1)
            mc_episode(env_mc, V_mc, alpha=0.1)

            # Compute RMS error
            td_rms = np.sqrt(np.mean([(V_td[s] - true_V[s])**2 for s in true_V]))
            mc_rms = np.sqrt(np.mean([(V_mc[s] - true_V[s])**2 for s in true_V]))

            td_errors[run, ep] = td_rms
            mc_errors[run, ep] = mc_rms

    return td_errors.mean(axis=0), mc_errors.mean(axis=0)

# Run and plot
td_err, mc_err = run_comparison()

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(td_err, label='TD(0)', color='cyan')
plt.plot(mc_err, label='Monte Carlo', color='blue')
plt.xlabel('Episodes')
plt.ylabel('RMS Error (averaged over states)')
plt.title('TD vs MC on Random Walk')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Deeper Analysis

Mathematical Details

Why does TD have lower variance?

The MC return is: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$

Each $R_i$ is random, so $G_t$ is a sum of many random variables. Its variance grows with episode length.

The TD target is: $R_{t+1} + \gamma V(S_{t+1})$

This depends on only one random reward $R_{t+1}$ and one random next state $S_{t+1}$. The variance is bounded, regardless of episode length.
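This variance gap can be checked empirically. A hedged sketch under simplifying assumptions of mine: rewards are i.i.d. standard normal noise (so every state's true value is 0), the value estimate is frozen at 0, and we compare the spread of the full-return target with the spread of the one-step TD target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_episodes, horizon, gamma = 10_000, 50, 1.0

# i.i.d. N(0, 1) rewards: 10,000 episodes of 50 steps each
rewards = rng.normal(0.0, 1.0, size=(n_episodes, horizon))

mc_targets = rewards.sum(axis=1)          # full-return target G_t (sum of 50 rewards)
td_targets = rewards[:, 0] + gamma * 0.0  # one reward + frozen estimate V = 0

print(mc_targets.var())  # ~ 50: variance scales with the horizon
print(td_targets.var())  # ~ 1: variance of a single transition
```

Both estimators have the same mean (0, the true value), but the MC target's variance is roughly `horizon` times larger here, which is exactly the tradeoff described above.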

Why is TD biased?

$\mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$ only if $V = V^\pi$.

When $V \neq V^\pi$ (which is always true during learning), the TD target is biased. The bias disappears as $V \to V^\pi$, which is why TD still converges correctly.


The Spectrum Between TD and MC

TD(0) and MC are the endpoints of a spectrum of n-step methods:

  • 1-step return (TD(0)): use 1 step of real reward, then bootstrap
  • 2-step return: use 2 steps of real reward, then bootstrap
  • n-step return: use n steps of real reward, then bootstrap
  • ∞-step return (MC): use all real rewards, no bootstrapping

The sweet spot often lies somewhere in between. TD(λ) provides a way to blend these via eligibility traces, effectively averaging over all n-step returns.

For now, the key insight is: TD(0) is not the only option, and neither is MC. Both are special cases of a broader family.
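The whole spectrum fits in one function. A minimal sketch (`n_step_return` is a name I'm introducing): with n = 1 it reproduces the TD(0) target, and with n at least the episode length it reproduces the MC return.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step target from time t: up to n real rewards, then bootstrap from
    values[t + n] if the episode is not over by then."""
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):     # real rewards
        G += gamma**k * rewards[t + k]
    if t + n < T:                      # episode not over: bootstrap from the estimate
        G += gamma**n * values[t + n]
    return G

rewards = [0, 0, 10]   # the A -> B -> C -> terminal episode from earlier
values  = [5, 5, 5]    # current estimates V(A), V(B), V(C)
print(n_step_return(rewards, values, t=0, n=1))  # -> 5.0  (TD(0) target: 0 + V(B))
print(n_step_return(rewards, values, t=0, n=3))  # -> 10.0 (the MC return)
```

Intermediate values of n interpolate: `n=2` gives $0 + 0 + \gamma^2 V(C) = 5$, trading a little more real reward against a later (often more accurate) bootstrap point.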

Practical Recommendations

Based on the tradeoffs, here are guidelines for choosing between TD and MC:

💡Tip

Use TD when:

  • You need online learning (updating during episodes)
  • Tasks are continuing (no natural episode end)
  • Episodes are long and returns are noisy
  • You want faster credit assignment
  • Sample efficiency matters
💡Tip

Use MC when:

  • Episodes are short and simple
  • You have complete episodes in a batch
  • You need guaranteed unbiased estimates
  • Bootstrap estimates would be very inaccurate
  • You’re using policy gradient methods that require returns

In practice, TD methods (and their extensions to control: SARSA, Q-learning) dominate. The ability to learn online, handle continuing tasks, and extract more from each sample makes TD the workhorse of modern RL.

Summary

| Aspect | TD | MC |
| --- | --- | --- |
| Bias | Yes (from bootstrapping) | No |
| Variance | Low (single step) | High (full episode) |
| Online learning | Yes | No |
| Continuing tasks | Yes | No |
| Sample efficiency | Generally higher | Generally lower |
| Implementation | Slightly simpler | Must store episodes |
| Convergence | Same fixed point | Same fixed point |

Both TD and MC are fundamental tools in the RL toolkit. TD’s practical advantages—especially for control problems with large state spaces—make it the foundation for most modern algorithms.

In the next chapter, we'll extend TD learning from prediction (estimating $V^\pi$) to control (finding a good policy). That's where SARSA comes in.