Temporal Difference Learning • Part 3 of 3

TD vs Monte Carlo

Bias, variance, and sample efficiency


We’ve now seen two fundamentally different approaches to learning value functions from experience:

  • Monte Carlo (MC): Wait until the episode ends, then update using the actual return
  • Temporal Difference (TD): Update after each step using an estimated return

Both converge to the same answer, the true value function $V^\pi$, but they take very different paths to get there. Understanding when to use which is crucial for practical RL.

The Core Difference

The key distinction is what we use as the “target” for learning:

Monte Carlo target: the actual return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$

TD target: the estimated return $R_{t+1} + \gamma V(S_{t+1})$

MC uses real experience all the way to the end. TD uses one step of real experience, then substitutes its own estimate.
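The two targets are each a couple of lines of Python. A minimal sketch, not code from this chapter: `mc_return` and `td_target` are hypothetical helper names, and `V` is any mapping from states to current value estimates.

```python
def mc_return(rewards, gamma=1.0):
    """Monte Carlo target: discounted sum of ALL rewards until the episode ends."""
    G = 0.0
    for r in reversed(rewards):  # accumulate from the end backward
        G = r + gamma * G
    return G

def td_target(reward, next_state, V, gamma=1.0):
    """TD(0) target: one real reward plus the current ESTIMATE of the next state."""
    return reward + gamma * V[next_state]

# An episode with rewards 0, 0, +10 and gamma = 1:
print(mc_return([0, 0, 10]))             # -> 10.0
print(td_target(0, 'B', {'B': 5.0}))     # -> 5.0 (uses the estimate, not the future)
```

Note the asymmetry: `mc_return` needs the whole reward sequence, while `td_target` needs only one transition plus the current table `V`.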

Monte Carlo

  • Waits for episode to end
  • Uses actual return $G_t$
  • Unbiased estimates
  • High variance (many random events)
  • Requires episodic tasks
  • Updates at episode end

TD Learning

  • Updates every step
  • Uses TD target $R_{t+1} + \gamma V(S_{t+1})$
  • Biased estimates (bootstrapping)
  • Low variance (single transition)
  • Works for continuing tasks
  • Updates immediately

The Bias-Variance Tradeoff

This is perhaps the most important concept in comparing TD and MC.

📖Bias

A learning method is biased if its expected estimate differs from the true value. TD is biased because it bootstraps from $V(S_{t+1})$, which is itself an estimate. If our estimates are wrong, we're learning from incorrect targets.

📖Variance

Variance measures how much an estimate fluctuates across different samples. MC has high variance because the return $G_t$ depends on many random transitions. TD has low variance because it depends on only a single transition.

Mathematical Details

Let's be more precise. Consider estimating $V(s)$:

MC Estimator: $V(s) \approx \frac{1}{n}\sum_{i=1}^{n} G_t^{(i)}$

The MC estimator is the average of actual returns observed from state $s$. This is an unbiased estimator of $V^\pi(s)$ because $\mathbb{E}[G_t \mid S_t = s] = V^\pi(s)$.

However, each $G_t$ is a sum of many random variables (all future rewards), so the variance can be very large, especially for long episodes.

TD Estimator: updates toward $R_{t+1} + \gamma V(S_{t+1})$

This is biased because we're using $V(S_{t+1})$ instead of the true value $V^\pi(S_{t+1})$. But the variance is lower because we're only sampling a single transition.

The key insight: bias is often fixable (estimates improve over time), but variance can severely slow learning. This is why TD often outperforms MC despite being biased.

A Tale of Two Updates

📌Example

Comparing Updates on the Same Experience

An agent starts in state A, transitions to B (reward 0), then to C (reward 0), then to terminal (reward +10). Assume $\gamma = 1.0$ and $\alpha = 0.1$.

Current estimates: $V(A) = 5$, $V(B) = 5$, $V(C) = 5$.

Monte Carlo update for state A:

  • Actual return: $G = 0 + 0 + 10 = 10$
  • Update: $V(A) \leftarrow 5 + 0.1 \times (10 - 5) = 5.5$

TD(0) updates (step by step):

  • At A→B: $V(A) \leftarrow 5 + 0.1 \times (0 + 5 - 5) = 5$ (no change)
  • At B→C: $V(B) \leftarrow 5 + 0.1 \times (0 + 5 - 5) = 5$ (no change)
  • At C→terminal: $V(C) \leftarrow 5 + 0.1 \times (10 + 0 - 5) = 5.5$

After one episode, MC updated A directly toward the correct value. TD only updated C, and needs more episodes for that information to propagate back to A and B.

But what if the episode had been 1000 steps long with lots of randomness? MC would use all that noise in its update. TD would break it into manageable pieces.
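The arithmetic in the example above can be checked mechanically. A minimal sketch (the three-transition episode and the update rules come from the example; the variable names and the explicit terminal state `T` are mine):

```python
gamma, alpha = 1.0, 0.1
V = {'A': 5.0, 'B': 5.0, 'C': 5.0, 'T': 0.0}  # terminal state T is fixed at 0
transitions = [('A', 0, 'B'), ('B', 0, 'C'), ('C', 10, 'T')]  # (state, reward, next)

# Monte Carlo update for state A: use the full return G = 0 + 0 + 10
G = sum(r * gamma**i for i, (_, r, _) in enumerate(transitions))
V_mc_A = V['A'] + alpha * (G - V['A'])
print(V_mc_A)  # -> 5.5

# TD(0): update each state from one reward plus the next state's current estimate
for s, r, s2 in transitions:
    V[s] += alpha * (r + gamma * V[s2] - V[s])
print(V['A'], V['B'], V['C'])  # -> 5.0 5.0 5.5
```

Only $V(C)$ moves under TD after one episode, exactly as computed by hand: the reward information has to propagate backward over subsequent episodes.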

When MC Beats TD

Monte Carlo has advantages in certain situations:

1. Very noisy bootstrapping targets

If your value estimates are very wrong, TD’s bias can hurt. MC doesn’t depend on value estimates, so it eventually gets the right answer.

2. Simple problems with short episodes

If episodes are short and returns are not too noisy, MC’s unbiasedness shines without too much variance.

3. When you need guaranteed unbiasedness

Some applications (like certain policy gradient methods) theoretically require unbiased estimates.

4. Batch settings with complete episodes

If you have a fixed dataset of complete episodes and want to extract as much signal as possible, MC can be preferable.

When TD Beats MC

TD has advantages in more situations:

1. Online learning

TD can update after every step. MC must wait for the episode to end. For learning during interaction, TD is essential.

2. Continuing tasks

Tasks that never end (like keeping a robot balanced forever) can’t use MC at all. TD handles them naturally.

3. Long episodes with high variance

If episodes are long and returns are noisy, MC’s variance becomes crippling. TD’s step-by-step updates are more stable.

4. Fast credit assignment

Information about rewards propagates backward through TD updates. MC only assigns credit at the end of an episode.

5. Limited data

TD is often more sample-efficient because it extracts more learning from each transition.
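The credit-assignment point is easy to see on a tiny deterministic chain: the final reward needs several TD episodes to reach the start state, but each episode's updates are cheap and stable. A hedged sketch (the chain environment and `td_sweep` are illustrative constructions of mine, not from this chapter):

```python
alpha, gamma = 0.5, 1.0
V = [0.0, 0.0, 0.0, 0.0]  # states 0, 1, 2; index 3 is terminal, fixed at 0

def td_sweep(V):
    """One episode on the deterministic chain 0 -> 1 -> 2 -> terminal, reward +1 at the end."""
    for s in range(3):
        r = 1.0 if s == 2 else 0.0
        V[s] += alpha * (r + gamma * V[s + 1] - V[s])

for ep in range(1, 4):
    td_sweep(V)
    print(ep, V[:3])
# Episode 1 moves only V[2]; the start state V[0] first changes on episode 3,
# as the reward information propagates one state backward per episode.
```

The same chain under MC would update all three states after the very first episode, but each update would carry the full return's noise in a stochastic environment.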

The Random Walk Experiment

The classic comparison is on the Random Walk problem we saw earlier.

📌Example

TD vs MC on Random Walk

In this experiment (from Sutton & Barto), both methods try to estimate the true value function for the 5-state random walk.

Key findings:

  • TD converges faster than MC for most learning rates
  • MC eventually catches up but takes more episodes
  • TD’s advantage grows when starting from poor initial estimates

This experiment demonstrates TD’s practical superiority in a controlled setting, even though MC is unbiased.

</>Implementation

Here’s code to compare TD and MC on the Random Walk:

import numpy as np
from collections import defaultdict

class RandomWalk:
    """5-state random walk environment."""

    def __init__(self):
        self.states = [1, 2, 3, 4, 5]  # Named by position
        self.state = 3  # Start in middle

    def reset(self):
        self.state = 3
        return self.state

    def step(self, action=None):
        """Move randomly left or right."""
        if np.random.random() < 0.5:
            self.state -= 1
        else:
            self.state += 1

        if self.state == 0:
            return 0, 0, True, {}  # Left terminal, reward 0
        elif self.state == 6:
            return 6, 1, True, {}  # Right terminal, reward 1
        else:
            return self.state, 0, False, {}


def td_0_episode(env, V, alpha=0.1, gamma=1.0):
    """Run one episode of TD(0)."""
    state = env.reset()
    done = False

    while not done:
        next_state, reward, done, _ = env.step()

        if done:
            td_target = reward
        else:
            td_target = reward + gamma * V[next_state]

        V[state] += alpha * (td_target - V[state])
        state = next_state


def mc_episode(env, V, alpha=0.1, gamma=1.0):
    """Run one episode of Monte Carlo."""
    state = env.reset()
    episode = [(state, None)]  # (state, reward) pairs
    done = False

    # Generate complete episode
    while not done:
        next_state, reward, done, _ = env.step()
        episode.append((next_state, reward))

    # Compute returns and update (backward)
    G = 0
    for t in range(len(episode) - 2, -1, -1):
        state, _ = episode[t]
        _, reward = episode[t + 1]
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])


def run_comparison(num_runs=100, num_episodes=100):
    """Compare TD and MC over multiple runs."""
    true_V = {1: 1/6, 2: 2/6, 3: 3/6, 4: 4/6, 5: 5/6}

    td_errors = np.zeros((num_runs, num_episodes))
    mc_errors = np.zeros((num_runs, num_episodes))

    for run in range(num_runs):
        # Initialize value estimates
        V_td = defaultdict(lambda: 0.5)
        V_mc = defaultdict(lambda: 0.5)

        env_td = RandomWalk()
        env_mc = RandomWalk()

        for ep in range(num_episodes):
            td_0_episode(env_td, V_td, alpha=0.1)
            mc_episode(env_mc, V_mc, alpha=0.1)

            # Compute RMS error
            td_rms = np.sqrt(np.mean([(V_td[s] - true_V[s])**2 for s in true_V]))
            mc_rms = np.sqrt(np.mean([(V_mc[s] - true_V[s])**2 for s in true_V]))

            td_errors[run, ep] = td_rms
            mc_errors[run, ep] = mc_rms

    return td_errors.mean(axis=0), mc_errors.mean(axis=0)

# Run and plot
td_err, mc_err = run_comparison()

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(td_err, label='TD(0)', color='cyan')
plt.plot(mc_err, label='Monte Carlo', color='blue')
plt.xlabel('Episodes')
plt.ylabel('RMS Error (averaged over states)')
plt.title('TD vs MC on Random Walk')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Deeper Analysis

Mathematical Details

Why does TD have lower variance?

The MC return is: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$

Each $R_i$ is random, so $G_t$ is a sum of many random variables. Its variance grows with episode length.

The TD target is: $R_{t+1} + \gamma V(S_{t+1})$

This depends on only one random reward $R_{t+1}$ and one random next state $S_{t+1}$. The variance is bounded, regardless of episode length.
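This variance gap can be checked empirically. A hedged sketch under simplifying assumptions of mine: rewards are i.i.d. standard normal noise (so every state's true value is 0), the value estimate is frozen at 0, and we compare the spread of the full-return target with the spread of the one-step TD target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_episodes, horizon, gamma = 10_000, 50, 1.0

# i.i.d. N(0, 1) rewards: 10,000 episodes of 50 steps each
rewards = rng.normal(0.0, 1.0, size=(n_episodes, horizon))

mc_targets = rewards.sum(axis=1)          # full-return target G_t (sum of 50 rewards)
td_targets = rewards[:, 0] + gamma * 0.0  # one reward + frozen estimate V = 0

print(mc_targets.var())  # ~ 50: variance scales with the horizon
print(td_targets.var())  # ~ 1: variance of a single transition
```

Both estimators have the same mean (0, the true value), but the MC target's variance is roughly `horizon` times larger here, which is exactly the tradeoff described above.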

Why is TD biased?

$\mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$ only if $V = V^\pi$.

When $V \neq V^\pi$ (which is always true during learning), the TD target is biased. The bias disappears as $V \to V^\pi$, which is why TD still converges correctly.


The Spectrum Between TD and MC

TD(0) and MC are the endpoints of a spectrum of n-step methods:

  • 1-step return (TD(0)): use 1 step of real reward, then bootstrap
  • 2-step return: use 2 steps of real reward, then bootstrap
  • n-step return: use n steps of real reward, then bootstrap
  • ∞-step return (MC): use all real rewards, no bootstrapping

The sweet spot often lies somewhere in between. TD(λ) provides a way to blend these via eligibility traces, effectively averaging over all n-step returns.

For now, the key insight is: TD(0) is not the only option, and neither is MC. Both are special cases of a broader family.
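The whole spectrum fits in one function. A minimal sketch (`n_step_return` is a name I'm introducing): with n = 1 it reproduces the TD(0) target, and with n at least the episode length it reproduces the MC return.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step target from time t: up to n real rewards, then bootstrap from
    values[t + n] if the episode is not over by then."""
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):     # real rewards
        G += gamma**k * rewards[t + k]
    if t + n < T:                      # episode not over: bootstrap from the estimate
        G += gamma**n * values[t + n]
    return G

rewards = [0, 0, 10]   # the A -> B -> C -> terminal episode from earlier
values  = [5, 5, 5]    # current estimates V(A), V(B), V(C)
print(n_step_return(rewards, values, t=0, n=1))  # -> 5.0  (TD(0) target: 0 + V(B))
print(n_step_return(rewards, values, t=0, n=3))  # -> 10.0 (the MC return)
```

Intermediate values of n interpolate: `n=2` gives $0 + 0 + \gamma^2 V(C) = 5$, trading a little more real reward against a later (often more accurate) bootstrap point.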

Practical Recommendations

Based on the tradeoffs, here are guidelines for choosing between TD and MC:

💡Tip

Use TD when:

  • You need online learning (updating during episodes)
  • Tasks are continuing (no natural episode end)
  • Episodes are long and returns are noisy
  • You want faster credit assignment
  • Sample efficiency matters
💡Tip

Use MC when:

  • Episodes are short and simple
  • You have complete episodes in a batch
  • You need guaranteed unbiased estimates
  • Bootstrap estimates would be very inaccurate
  • You’re using policy gradient methods that require returns

In practice, TD methods (and their extensions to control: SARSA, Q-learning) dominate. The ability to learn online, handle continuing tasks, and extract more from each sample makes TD the workhorse of modern RL.

Summary

| Aspect | TD | MC |
| --- | --- | --- |
| Bias | Yes (from bootstrapping) | No |
| Variance | Low (single step) | High (full episode) |
| Online learning | Yes | No |
| Continuing tasks | Yes | No |
| Sample efficiency | Generally higher | Generally lower |
| Implementation | Slightly simpler | Must store episodes |
| Convergence | Same fixed point | Same fixed point |

Both TD and MC are fundamental tools in the RL toolkit. TD’s practical advantages—especially for control problems with large state spaces—make it the foundation for most modern algorithms.

In the next chapter, we'll extend TD learning from prediction (estimating $V^\pi$) to control (finding a good policy). That's where SARSA comes in.