Advanced Topics • Part 3 of 4

Dyna Architecture

Combining real and simulated experience

The Dyna Architecture

Dyna is one of the most elegant ideas in model-based RL: combine model-free learning with model-based planning in a unified framework. After each real experience, the agent also generates simulated experiences from its learned model—amplifying every real interaction many times over.

The Dyna Idea

📖Dyna Architecture

Dyna integrates three processes:

  1. Direct RL: Learn from real experience (e.g., Q-learning updates)
  2. Model Learning: Build and maintain a model of the environment
  3. Planning: Use the model to generate simulated experiences for additional learning

All three happen after each real environment step, creating a virtuous cycle of learning and planning.

Imagine you’re learning to navigate a new city. You take a walk (real experience) and remember what you saw: “From the coffee shop, going left leads to the park.” Later, sitting at home, you mentally replay your walk and plan alternative routes (simulated experience): “I wonder what would happen if I had turned right at the bakery…”

Dyna does exactly this: learn from reality, remember what happened, then amplify that learning through imagination.

The beauty is that you can do many imaginary walks for every real one. Each real experience teaches you something directly AND feeds into your mental model for many future imaginary experiences.

Dyna-Q: The Algorithm

The classic Dyna algorithm combines tabular Q-learning with a simple deterministic model.

Mathematical Details

After each real transition (s, a) \rightarrow (r, s'):

Step 1: Direct RL - Update Q from real experience: Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]

Step 2: Model Learning - Remember what happened: \text{Model}(s, a) \leftarrow (r, s')

Step 3: Planning - For n iterations:

  1. Sample a previously visited (s_{\text{sim}}, a_{\text{sim}}) pair
  2. Query the model: (r_{\text{sim}}, s'_{\text{sim}}) = \text{Model}(s_{\text{sim}}, a_{\text{sim}})
  3. Update Q from simulated experience: Q(s_{\text{sim}}, a_{\text{sim}}) \leftarrow Q(s_{\text{sim}}, a_{\text{sim}}) + \alpha[r_{\text{sim}} + \gamma \max_{a'} Q(s'_{\text{sim}}, a') - Q(s_{\text{sim}}, a_{\text{sim}})]

The update equations are identical—only the source of experience differs. Real experience comes from the environment; simulated experience comes from the model.
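Because the two updates share one equation, a single helper can serve both. Here is a minimal sketch of that symmetry (the `q_update` helper and the toy numbers are illustrative, not part of the chapter's code):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update; works for real or simulated tuples."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

Q = np.zeros((3, 2))          # 3 states, 2 actions

# Real experience from the environment
q_update(Q, s=0, a=1, r=1.0, s_next=2)

# Simulated experience replayed from a learned model
model = {(0, 1): (1.0, 2)}    # Model(s, a) -> (r, s')
r_sim, s_next_sim = model[(0, 1)]
q_update(Q, s=0, a=1, r=r_sim, s_next=s_next_sim)
```

Both calls pull `Q[0, 1]` toward the same target; the update never sees whether the tuple came from the environment or the model.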

</>Implementation
import numpy as np
from collections import defaultdict

class DynaQ:
    """
    Dyna-Q: Q-learning with model-based planning.

    Combines direct RL, model learning, and planning in one algorithm.
    """

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99,
                 epsilon=0.1, planning_steps=10):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.planning_steps = planning_steps

        # Q-table
        self.Q = np.zeros((n_states, n_actions))

        # Model: (state, action) -> (reward, next_state)
        self.model = {}

        # Track which state-action pairs we've experienced
        self.experienced = []

    def select_action(self, state):
        """Epsilon-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.Q[state])

    def update(self, s, a, r, s_prime, done):
        """
        Full Dyna-Q update: direct RL + model learning + planning.
        """
        # Step 1: Direct RL - learn from real experience
        if done:
            target = r
        else:
            target = r + self.gamma * np.max(self.Q[s_prime])
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])

        # Step 2: Model learning - remember this transition
        self.model[(s, a)] = (r, s_prime, done)
        if (s, a) not in self.experienced:
            self.experienced.append((s, a))

        # Step 3: Planning - simulate and learn from model
        for _ in range(self.planning_steps):
            # Sample a previously experienced state-action pair
            s_sim, a_sim = self.experienced[np.random.randint(len(self.experienced))]

            # Query the model
            r_sim, s_prime_sim, done_sim = self.model[(s_sim, a_sim)]

            # Q-learning update on simulated experience
            if done_sim:
                target_sim = r_sim
            else:
                target_sim = r_sim + self.gamma * np.max(self.Q[s_prime_sim])
            self.Q[s_sim, a_sim] += self.alpha * (target_sim - self.Q[s_sim, a_sim])


def train_dyna_q(env, agent, episodes=500):
    """Training loop for Dyna-Q."""
    returns = []

    for episode in range(episodes):
        state = env.reset()
        episode_return = 0
        done = False

        while not done:
            action = agent.select_action(state)
            next_state, reward, done, _ = env.step(action)

            # Dyna-Q update (includes planning!)
            agent.update(state, action, reward, next_state, done)

            episode_return += reward
            state = next_state

        returns.append(episode_return)

        if episode % 100 == 0:
            print(f"Episode {episode}, Return: {np.mean(returns[-100:]):.2f}")

    return returns

How Many Planning Steps?

A key hyperparameter in Dyna is n: how many planning steps to take after each real step. This creates a tradeoff between sample efficiency and compute.

Think about it this way:

  • n = 0: Pure model-free Q-learning. No imagination, just reality.
  • n = 5: For every real step, imagine 5 additional steps. 6x the learning per real interaction.
  • n = 50: For every real step, imagine 50 additional steps. Lots of imagination!

More planning means faster learning from fewer real interactions—but each step takes more compute, and planning is only useful if the model is accurate.

📌The Impact of Planning Steps

Consider learning to navigate a 10x10 GridWorld maze:

| Planning Steps | Episodes to Solve | Real Environment Steps |
| --- | --- | --- |
| 0 (pure Q-learning) | ~1,500 | ~30,000 |
| 5 | ~200 | ~4,000 |
| 50 | ~50 | ~1,000 |

With 50 planning steps, we solve the maze with 30x fewer real interactions! The model, once learned, provides essentially unlimited free learning signal.

Mathematical Details

The sample efficiency improvement is roughly linear in planning steps (for a good model). If each real step produces n+1 learning updates (1 real + n simulated), we need roughly \frac{1}{n+1} as many real samples to achieve the same learning.

However, this assumes the model is perfect. With model errors:

  • Simulated updates may be wrong
  • More planning can actually hurt if the model is very inaccurate
  • There’s a sweet spot depending on model quality
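A miniature experiment makes the "free learning signal" concrete. The sketch below is illustrative (a 7-state corridor with a scripted right-walking episode, not the GridWorld above): one pass of direct RL leaves the start state valueless, while planning replays from the model propagate the goal reward all the way back.

```python
import numpy as np

# Toy 7-state corridor: action 1 moves right, reward 1.0 on reaching
# state 6 (terminal). One scripted pass supplies the real experience.
n_states, goal = 7, 6
alpha, gamma = 0.5, 0.95
Q = np.zeros((n_states, 2))            # actions: 0 = left, 1 = right
model, seen = {}, []

for s in range(goal):                  # one real episode, walking right
    s2, done = s + 1, (s + 1 == goal)
    r = 1.0 if done else 0.0
    target = r if done else r + gamma * np.max(Q[s2])
    Q[s, 1] += alpha * (target - Q[s, 1])       # Step 1: direct RL
    model[(s, 1)] = (r, s2, done)               # Step 2: model learning
    seen.append((s, 1))

assert np.max(Q[0]) == 0.0  # one pass of direct RL: no value at the start yet

rng = np.random.default_rng(0)
for _ in range(2000):                  # Step 3: planning replays
    s_p, a_p = seen[rng.integers(len(seen))]
    r_p, s2_p, d_p = model[(s_p, a_p)]
    t_p = r_p if d_p else r_p + gamma * np.max(Q[s2_p])
    Q[s_p, a_p] += alpha * (t_p - Q[s_p, a_p])

print(Q[0, 1])                         # close to the optimal 0.95**5 ~= 0.77
```

Without planning, value information crawls backward one state per visit; with planning, a single real episode plus cheap replays is enough to back the value up to the start.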

Dyna with Stochastic Models

The basic Dyna uses a deterministic model. For stochastic environments, we need to handle uncertainty.

</>Implementation
class DynaQStochastic:
    """
    Dyna-Q with stochastic model support.

    Learns transition probabilities from counts.
    """

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99,
                 epsilon=0.1, planning_steps=10):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.planning_steps = planning_steps

        self.Q = np.zeros((n_states, n_actions))

        # Stochastic model: count-based
        self.transition_counts = np.zeros((n_states, n_actions, n_states))
        self.reward_sums = np.zeros((n_states, n_actions))
        self.visit_counts = np.zeros((n_states, n_actions))
        # Remember which (s, a, s') transitions ended an episode
        self.terminal = np.zeros((n_states, n_actions, n_states), dtype=bool)

    def select_action(self, state):
        """Epsilon-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.Q[state])

    def update_model(self, s, a, r, s_prime, done):
        """Update the count-based stochastic model."""
        self.transition_counts[s, a, s_prime] += 1
        self.reward_sums[s, a] += r
        self.visit_counts[s, a] += 1
        if done:
            self.terminal[s, a, s_prime] = True

    def sample_from_model(self, s, a):
        """Sample a transition from the learned model."""
        if self.visit_counts[s, a] == 0:
            return None, None, None

        # Sample next state from learned transition probabilities
        probs = self.transition_counts[s, a] / self.visit_counts[s, a]
        s_prime = np.random.choice(self.n_states, p=probs)

        # Expected reward, averaged over observed transitions
        r = self.reward_sums[s, a] / self.visit_counts[s, a]

        return r, s_prime, self.terminal[s, a, s_prime]

    def update(self, s, a, r, s_prime, done):
        """Full Dyna-Q update with stochastic model."""
        # Direct RL
        target = r if done else r + self.gamma * np.max(self.Q[s_prime])
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])

        # Model learning
        self.update_model(s, a, r, s_prime, done)

        # Planning: sample previously visited state-action pairs
        visited = np.argwhere(self.visit_counts > 0)
        for _ in range(self.planning_steps):
            s_sim, a_sim = visited[np.random.randint(len(visited))]

            # Sample from the stochastic model
            r_sim, s_prime_sim, done_sim = self.sample_from_model(s_sim, a_sim)

            # Don't bootstrap through terminal transitions
            if done_sim:
                target_sim = r_sim
            else:
                target_sim = r_sim + self.gamma * np.max(self.Q[s_prime_sim])
            self.Q[s_sim, a_sim] += self.alpha * (target_sim - self.Q[s_sim, a_sim])
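To see the count-based model in isolation, we can feed it transitions drawn from a known rule and check that the learned estimates converge. A minimal sketch (the 70/30 transition rule below is invented for illustration):

```python
import numpy as np

n_states, n_actions = 3, 1
transition_counts = np.zeros((n_states, n_actions, n_states))
reward_sums = np.zeros((n_states, n_actions))
visit_counts = np.zeros((n_states, n_actions))

rng = np.random.default_rng(0)
# Ground truth: from (s=0, a=0), go to state 1 with prob 0.7 (reward 1.0),
# otherwise to state 2 (reward 0.0)
for _ in range(10_000):
    if rng.random() < 0.7:
        s_prime, r = 1, 1.0
    else:
        s_prime, r = 2, 0.0
    transition_counts[0, 0, s_prime] += 1
    reward_sums[0, 0] += r
    visit_counts[0, 0] += 1

probs = transition_counts[0, 0] / visit_counts[0, 0]
r_hat = reward_sums[0, 0] / visit_counts[0, 0]
print(probs, r_hat)   # roughly [0.0, 0.7, 0.3] and 0.7
```

With enough visits, the empirical frequencies recover the true transition probabilities, which is exactly what `sample_from_model` draws from.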

Dyna-Q+ and Changing Environments

What happens when the environment changes? The model becomes outdated, and planning with an old model can be harmful.

Dyna-Q+ addresses this by adding a bonus for exploration. The longer it’s been since we tried a state-action pair, the more optimistic we become about it. This encourages revisiting old experiences to check if anything has changed.

It’s like periodically revisiting old restaurants to see if they’ve improved, rather than always going to your current favorite.

Mathematical Details

In Dyna-Q+, planning updates use an augmented reward:

r^+ = r + \kappa \sqrt{\tau}

where \tau is the number of time steps since (s, a) was last tried in the real environment, and \kappa is a small constant (e.g., 0.001).

This bonus encourages exploration of state-action pairs that haven’t been visited recently, helping detect environmental changes.
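The square root keeps the bonus negligible at first but lets it accumulate for long-neglected pairs. A quick sanity check of the scale with \kappa = 0.001, plugging values into the formula above:

```python
import numpy as np

kappa = 0.001
for tau in [1, 100, 10_000, 1_000_000]:
    print(tau, kappa * np.sqrt(tau))
# 100 steps of neglect adds only 0.01 to the reward, but a pair untried
# for a million steps earns a bonus of 1.0 -- comparable to a real reward.
```

This is why a tiny \kappa suffices: the bonus only becomes decisive for pairs that have gone unvisited for a very long time.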

</>Implementation
class DynaQPlus:
    """
    Dyna-Q+ with exploration bonus for changing environments.
    """

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99,
                 epsilon=0.1, planning_steps=10, kappa=0.001):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.planning_steps = planning_steps
        self.kappa = kappa  # Exploration bonus coefficient

        self.Q = np.zeros((n_states, n_actions))
        self.model = {}
        self.time_since_visit = np.zeros((n_states, n_actions))
        self.current_time = 0

    def update(self, s, a, r, s_prime, done):
        """Dyna-Q+ update with exploration bonus."""
        self.current_time += 1

        # Direct RL
        target = r if done else r + self.gamma * np.max(self.Q[s_prime])
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])

        # Model learning
        self.model[(s, a)] = (r, s_prime, done)
        self.time_since_visit[s, a] = self.current_time

        # Planning with exploration bonus
        for _ in range(self.planning_steps):
            # Sample from all state-action pairs, not just experienced ones
            s_sim = np.random.randint(self.n_states)
            a_sim = np.random.randint(self.n_actions)

            if (s_sim, a_sim) in self.model:
                r_sim, s_prime_sim, done_sim = self.model[(s_sim, a_sim)]
            else:
                # Never-tried pairs are modeled as leading back to the same
                # state with zero reward (as in Sutton & Barto's Dyna-Q+)
                r_sim, s_prime_sim, done_sim = 0.0, s_sim, False

            # Time since this pair was last tried in the real environment
            tau = self.current_time - self.time_since_visit[s_sim, a_sim]

            # Augmented reward with exploration bonus
            r_bonus = r_sim + self.kappa * np.sqrt(tau)

            if done_sim:
                target_sim = r_bonus
            else:
                target_sim = r_bonus + self.gamma * np.max(self.Q[s_prime_sim])

            self.Q[s_sim, a_sim] += self.alpha * (target_sim - self.Q[s_sim, a_sim])

    def select_action(self, state):
        """Epsilon-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.Q[state])

Comparing Dyna to Alternatives

Pure Q-Learning
  + Simple, no model needed
  + No model errors
  - Poor sample efficiency

Dyna-Q
  + Great sample efficiency
  + Simple architecture
  - Requires good model

Pure Planning (MPC)
  + No value function needed
  + Flexible at test time
  - Expensive per decision

Summary

The Dyna architecture elegantly combines the best of model-free and model-based RL:

  • Sample efficiency from model-based planning
  • Simplicity of Q-learning updates
  • Flexibility to adjust planning effort vs computation

The key insight is that real and simulated experience can be treated identically for learning—the Q-learning update doesn’t care where the (s, a, r, s') tuple came from. This makes Dyna easy to implement and reason about.

In the next section, we’ll see how modern methods like MuZero take model-based RL to the next level by learning abstract models optimized for planning rather than state prediction.