Model-Based Reinforcement Learning
What You'll Learn
- Distinguish between model-free and model-based RL
- Explain the sample efficiency advantage of model-based methods
- Implement the Dyna architecture
- Describe how to learn environment models
- Explain the model bias problem and how to mitigate it
- Describe modern model-based methods including MuZero
Every model-free method we’ve seen learns by trial and error in the real world. But what if the agent could imagine experiences? What if it had a model of how the world works and could plan ahead, like a chess player thinking several moves ahead?
This is the fundamental insight behind model-based reinforcement learning: if you understand how the world works, you can simulate experiences in your head and learn from them without ever actually interacting with the environment.
Think about how you plan a road trip. You don’t need to actually drive every possible route to know which is fastest. Instead, you have a mental model of the road network, traffic patterns, and driving times. You use this model to plan, evaluating routes in your imagination before committing to one.
Model-based RL gives agents this same capability: learn how the world works, then use that knowledge to plan efficiently.
Chapter Overview
The Big Picture
A model (or world model) is a learned approximation of the environment's dynamics: the transition function that predicts next states, and the reward function that predicts rewards. With a model, agents can "imagine" experiences without actually interacting with the environment.
Model-based RL learns a model of the environment (how states transition, what rewards occur) and uses it to plan or generate synthetic experience. This can be far more sample-efficient than model-free learning, but introduces the challenge of model errors.
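In the tabular case, "learning a model" can be as simple as recording observed transitions and replaying them on demand. The sketch below assumes a deterministic environment (so one observed transition per state-action pair suffices); the `record` and `imagine` names are illustrative, not from any library.

```python
# Minimal tabular world model: store observed transitions, then query the
# model to "imagine" experience without touching the real environment.
# Assumes deterministic dynamics, so the last observation is the model.
model = {}  # model[(state, action)] = (reward, next_state)

def record(state, action, reward, next_state):
    """Update the model from one real interaction."""
    model[(state, action)] = (reward, next_state)

def imagine(state, action):
    """Query the model instead of the environment; None if never observed."""
    return model.get((state, action))

record(0, "right", 0.0, 1)
record(1, "right", 1.0, 2)
print(imagine(0, "right"))  # (0.0, 1) — replayed from the model, not the env
```

For stochastic environments, the model would instead store counts or a distribution over `(reward, next_state)` outcomes per state-action pair.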
Consider the difference:
- Model-free: Learn solely from real interactions. Each environment step is precious and used once.
- Model-based: Learn a model from real interactions, then use that model to generate unlimited simulated experience.
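The Dyna architecture combines both: each real step both updates the value function directly and updates a learned model, and the model is then sampled for extra "imagined" updates. Below is a minimal Dyna-Q sketch on a hypothetical five-state corridor task (reach state 4 by moving right); the environment and hyperparameters are illustrative.

```python
import random
from collections import defaultdict

# Toy deterministic corridor: states 0..4, actions 0 (left) / 1 (right).
# Reward 1.0 for reaching the terminal goal state 4, otherwise 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    return nxt, float(nxt == GOAL), nxt == GOAL  # next_state, reward, done

def dyna_q(episodes=30, planning_steps=20, alpha=0.5, gamma=0.95,
           eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)] action values
    model = {}              # model[(state, action)] = (reward, next, done)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection on real environment.
            if rng.random() < eps:
                a = rng.randrange(N_ACTIONS)
            else:
                a = max(range(N_ACTIONS), key=lambda b: Q[(s, b)])
            s2, r, done = step(s, a)
            # (1) Direct RL update from real experience.
            target = r if done else r + gamma * max(
                Q[(s2, b)] for b in range(N_ACTIONS))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (2) Model learning: record the observed transition.
            model[(s, a)] = (r, s2, done)
            # (3) Planning: extra updates from imagined transitions.
            for _ in range(planning_steps):
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                ptarget = pr if pdone else pr + gamma * max(
                    Q[(ps2, b)] for b in range(N_ACTIONS))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q

Q = dyna_q()
```

With `planning_steps=0` this degenerates to plain Q-learning; raising it lets each real step fund many imagined updates, which is exactly the sample-efficiency trade described above.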
It’s like the difference between learning to cook by only making real dishes (expensive, slow) versus first understanding the principles of cooking (what flavors combine well, how heat affects food) and then practicing in your head before touching ingredients.
When to Use Model-Based RL
Model-based methods are a good fit when:
- Real interactions are expensive or slow
- Safety is critical (robotics, healthcare)
- The environment dynamics are relatively simple
- You need sample efficiency
Model-free methods may be the better choice when:
- Environment dynamics are very complex
- You have abundant simulation access
- Model errors would compound badly
- Compute is more expensive than samples
New to model-based RL? Begin with Learning World Models to understand what models are and how they’re learned, then continue through the sections in order.
Key Takeaways
- Model-based RL learns a model of the environment to enable planning and sample-efficient learning
- The Dyna architecture combines real experience with simulated experience from the model
- Model-based methods are more sample-efficient but require accurate models
- Model errors can compound during planning, leading to poor policies
- Modern methods like MuZero learn abstract models optimized for planning, not state prediction