Model-Based Reinforcement Learning
What You'll Learn
- Distinguish between model-free and model-based RL
- Explain the sample efficiency advantage of model-based methods
- Implement the Dyna architecture
- Describe how to learn environment models
- Explain the model bias problem and how to mitigate it
- Describe modern model-based methods including MuZero
Every model-free method we’ve seen learns by trial and error in the real world. But what if the agent could imagine experiences? What if it had a model of how the world works and could plan ahead, like a chess player thinking several moves ahead?
This is the fundamental insight behind model-based reinforcement learning: if you understand how the world works, you can simulate experiences in your head and learn from them without ever actually interacting with the environment.
Think about how you plan a road trip. You don’t need to actually drive every possible route to know which is fastest. Instead, you have a mental model of the road network, traffic patterns, and driving times. You use this model to plan, evaluating routes in your imagination before committing to one.
Model-based RL gives agents this same capability: learn how the world works, then use that knowledge to plan efficiently.
Chapter Overview
The Big Picture
A model (or world model) is a learned approximation of the environment's dynamics: the transition function that predicts next states, and the reward function that predicts rewards. With a model, agents can "imagine" experiences without actually interacting with the environment.
Model-based RL learns a model of the environment (how states transition, what rewards occur) and uses it to plan or generate synthetic experience. This can be far more sample-efficient than model-free learning, but introduces the challenge of model errors.
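In the tabular case, "learning a model" can be as simple as recording observed transitions and replaying them on demand. The sketch below assumes a deterministic environment (so one observed transition per state-action pair suffices); the `record` and `imagine` names are illustrative, not from any library.

```python
# Minimal tabular world model: store observed transitions, then query the
# model to "imagine" experience without touching the real environment.
# Assumes deterministic dynamics, so the last observation is the model.
model = {}  # model[(state, action)] = (reward, next_state)

def record(state, action, reward, next_state):
    """Update the model from one real interaction."""
    model[(state, action)] = (reward, next_state)

def imagine(state, action):
    """Query the model instead of the environment; None if never observed."""
    return model.get((state, action))

record(0, "right", 0.0, 1)
record(1, "right", 1.0, 2)
print(imagine(0, "right"))  # (0.0, 1) — replayed from the model, not the env
```

For stochastic environments, the model would instead store counts or a distribution over `(reward, next_state)` outcomes per state-action pair.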
Consider the difference:
- Model-free: Learn solely from real interactions. Each environment step is precious and used once.
- Model-based: Learn a model from real interactions, then use that model to generate unlimited simulated experience.
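The Dyna architecture combines both: each real step both updates the value function directly and updates a learned model, and the model is then sampled for extra "imagined" updates. Below is a minimal Dyna-Q sketch on a hypothetical five-state corridor task (reach state 4 by moving right); the environment and hyperparameters are illustrative.

```python
import random
from collections import defaultdict

# Toy deterministic corridor: states 0..4, actions 0 (left) / 1 (right).
# Reward 1.0 for reaching the terminal goal state 4, otherwise 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    return nxt, float(nxt == GOAL), nxt == GOAL  # next_state, reward, done

def dyna_q(episodes=30, planning_steps=20, alpha=0.5, gamma=0.95,
           eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)] action values
    model = {}              # model[(state, action)] = (reward, next, done)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection on real environment.
            if rng.random() < eps:
                a = rng.randrange(N_ACTIONS)
            else:
                a = max(range(N_ACTIONS), key=lambda b: Q[(s, b)])
            s2, r, done = step(s, a)
            # (1) Direct RL update from real experience.
            target = r if done else r + gamma * max(
                Q[(s2, b)] for b in range(N_ACTIONS))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (2) Model learning: record the observed transition.
            model[(s, a)] = (r, s2, done)
            # (3) Planning: extra updates from imagined transitions.
            for _ in range(planning_steps):
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                ptarget = pr if pdone else pr + gamma * max(
                    Q[(ps2, b)] for b in range(N_ACTIONS))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q

Q = dyna_q()
```

With `planning_steps=0` this degenerates to plain Q-learning; raising it lets each real step fund many imagined updates, which is exactly the sample-efficiency trade described above.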
It’s like the difference between learning to cook by only making real dishes (expensive, slow) versus first understanding the principles of cooking (what flavors combine well, how heat affects food) and then practicing in your head before touching ingredients.
When to Use Model-Based RL
Model-based methods are a good fit when:
- Real interactions are expensive or slow
- Safety is critical (robotics, healthcare)
- The environment dynamics are relatively simple
- You need sample efficiency
Model-free methods may be the better choice when:
- Environment dynamics are very complex
- You have abundant simulation access
- Model errors would compound badly
- Compute is more expensive than samples
New to model-based RL? Begin with Learning World Models to understand what models are and how they’re learned, then continue through the sections in order.
Key Takeaways
- Model-based RL learns a model of the environment to enable planning and sample-efficient learning
- The Dyna architecture combines real experience with simulated experience from the model
- Model-based methods are more sample-efficient but require accurate models
- Model errors can compound during planning, leading to poor policies
- Modern methods like MuZero learn abstract models optimized for planning, not state prediction