Introduction to Markov Decision Processes
What You'll Learn
- Understand why bandits are insufficient for sequential decision problems
- Define all components of an MDP: states, actions, transitions, rewards, discount
- Explain the Markov property and why it makes sequential problems tractable
- Construct simple MDPs from problem descriptions
In bandits, every pull of the lever was independent—your choice didn’t change the world. But what if your actions have lasting consequences? What if where you go affects where you can go next?
Welcome to Markov Decision Processes (MDPs), the mathematical language of sequential decision-making.
Interactive GridWorld
[Interactive demo: an agent learns to reach the goal; step through its actions one at a time.]
Why MDPs?
Consider teaching a robot to navigate a building. The robot’s current position determines which positions it can move to next. A bandit framework fails here because:
- Actions change the state (robot’s position)
- Future rewards depend on the sequence of actions
- There’s no concept of “independent trials”
MDPs add what bandits lack: state and state transitions. They provide a formal framework for problems where decisions cascade through time.
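The difference can be made concrete in a few lines. The tiny two-room robot model below is an illustrative assumption (the state names and transition table are invented for this sketch, not part of the chapter): the same action leads to different next states depending on where the robot currently is, which is exactly what a bandit framework cannot express.

```python
# Illustrative sketch: in an MDP, the outcome of an action depends on
# the current state. (This toy transition table is an assumption made
# for illustration only.)

# Deterministic transition table: (state, action) -> next state
transitions = {
    ("hallway", "forward"): "office",
    ("office", "forward"): "office",   # wall ahead: the robot stays put
    ("office", "back"): "hallway",
}

def step(state, action):
    """Return the next state; undefined moves leave the state unchanged."""
    return transitions.get((state, action), state)

# The same action, taken from different states, gives different results:
print(step("hallway", "forward"))  # -> office
print(step("office", "forward"))   # -> office (blocked by the wall)
```

In a bandit, `step` would ignore `state` entirely; here, state is the whole point.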
Chapter Overview
This chapter introduces the MDP framework—the mathematical foundation for nearly all of reinforcement learning. We’ll cover:
- From Bandits to MDPs: why bandits aren't enough when actions have lasting consequences
- The MDP Components: states, actions, transitions, rewards, and the discount factor
- The Markov Property: the key assumption that makes RL tractable
The Big Picture
An MDP describes a world where an agent:
- Observes its current state
- Takes an action
- Transitions to a new state (possibly randomly)
- Receives a reward
The goal is to find a policy—a way of choosing actions—that maximizes long-term cumulative reward.
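The observe-act-transition-reward loop above can be sketched in code. Everything concrete here is an illustrative assumption: a five-state chain environment, stochastic transitions that succeed with probability 0.8, a reward only at the goal, and a uniformly random policy.

```python
import random

# A tiny MDP: states 0..4 on a line; the goal is state 4.
# Actions: -1 (left) or +1 (right). A move succeeds with probability 0.8;
# otherwise the agent stays put (a simple stochastic transition).
GOAL, N_STATES = 4, 5

def transition(state, action):
    if random.random() < 0.8:
        return min(max(state + action, 0), N_STATES - 1)
    return state

def reward(state):
    return 1.0 if state == GOAL else 0.0

def random_policy(state):
    return random.choice([-1, +1])

# The agent-environment loop: observe, act, transition, collect reward.
random.seed(0)
state, total_reward = 0, 0.0
for t in range(50):
    action = random_policy(state)           # 1. choose an action
    next_state = transition(state, action)  # 2. (possibly random) transition
    total_reward += reward(next_state)      # 3. receive a reward
    state = next_state                      # 4. observe the new state
    if state == GOAL:
        break

print(f"ended in state {state} after {t + 1} steps, return = {total_reward}")
```

Finding a good policy means replacing `random_policy` with something that uses the state intelligently; that is what the rest of the book is about.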
In short: an MDP is a formal framework for sequential decision-making in which outcomes depend on both the current state and the chosen action. The "Markov" property means the future depends only on the present state, not on how you got there.
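Written formally, using the standard notation the chapter develops later, an MDP is the tuple

\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma),
\qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s,\, A_t = a),
\]

where \(\mathcal{S}\) is the set of states, \(\mathcal{A}\) the set of actions, \(P\) the transition probabilities, \(R\) the reward function, and \(\gamma \in [0, 1]\) the discount factor. The Markov property is the statement that conditioning on more history changes nothing:

\[
\Pr(S_{t+1} \mid S_t, A_t) = \Pr(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0).
\]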
Prerequisites
This chapter assumes you’re comfortable with:
- The agent-environment interaction from The RL Framework
- Basic probability notation
- The exploration-exploitation tradeoff from bandits
Key Questions We’ll Answer
- What exactly is a “state” and how do we represent it?
- How do we describe uncertain transitions mathematically?
- Why is the discount factor necessary for some problems?
- What does “Markov” really mean, and why does it matter?
Key Takeaways
- MDPs extend bandits by adding states and transitions
- The framework consists of five components: states, actions, transitions, rewards, and the discount factor
- The Markov property says the present contains all relevant history
- Designing good states is as important as choosing good algorithms