Introduction to Markov Decision Processes
What You'll Learn
- Understand why bandits are insufficient for sequential decision problems
- Define all components of an MDP: states, actions, transitions, rewards, discount
- Explain the Markov property and why it makes sequential problems tractable
- Construct simple MDPs from problem descriptions
In bandits, every pull of the lever was independent—your choice didn’t change the world. But what if your actions have lasting consequences? What if where you go affects where you can go next?
Welcome to Markov Decision Processes (MDPs), the mathematical language of sequential decision-making.
Interactive GridWorld
[Interactive demo: an agent learns to reach the goal; step through its actions one at a time.]
Why MDPs?
Consider teaching a robot to navigate a building. The robot’s current position determines which positions it can move to next. A bandit framework fails here because:
- Actions change the state (robot’s position)
- Future rewards depend on the sequence of actions
- There’s no concept of “independent trials”
MDPs add what bandits lack: state and state transitions. They provide a formal framework for problems where decisions cascade through time.
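The difference can be made concrete in a few lines. The tiny two-room robot model below is an illustrative assumption (the state names and transition table are invented for this sketch, not part of the chapter): the same action leads to different next states depending on where the robot currently is, which is exactly what a bandit framework cannot express.

```python
# Illustrative sketch: in an MDP, the outcome of an action depends on
# the current state. (This toy transition table is an assumption made
# for illustration only.)

# Deterministic transition table: (state, action) -> next state
transitions = {
    ("hallway", "forward"): "office",
    ("office", "forward"): "office",   # wall ahead: the robot stays put
    ("office", "back"): "hallway",
}

def step(state, action):
    """Return the next state; undefined moves leave the state unchanged."""
    return transitions.get((state, action), state)

# The same action, taken from different states, gives different results:
print(step("hallway", "forward"))  # -> office
print(step("office", "forward"))   # -> office (blocked by the wall)
```

In a bandit, `step` would ignore `state` entirely; here, state is the whole point.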
Chapter Overview
This chapter introduces the MDP framework—the mathematical foundation for nearly all of reinforcement learning. We’ll cover:
- From Bandits to MDPs: why bandits aren't enough when actions have lasting consequences
- The MDP Components: states, actions, transitions, rewards, and the discount factor
- The Markov Property: the key assumption that makes RL tractable
The Big Picture
An MDP describes a world where an agent:
- Observes its current state
- Takes an action
- Transitions to a new state (possibly randomly)
- Receives a reward
The goal is to find a policy—a way of choosing actions—that maximizes long-term cumulative reward.
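The observe-act-transition-reward loop above can be sketched in code. Everything concrete here is an illustrative assumption: a five-state chain environment, stochastic transitions that succeed with probability 0.8, a reward only at the goal, and a uniformly random policy.

```python
import random

# A tiny MDP: states 0..4 on a line; the goal is state 4.
# Actions: -1 (left) or +1 (right). A move succeeds with probability 0.8;
# otherwise the agent stays put (a simple stochastic transition).
GOAL, N_STATES = 4, 5

def transition(state, action):
    if random.random() < 0.8:
        return min(max(state + action, 0), N_STATES - 1)
    return state

def reward(state):
    return 1.0 if state == GOAL else 0.0

def random_policy(state):
    return random.choice([-1, +1])

# The agent-environment loop: observe, act, transition, collect reward.
random.seed(0)
state, total_reward = 0, 0.0
for t in range(50):
    action = random_policy(state)           # 1. choose an action
    next_state = transition(state, action)  # 2. (possibly random) transition
    total_reward += reward(next_state)      # 3. receive a reward
    state = next_state                      # 4. observe the new state
    if state == GOAL:
        break

print(f"ended in state {state} after {t + 1} steps, return = {total_reward}")
```

Finding a good policy means replacing `random_policy` with something that uses the state intelligently; that is what the rest of the book is about.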
In short: an MDP is a formal framework for sequential decision-making in which outcomes depend on both the current state and the chosen action. The "Markov" property means the future depends only on the present state, not on how you got there.
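Written formally, using the standard notation the chapter develops later, an MDP is the tuple

\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma),
\qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s,\, A_t = a),
\]

where \(\mathcal{S}\) is the set of states, \(\mathcal{A}\) the set of actions, \(P\) the transition probabilities, \(R\) the reward function, and \(\gamma \in [0, 1]\) the discount factor. The Markov property is the statement that conditioning on more history changes nothing:

\[
\Pr(S_{t+1} \mid S_t, A_t) = \Pr(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0).
\]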
Prerequisites
This chapter assumes you’re comfortable with:
- The agent-environment interaction from The RL Framework
- Basic probability notation
- The exploration-exploitation tradeoff from bandits
Key Questions We’ll Answer
- What exactly is a “state” and how do we represent it?
- How do we describe uncertain transitions mathematically?
- Why is the discount factor necessary for some problems?
- What does “Markov” really mean, and why does it matter?
Key Takeaways
- MDPs extend bandits by adding states and transitions
- The framework consists of five components: states, actions, transitions, rewards, and the discount factor
- The Markov property says the present contains all relevant history
- Designing good states is as important as choosing good algorithms