Chapter 11

Introduction to MDPs

The mathematical framework that formalizes sequential decision-making

What You'll Learn

  • Understand why bandits are insufficient for sequential decision problems
  • Define all components of an MDP: states, actions, transitions, rewards, discount
  • Explain the Markov property and why it’s so powerful
  • Construct simple MDPs from problem descriptions

In bandits, every pull of the lever was independent—your choice didn’t change the world. But what if your actions have lasting consequences? What if where you go affects where you can go next?

Welcome to Markov Decision Processes (MDPs), the mathematical language of sequential decision-making.

Interactive GridWorld

Watch an agent learn to reach the goal. Click "Step" to see each action.

[Interactive demo: the agent 🤖 navigates a grid toward the goal 🎯, with counters for steps taken and total reward. Legend: 🤖 agent; 🎯 goal (+10); empty cell (−1 per step).]
What's happening? The agent has already learned an optimal policy (shown with arrows when you click "Show Policy"). Each step costs -1 reward, encouraging the shortest path. Reaching the goal gives +10 reward. A well-trained agent maximizes total reward.
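The reward structure described above can be sketched in code. This is a minimal, hypothetical GridWorld, not the actual implementation behind the demo: every move costs −1, reaching the goal pays +10, and walking into a wall leaves the agent in place.

```python
# Minimal GridWorld sketch (illustrative, not the demo's actual code):
# -1 reward per step, +10 for reaching the goal.
class GridWorld:
    def __init__(self, width=4, height=4, goal=(3, 3)):
        self.width, self.height = width, height
        self.goal = goal
        self.state = (0, 0)  # agent starts in the top-left corner

    def step(self, action):
        """Apply a move; return (new_state, reward, done)."""
        moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
        dx, dy = moves[action]
        x, y = self.state
        # Clamp to the grid: moving into a wall leaves the agent in place.
        x = min(max(x + dx, 0), self.width - 1)
        y = min(max(y + dy, 0), self.height - 1)
        self.state = (x, y)
        if self.state == self.goal:
            return self.state, +10, True   # goal reached
        return self.state, -1, False       # ordinary step cost

env = GridWorld()
total = 0
for a in ["right", "right", "right", "down", "down", "down"]:
    state, reward, done = env.step(a)
    total += reward
# total is now 5: five steps at -1, then +10 on the step that reaches the goal.
```

Because each step is penalized, a reward-maximizing agent is pushed toward the shortest path, which is exactly the behavior the demo shows.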

Why MDPs?

Consider teaching a robot to navigate a building. The robot’s current position determines which positions it can move to next. A bandit framework fails here because:

  • Actions change the state (robot’s position)
  • Future rewards depend on the sequence of actions
  • There’s no concept of “independent trials”

MDPs add what bandits lack: state and state transitions. They provide a formal framework for problems where decisions cascade through time.

Chapter Overview

This chapter introduces the MDP framework—the mathematical foundation for nearly all of reinforcement learning.

The Big Picture

An MDP describes a world where an agent:

  1. Observes its current state
  2. Takes an action
  3. Transitions to a new state (possibly randomly)
  4. Receives a reward

The goal is to find a policy—a way of choosing actions—that maximizes long-term cumulative reward.
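The four-step loop above can be sketched directly. The tiny two-state MDP below is invented for illustration (its states, actions, and rewards are not from this chapter); it shows the agent observing a state, acting under a fixed policy, transitioning stochastically, and collecting reward.

```python
import random

# A sketch of the observe-act-transition-reward loop on a made-up
# two-state MDP (illustrative names and numbers, not from the text).
P = {  # P[state][action] -> list of (next_state, probability) pairs
    "cool": {"fast": [("cool", 0.5), ("hot", 0.5)], "slow": [("cool", 1.0)]},
    "hot":  {"fast": [("hot", 1.0)],                "slow": [("cool", 1.0)]},
}
R = {  # R[(state, action)] -> immediate reward
    ("cool", "fast"): 2, ("cool", "slow"): 1,
    ("hot", "fast"): -10, ("hot", "slow"): 0,
}

def step(state, action):
    """One environment transition: sample s' from P(.|s, a), emit R(s, a)."""
    next_states, probs = zip(*P[state][action])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, R[(state, action)]

def policy(state):
    """A fixed policy: go fast while cool, cool down when hot."""
    return "fast" if state == "cool" else "slow"

state, total = "cool", 0
for t in range(10):
    action = policy(state)               # 1-2. observe state, take action
    state, reward = step(state, action)  # 3-4. transition and receive reward
    total += reward                      # accumulate long-term reward
```

Note that `step` looks only at the current `(state, action)` pair, never at the history: that restriction is the Markov property in code form.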

📖 Markov Decision Process

A formal framework for sequential decision-making where outcomes depend on both the current state and chosen action. The “Markov” property means the future depends only on the present state, not on how you got there.
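Written formally, the Markov property says that conditioning on the entire history adds nothing beyond the current state and action:

```latex
\Pr(S_{t+1} = s' \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0)
  \;=\; \Pr(S_{t+1} = s' \mid S_t, A_t)
```

The left side conditions on everything that ever happened; the right side conditions only on the present. Their equality is what makes the compact MDP machinery possible.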

Prerequisites

This chapter assumes you’re comfortable with:

  • The agent-environment interaction from The RL Framework
  • Basic probability notation
  • The exploration-exploitation tradeoff from bandits

Key Questions We’ll Answer

  • What exactly is a “state” and how do we represent it?
  • How do we describe uncertain transitions mathematically?
  • Why is the discount factor necessary for some problems?
  • What does “Markov” really mean, and why does it matter?

Key Takeaways

  • MDPs extend bandits by adding states and transitions
  • The framework consists of five components: $\mathcal{S}, \mathcal{A}, P, R, \gamma$
  • The Markov property says the present contains all relevant history
  • Designing good states is as important as choosing good algorithms
Next Chapter: Value Functions