Markov Decision Processes • Part 1 of 3
📝Draft

From Bandits to Sequential Decisions

Why we need states: when actions have lasting consequences

From Bandits to Sequential Decisions

In the bandit problems we explored earlier, every decision was independent. Pull an arm, get a reward, repeat. Your choice didn’t affect what would happen next time.

But most interesting problems aren’t like that. Consider:

  • Playing chess: Your move changes the board, affecting all future possibilities
  • Treating a patient: Today’s medication affects tomorrow’s symptoms
  • Navigating a city: Where you go now determines where you can go next

These are sequential decision problems. Your actions don’t just generate rewards—they change the world.

What Bandits Miss

Imagine you’re a robot at a T-intersection. You can go left or right.

  • Go left: You find a +5 reward, but now you’re at a dead end
  • Go right: You find a +1 reward, but the path continues to a +100 treasure

A bandit would tell you: “Left is better! +5 > +1!”

But that’s wrong. Right leads to much greater total reward. The bandit framework fails because it ignores where your action takes you.
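The gap between the two views can be made concrete in a few lines. The numbers below are the illustrative rewards from the T-intersection example above:

```python
# Illustrative rewards from the T-intersection example.
immediate = {"left": 5, "right": 1}
downstream = {"left": 0, "right": 100}  # left dead-ends; right continues to the treasure

# Bandit view: rank actions by immediate reward alone.
bandit_choice = max(immediate, key=immediate.get)

# Sequential view: rank actions by total reward along each path.
total = {a: immediate[a] + downstream[a] for a in immediate}
sequential_choice = max(total, key=total.get)

print(bandit_choice)      # left
print(sequential_choice)  # right
```

The bandit picks left (+5 beats +1); accounting for what comes next flips the answer, since right totals +101.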

The key insight is that in sequential problems:

  1. Actions have lasting consequences beyond immediate reward
  2. The situation changes based on what you do
  3. Future rewards depend on your current choice

We need a framework that captures this—and that’s the Markov Decision Process.

The Missing Piece: State

📖State

A complete description of the situation that the agent faces. The state contains all information needed to predict future rewards and transitions.

States are the key addition that MDPs bring. In bandits, there was implicitly only one state—you were always in the same situation. In MDPs:

  • States represent where you are
  • Actions move you between states
  • Rewards depend on states and transitions
📌GridWorld as an MDP


Consider a 4×4 grid where an agent wants to reach a goal:

[Grid diagram: the agent 🤖 and the goal 🎯 placed in different cells of a 4×4 grid]
  • States: Each of the 16 cells is a different state
  • Actions: Up, Down, Left, Right
  • Transitions: Moving in a direction changes which cell you’re in
  • Rewards: -1 per step (encourages efficiency), +10 for reaching the goal

This is fundamentally different from a bandit. The “left” action from cell (0,0) does something completely different from “left” from cell (2,1).
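The GridWorld above can be sketched as code. This is a minimal illustration, not a standard library environment; placing the goal in the corner cell (3, 3) is an assumption for the sake of the example:

```python
# Minimal 4x4 GridWorld sketch. States are (row, col) cells.
# Assumption: the goal sits in the corner cell (3, 3).
GOAL = (3, 3)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; bumping into a wall leaves the state unchanged."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < 4 and 0 <= nc < 4):  # off the grid: stay put
        nr, nc = r, c
    next_state = (nr, nc)
    reward = 10 if next_state == GOAL else -1  # -1 per step, +10 at the goal
    return next_state, reward

# "left" from (0, 0) bumps the wall; "left" from (2, 1) actually moves you.
print(step((0, 0), "left"))  # ((0, 0), -1)
print(step((2, 1), "left"))  # ((2, 0), -1)
```

Note how the same action name produces different outcomes depending on the current state, which is exactly what the bandit framework cannot express.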

Sequential vs Independent Decisions

Let’s make the distinction concrete:

🎰 Bandit Problems

  • Each decision is independent
  • No concept of “state”
  • Action → Reward, that’s it
  • Goal: Find the best single action

🗺️ Sequential Decisions (MDPs)

  • Decisions affect future situations
  • Current state determines options
  • Action → Reward + New State
  • Goal: Find the best sequence of actions
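The contrast shows up directly in the shape of the interaction loop. A rough sketch with made-up function names and toy dynamics:

```python
import random

def bandit_pull(arm):
    """Bandit: an action yields only a reward; there is no state."""
    means = {"A": 0.5, "B": 0.8}  # illustrative arm means
    return random.gauss(means[arm], 1.0)

def mdp_step(state, action):
    """MDP: an action yields a reward AND a new state."""
    next_state = state + 1 if action == "advance" else state  # toy transition
    reward = 1 if action == "advance" else 0
    return next_state, reward

# Bandit loop: every round is the same situation.
r = bandit_pull("A")

# MDP loop: the state threads through the whole interaction.
s = 0
for _ in range(3):
    s, r = mdp_step(s, "advance")
print(s)  # 3
```

The only structural difference is the extra return value, but it changes the goal entirely: the MDP loop has to plan over sequences, not single pulls.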

When Do You Need an MDP?

Ask yourself these questions:

💡Tip

If you answer “yes” to any of these, you’re dealing with a sequential decision problem:

  1. Does my action change what happens next?
  2. Do I need to think multiple steps ahead?
  3. Does the best choice now depend on where I’ll end up?

Here are some examples categorized by type:

| Type | Example | Why |
| --- | --- | --- |
| Bandit | Choosing which ad to show | Each impression is independent |
| MDP | Playing a video game | Actions change the game state |
| Bandit | A/B testing a webpage | Users don’t return (or we treat each visit independently) |
| MDP | Training a robot arm | Each movement changes position |
| Depends | Recommending a movie | Single recommendation = bandit; session = MDP |

Mathematical Details

Formally, a bandit is actually a special case of an MDP with only one state. If we have state space $\mathcal{S} = \{s_0\}$, then:

  • Every action returns you to the same state
  • No sequential structure exists
  • It reduces to: pick the best arm

MDPs generalize this by allowing multiple states and transitions between them.
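A small sketch makes the reduction explicit. The state, action, and reward values below are illustrative:

```python
# Sketch: a one-state MDP collapses to a bandit (values illustrative).
S = {"s0"}                    # single state
A = ["arm1", "arm2", "arm3"]  # actions = arms
R = {"arm1": 0.2, "arm2": 0.9, "arm3": 0.5}  # expected reward per action

def transition(state, action):
    """Every action returns the agent to the same (only) state."""
    return "s0"

# With no state to track, "solve the MDP" reduces to "pick the best arm".
best_arm = max(A, key=lambda a: R[a])
print(best_arm)  # arm2
```

Because `transition` is constant, no action has downstream consequences, and the sequential problem degenerates into ranking arms by expected reward.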

The Transition to MDPs

What we need is a framework that captures:

  • States: Different situations the agent can be in
  • Actions: Choices available in each state
  • Transitions: How actions move you between states
  • Rewards: Feedback for each transition
  • Time: Discounting to handle infinite horizons

This is exactly what Markov Decision Processes provide.
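The five components above can be collected into a single container. This is one possible sketch, ahead of the formal definitions in the next section; the field and instance names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# One way to bundle the five MDP components (names are illustrative).
@dataclass
class MDP:
    states: List[str]                                   # S: situations
    actions: Dict[str, List[str]]                       # A(s): choices per state
    transition: Callable[[str, str], Dict[str, float]]  # P(s' | s, a)
    reward: Callable[[str, str, str], float]            # R(s, a, s')
    gamma: float                                        # discount factor for time

# A tiny two-state instance: commuting between home and work.
toy = MDP(
    states=["home", "work"],
    actions={"home": ["commute", "stay"], "work": ["commute", "stay"]},
    transition=lambda s, a: {
        ("work" if s == "home" else "home") if a == "commute" else s: 1.0
    },
    reward=lambda s, a, s2: 1.0 if s2 == "work" else 0.0,
    gamma=0.9,
)
print(toy.transition("home", "commute"))  # {'work': 1.0}
```

The next section gives these components their standard mathematical definitions.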

ℹ️Note

The name “Markov Decision Process” comes from Andrey Markov, who studied stochastic processes with the “memoryless” property. We’ll explore what this means in the Markov Property section.

In the next section, we’ll formalize these five components and see how they work together.