From Bandits to Sequential Decisions
In the bandit problems we explored earlier, every decision was independent. Pull an arm, get a reward, repeat. Your choice didn’t affect what would happen next time.
But most interesting problems aren’t like that. Consider:
- Playing chess: Your move changes the board, affecting all future possibilities
- Treating a patient: Today’s medication affects tomorrow’s symptoms
- Navigating a city: Where you go now determines where you can go next
These are sequential decision problems. Your actions don’t just generate rewards—they change the world.
What Bandits Miss
Imagine you’re a robot at a T-intersection. You can go left or right.
- Go left: You find a +5 reward, but now you’re at a dead end
- Go right: You find a +1 reward, but the path continues to a +100 treasure
A bandit would tell you: “Left is better! +5 > +1!”
But that’s wrong. Right leads to much greater total reward. The bandit framework fails because it ignores where your action takes you.
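The gap between the two is easy to see numerically. This tiny sketch (rewards taken from the example above) contrasts a greedy, bandit-style choice with one that accounts for where each action leads:

```python
# T-intersection example: a bandit compares only immediate rewards,
# while planning compares total returns along each path.
immediate = {"left": 5, "right": 1}
future = {"left": 0, "right": 100}  # left hits a dead end; right continues to the treasure

bandit_choice = max(immediate, key=immediate.get)
total = {a: immediate[a] + future[a] for a in immediate}
planning_choice = max(total, key=total.get)

print(bandit_choice)    # "left"  (greedy on immediate reward: 5 > 1)
print(planning_choice)  # "right" (totals: 5 + 0 = 5 vs 1 + 100 = 101)
```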
The key insight is that in sequential problems:
- Actions have lasting consequences beyond immediate reward
- The situation changes based on what you do
- Future rewards depend on your current choice
We need a framework that captures this—and that’s the Markov Decision Process.
The Missing Piece: State
State: a complete description of the situation the agent faces. The state contains all the information needed to predict future rewards and transitions.
States are the key addition that MDPs bring. In bandits, there was implicitly only one state—you were always in the same situation. In MDPs:
- States represent where you are
- Actions move you between states
- Rewards depend on states and transitions
Consider a 4×4 grid where an agent wants to reach a goal:
- States: Each of the 16 cells is a different state
- Actions: Up, Down, Left, Right
- Transitions: Moving in a direction changes which cell you’re in
- Rewards: -1 per step (encourages efficiency), +10 for reaching the goal
This is fundamentally different from a bandit. Your “left” action from cell (0,0) does something completely different from “left” in cell (2,1).
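The grid above can be sketched in a few lines. This is a minimal illustration, not a full environment; the function and constant names are my own:

```python
# Minimal sketch of the 4x4 grid MDP described above.
SIZE = 4
GOAL = (3, 3)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; bumping into a wall leaves the state unchanged."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < SIZE and 0 <= nc < SIZE):
        nr, nc = r, c  # hit a wall: stay put
    next_state = (nr, nc)
    reward = 10 if next_state == GOAL else -1
    return next_state, reward

# The same action means different things in different states:
print(step((0, 0), "left"))  # ((0, 0), -1) -- wall, nothing happens
print(step((2, 1), "left"))  # ((2, 0), -1) -- actually moves
```

Notice that `step` returns a pair — a reward *and* a next state. That second element is exactly what the bandit framework lacks.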
Sequential vs Independent Decisions
Let’s make the distinction concrete:
🎰 Bandit Problems
- Each decision is independent
- No concept of “state”
- Action → Reward, that’s it
- Goal: Find the best single action
🗺️ Sequential Decisions (MDPs)
- Decisions affect future situations
- Current state determines options
- Action → Reward + New State
- Goal: Find the best sequence of actions
When Do You Need an MDP?
Ask yourself these questions. If you answer “yes” to any of them, you’re dealing with a sequential decision problem:
- Does my action change what happens next?
- Do I need to think multiple steps ahead?
- Does the best choice now depend on where I’ll end up?
Formally, a bandit is actually a special case of an MDP with only one state. If the state space is S = {s} (a single state), then:
- Every action returns you to the same state
- No sequential structure exists
- It reduces to: pick the best arm
MDPs generalize this by allowing multiple states and transitions between them.
The Transition to MDPs
What we need is a framework that captures:
- States: Different situations the agent can be in
- Actions: Choices available in each state
- Transitions: How actions move you between states
- Rewards: Feedback for each transition
- Time: Discounting to handle infinite horizons
This is exactly what Markov Decision Processes provide.
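The five components above can be collected into a plain container. This is only a sketch of the structure (deterministic transitions, names of my own choosing), using the T-intersection example from earlier:

```python
# Sketch of the five MDP components as a plain container.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                       # S: situations the agent can be in
    actions: List[str]                      # A: choices available to the agent
    transition: Dict[Tuple[str, str], str]  # P: (s, a) -> s'  (deterministic sketch)
    reward: Dict[Tuple[str, str], float]    # R: (s, a) -> r
    gamma: float                            # discount factor for handling time

# The T-intersection example as a tiny deterministic MDP:
mdp = MDP(
    states=["junction", "dead_end", "path", "treasure"],
    actions=["left", "right", "forward"],
    transition={("junction", "left"): "dead_end",
                ("junction", "right"): "path",
                ("path", "forward"): "treasure"},
    reward={("junction", "left"): 5.0,
            ("junction", "right"): 1.0,
            ("path", "forward"): 100.0},
    gamma=0.9,
)
```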
The name “Markov Decision Process” comes from Andrey Markov, who studied stochastic processes with the “memoryless” property. We’ll explore what this means in the Markov Property section.
In the next section, we’ll formalize these five components and see how they work together.