From Bandits to Sequential Decisions
In the bandit problems we explored earlier, every decision was independent. Pull an arm, get a reward, repeat. Your choice didn’t affect what would happen next time.
But most interesting problems aren’t like that. Consider:
- Playing chess: Your move changes the board, affecting all future possibilities
- Treating a patient: Today’s medication affects tomorrow’s symptoms
- Navigating a city: Where you go now determines where you can go next
These are sequential decision problems. Your actions don’t just generate rewards—they change the world.
What Bandits Miss
Imagine you’re a robot at a T-intersection. You can go left or right.
- Go left: You find a +5 reward, but now you’re at a dead end
- Go right: You find a +1 reward, but the path continues to a +100 treasure
A bandit would tell you: “Left is better! +5 > +1!”
But that’s wrong. Right leads to much greater total reward. The bandit framework fails because it ignores where your action takes you.
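The gap between the two is easy to see numerically. This tiny sketch (rewards taken from the example above) contrasts a greedy, bandit-style choice with one that accounts for where each action leads:

```python
# T-intersection example: a bandit compares only immediate rewards,
# while planning compares total returns along each path.
immediate = {"left": 5, "right": 1}
future = {"left": 0, "right": 100}  # left hits a dead end; right continues to the treasure

bandit_choice = max(immediate, key=immediate.get)
total = {a: immediate[a] + future[a] for a in immediate}
planning_choice = max(total, key=total.get)

print(bandit_choice)    # "left"  (greedy on immediate reward: 5 > 1)
print(planning_choice)  # "right" (totals: 5 + 0 = 5 vs 1 + 100 = 101)
```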
The key insight is that in sequential problems:
- Actions have lasting consequences beyond immediate reward
- The situation changes based on what you do
- Future rewards depend on your current choice
We need a framework that captures this—and that’s the Markov Decision Process.
The Missing Piece: State
State: a complete description of the situation the agent faces. The state contains all the information needed to predict future rewards and transitions.
States are the key addition that MDPs bring. In bandits, there was implicitly only one state—you were always in the same situation. In MDPs:
- States represent where you are
- Actions move you between states
- Rewards depend on states and transitions
Consider a 4×4 grid where an agent wants to reach a goal:
- States: Each of the 16 cells is a different state
- Actions: Up, Down, Left, Right
- Transitions: Moving in a direction changes which cell you’re in
- Rewards: -1 per step (encourages efficiency), +10 for reaching the goal
This is fundamentally different from a bandit. Your “left” action from cell (0,0) does something completely different from “left” in cell (2,1).
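The grid above can be sketched in a few lines. This is a minimal illustration, not a full environment; the function and constant names are my own:

```python
# Minimal sketch of the 4x4 grid MDP described above.
SIZE = 4
GOAL = (3, 3)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; bumping into a wall leaves the state unchanged."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < SIZE and 0 <= nc < SIZE):
        nr, nc = r, c  # hit a wall: stay put
    next_state = (nr, nc)
    reward = 10 if next_state == GOAL else -1
    return next_state, reward

# The same action means different things in different states:
print(step((0, 0), "left"))  # ((0, 0), -1) -- wall, nothing happens
print(step((2, 1), "left"))  # ((2, 0), -1) -- actually moves
```

Notice that `step` returns a pair — a reward *and* a next state. That second element is exactly what the bandit framework lacks.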
Sequential vs Independent Decisions
Let’s make the distinction concrete:
🎰 Bandit Problems
- Each decision is independent
- No concept of “state”
- Action → Reward, that’s it
- Goal: Find the best single action
🗺️ Sequential Decisions (MDPs)
- Decisions affect future situations
- Current state determines options
- Action → Reward + New State
- Goal: Find the best sequence of actions
When Do You Need an MDP?
Ask yourself these questions. If you answer “yes” to any of them, you’re dealing with a sequential decision problem:
- Does my action change what happens next?
- Do I need to think multiple steps ahead?
- Does the best choice now depend on where I’ll end up?
Formally, a bandit is actually a special case of an MDP with only one state. If the state space is S = {s} (a single state), then:
- Every action returns you to the same state
- No sequential structure exists
- It reduces to: pick the best arm
MDPs generalize this by allowing multiple states and transitions between them.
The Transition to MDPs
What we need is a framework that captures:
- States: Different situations the agent can be in
- Actions: Choices available in each state
- Transitions: How actions move you between states
- Rewards: Feedback for each transition
- Time: Discounting to handle infinite horizons
This is exactly what Markov Decision Processes provide.
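The five components above can be collected into a plain container. This is only a sketch of the structure (deterministic transitions, names of my own choosing), using the T-intersection example from earlier:

```python
# Sketch of the five MDP components as a plain container.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                       # S: situations the agent can be in
    actions: List[str]                      # A: choices available to the agent
    transition: Dict[Tuple[str, str], str]  # P: (s, a) -> s'  (deterministic sketch)
    reward: Dict[Tuple[str, str], float]    # R: (s, a) -> r
    gamma: float                            # discount factor for handling time

# The T-intersection example as a tiny deterministic MDP:
mdp = MDP(
    states=["junction", "dead_end", "path", "treasure"],
    actions=["left", "right", "forward"],
    transition={("junction", "left"): "dead_end",
                ("junction", "right"): "path",
                ("path", "forward"): "treasure"},
    reward={("junction", "left"): 5.0,
            ("junction", "right"): 1.0,
            ("path", "forward"): 100.0},
    gamma=0.9,
)
```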
The name “Markov Decision Process” comes from Andrey Markov, who studied stochastic processes with the “memoryless” property. We’ll explore what this means in the Markov Property section.
In the next section, we’ll formalize these five components and see how they work together.