The Markov Property
The “M” in MDP stands for Markov, named after mathematician Andrey Markov. The Markov property is the key assumption that makes reinforcement learning tractable.
The Core Idea
A state is Markov if and only if the future is independent of the past given the present:

$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \ldots, S_t]$$
In words: knowing the current state is enough to predict the next state. History doesn’t provide additional information.
Imagine you’re playing chess. You look at the board and see a specific configuration. To decide your next move and predict what might happen, do you need to know:
- How the pieces got to their current positions? No.
- What moves were played earlier? No.
- The current position of all pieces? Yes, that’s all you need.
The current board position is a Markov state—it contains all the information needed to predict the future.
Why It Matters
The Markov property is powerful because it dramatically simplifies reasoning:
❌ Without Markov Property
- Must track entire history
- State space grows exponentially with time
- Learning becomes intractable
- No simple recursive structure
✓ With Markov Property
- Only need current state
- State space is fixed
- Algorithms can be efficient
- Bellman equations become possible
The Markov property enables the Bellman equations, which we’ll study in the next chapter. These equations express values recursively:

$$v(s) = \mathbb{E}[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s]$$

This recursion only works because $P[S_{t+1} \mid S_t]$ depends only on the current state $S_t$, not on how we reached $S_t$. If history mattered, we’d need $v(s_1, s_2, \ldots, s_t)$—an intractable representation.
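To make the recursive structure concrete, here is a minimal sketch of a value computation on a hypothetical two-state Markov chain (the transition matrix `P`, rewards `R`, and discount `gamma` are made-up illustrative numbers). Because transitions depend only on the current state, repeatedly applying the Bellman backup converges to a fixed point:

```python
import numpy as np

# Hypothetical 2-state chain: P[s, s'] = transition probability, R[s] = reward.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, 0.0])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Bellman backup: V(s) = R(s) + γ Σ_{s'} P(s'|s) V(s')
    V = R + gamma * P @ V

# At the fixed point, V satisfies its own recursion.
residual = np.max(np.abs(V - (R + gamma * P @ V)))
print(residual < 1e-6)
```

If the state were not Markov, no fixed-size table `V` indexed by state alone could satisfy such a recursion—the value would depend on the whole history.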
Is the Real World Markov?
Strictly speaking, most real-world problems are not perfectly Markov. However, the Markov property isn’t about the world—it’s about our state representation.
Consider a thermostat controlling room temperature:
Non-Markov representation: State = current temperature (e.g., 70°F)
- Problem: Is the room heating up or cooling down? The same temperature can lead to different futures depending on recent history.
Markov representation: State = (current temperature, rate of change, HVAC status)
- Now the state contains enough information to predict what happens next.
The key insight: We can make any problem Markov by choosing a rich enough state representation.
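The thermostat example can be sketched in a few lines. The dynamics below are hypothetical (a simple linear drift), but they show why the same temperature reading leads to different futures depending on the rate of change:

```python
# Toy dynamics (assumed for illustration): temperature drifts by its rate.
def next_temp(temp, rate):
    """With state = (temperature, rate of change), the next temperature
    is fully determined; temperature alone would be ambiguous."""
    return temp + rate

# Two histories arriving at the same reading, 70°F:
heating = (70.0, +0.5)   # room warming up
cooling = (70.0, -0.5)   # room cooling down

print(next_temp(*heating))  # 70.5
print(next_temp(*cooling))  # 69.5
```

Same observation, different futures—so temperature alone is not Markov, while the pair (temperature, rate) is.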
Consider a robot that observes only its position, not velocity:
Observation: Robot at position (2, 3)
Is this Markov? To predict the next position, we need to know:
- Which direction is it moving?
- How fast?
Just knowing the position isn’t enough—the same position could lead to different futures.
Solution: Include velocity in the state, so the state becomes (position, velocity).
Now the state is Markov. This is why robotics and physics often use position + velocity (or position + momentum) as state.
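A minimal sketch of this Markov state, using a single Euler integration step (the dynamics and numbers are illustrative assumptions, not a specific robot model):

```python
import numpy as np

def step(state, accel, dt=0.1):
    """Euler step: state = (position, velocity), each a 2-vector.
    Given acceleration, (position, velocity) fully determines the next state."""
    pos, vel = state
    return (pos + vel * dt, vel + accel * dt)

# Same position (2, 3), different velocities → different futures.
s1 = (np.array([2.0, 3.0]), np.array([+1.0, 0.0]))
s2 = (np.array([2.0, 3.0]), np.array([-1.0, 0.0]))
a = np.zeros(2)

print(step(s1, a)[0])  # moves right
print(step(s2, a)[0])  # moves left
```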
State vs Observation
A crucial distinction:
- State: Complete information needed for Markov property to hold
- Observation: What the agent actually sees (might be incomplete)
When observations don’t capture the full state, we have a Partially Observable MDP (POMDP).
In poker:
- Full state: All cards in the deck, who has what
- Observation: Your own cards and visible community cards
You can’t see opponents’ cards, so your observation is not Markov. Different hidden cards lead to different futures from the same observation.
Solutions:
- Maintain a belief state (probability distribution over hidden states)
- Use history of observations as a proxy for state
- Learn to approximate the belief from observation sequences
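The first strategy—maintaining a belief state—amounts to a Bayes update over hidden states. Here is a minimal sketch with made-up numbers (two hypothetical hidden states, e.g. the opponent holding a strong vs. weak hand); the belief itself is Markov even though the raw observation is not:

```python
import numpy as np

def update_belief(belief, obs_likelihood):
    """Bayes update: belief[i] ∝ belief[i] * P(observation | hidden state i)."""
    posterior = belief * obs_likelihood
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])  # uniform prior over the two hidden states
# Hypothetical likelihood of the observed bet under each hidden state:
belief = update_belief(belief, np.array([0.8, 0.2]))
print(belief)  # shifted toward the "strong hand" hypothesis
```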
Designing Markov States
When designing an RL system, ask yourself: “Does my state representation contain everything needed to predict future rewards?”
If not, you need to expand the state.
Common strategies:
- Include derivatives: Position + velocity, not just position
- Include history: Stack recent observations (common in Atari games)
- Include hidden state: For recurrent systems, use LSTM/GRU hidden states
- Include relevant context: Time of day, user preferences, etc.
Here’s how game-playing agents often handle partially observable situations:
```python
import numpy as np

# Frame stacking: use the last few frames as the "state".
# Common for Atari games, where velocity matters.
def stack_frames(observation, frame_stack, n_frames=4):
    """Stack the last n frames to approximate a Markov state."""
    frame_stack.append(observation)
    if len(frame_stack) > n_frames:
        frame_stack.pop(0)
    # Pad if we don't have enough frames yet
    while len(frame_stack) < n_frames:
        frame_stack.insert(0, frame_stack[0])
    return np.stack(frame_stack, axis=-1)

# Example: Atari Breakout
# Single frame: can't tell if the ball is going up or down
# 4 stacked frames: ball position over time reveals velocity
state = stack_frames(current_frame, frame_history, n_frames=4)
```

This is why DQN and many Atari agents use 4 consecutive frames as input—a single frame isn’t Markov (you can’t tell which way the ball is moving), but 4 frames approximately are.
The Markov Assumption in Practice
In practice, we often proceed as if states are Markov, even when they’re not perfectly so:
- Good enough works: Small violations of Markov property often don’t matter much
- Approximate solutions: RL algorithms are robust to some non-Markovness
- Learned representations: Deep RL can learn to extract Markov-ish features
For theoretical guarantees, we need the exact Markov property. But practically:
$$P[S_{t+1} \mid S_t] \approx P[S_{t+1} \mid S_1, S_2, \ldots, S_t]$$

If this approximation is good, RL algorithms will work. How “good” is domain-dependent.
Summary
The Markov property states that the current state contains all information needed to predict the future:
- It’s not about the world—it’s about our state representation
- We can make states Markov by including enough information
- When observations aren’t Markov, we enter POMDP territory
- In practice, approximate Markovness often suffices
This property is what makes RL tractable. It enables the recursive structure (Bellman equations) that underlies all RL algorithms.
We’ll see the Markov property at work in the next chapter when we derive the Bellman equations. Without it, those elegant recursive formulations wouldn’t be possible.