The MDP Components
A Markov Decision Process is defined by five components, often written as a tuple:

$$(\mathcal{S}, \mathcal{A}, P, R, \gamma)$$
Let’s explore each one.
1. States ($\mathcal{S}$)
The set of all possible situations the agent can be in. We denote it $\mathcal{S}$ and individual states as $s \in \mathcal{S}$.
States answer the question: “Where am I?” or “What situation am I in?”
- In a maze: which cell you’re in
- In chess: the current board configuration
- In a dialogue system: the conversation history so far
- In robotics: joint angles, velocities, and sensor readings
States can be:
- Discrete: Finite or countable set (cells in a grid, chess positions)
- Continuous: Infinite, real-valued (robot arm angles, car velocity)
In a 4×4 GridWorld, there are 16 possible states. We can represent each as coordinates:

$$s = (\text{row}, \text{col}) \in \{0, 1, 2, 3\} \times \{0, 1, 2, 3\}$$

Or equivalently as integers 0-15. The representation doesn’t matter as long as it uniquely identifies each situation.
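The equivalence between the two representations can be sketched in a few lines; the helper names here are illustrative, not from any library:

```python
# Two equivalent representations of the 16 states in a 4x4 GridWorld:
# (row, col) coordinates, or a single integer 0..15.

GRID_SIZE = 4

def coord_to_index(row, col, size=GRID_SIZE):
    """Map a (row, col) coordinate to an integer state id in 0..size*size-1."""
    return row * size + col

def index_to_coord(index, size=GRID_SIZE):
    """Map an integer state id back to a (row, col) coordinate."""
    return divmod(index, size)

# Both representations uniquely identify the same 16 states.
assert coord_to_index(0, 0) == 0
assert coord_to_index(3, 3) == 15
assert index_to_coord(5) == (1, 1)
```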
2. Actions ($\mathcal{A}$)
The set of choices available to the agent. We denote it $\mathcal{A}$, and the action taken at time $t$ as $a_t$.
Actions answer: “What can I do?”
- In a maze: Up, Down, Left, Right
- In chess: All legal moves from current position
- In robotics: Motor torques or velocity commands
- In dialogue: What response to generate
In some MDPs, available actions depend on the current state. We write $\mathcal{A}(s)$ for the actions available in state $s$. In chess, legal moves depend on the board position.
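A state-dependent action set $\mathcal{A}(s)$ can be sketched as a simple lookup; the states and moves below are illustrative, not from the source:

```python
# Sketch of state-dependent action sets A(s) for a few GridWorld cells.
# Edge and corner cells simply omit moves that would leave the grid.

ACTIONS = {
    (0, 0): ["right", "down"],                 # top-left corner: no up, no left
    (0, 1): ["left", "right", "down"],         # top edge: no up
    (1, 1): ["up", "down", "left", "right"],   # interior cell: all four moves
}

def available_actions(state):
    """Return A(s), the set of actions legal in state s."""
    return ACTIONS[state]
```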
3. Transition Dynamics ($P$)
The probability of reaching state $s'$ when taking action $a$ in state $s$:

$$P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$$
The transition function answers: “If I do this, where might I end up?”
Transitions can be:
- Deterministic: One action always leads to the same next state
- Stochastic: There’s randomness in where you end up
Imagine a robot on an icy floor. When it tries to move right:
- 80% chance: it actually moves right
- 10% chance: it slips and stays in place
- 10% chance: it slips and moves down
This is captured by the transition probabilities (writing $s_{\text{right}}$ and $s_{\text{down}}$ for the cells to the right of and below $s$):

$$P(s_{\text{right}} \mid s, \text{right}) = 0.8, \quad P(s \mid s, \text{right}) = 0.1, \quad P(s_{\text{down}} \mid s, \text{right}) = 0.1$$

For any state-action pair, the transition probabilities must form a valid distribution:

$$\sum_{s'} P(s' \mid s, a) = 1$$
The transition function fully describes the environment’s dynamics. In model-free RL, we don’t know $P$—we must learn from experience. In model-based RL, we either know or learn $P$.
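The icy-floor dynamics above can be sketched as a small table of probabilities, together with the validity check and a sampler; the state names are illustrative:

```python
import random

# Sketch of the icy-floor dynamics: trying to move right succeeds 80%
# of the time, slips in place 10%, and slips down 10%.

transitions = {
    ("s", "right"): [("s_right", 0.8), ("s", 0.1), ("s_down", 0.1)],
}

def is_valid_distribution(outcomes, tol=1e-9):
    """Check that the probabilities for one (s, a) pair sum to 1."""
    return abs(sum(p for _, p in outcomes) - 1.0) < tol

def sample_next_state(outcomes, rng=random):
    """Draw a next state s' according to P(s' | s, a)."""
    states, probs = zip(*outcomes)
    return rng.choices(states, weights=probs, k=1)[0]

assert is_valid_distribution(transitions[("s", "right")])
```

Sampling from `transitions` like this is exactly what a model-free agent experiences: it sees draws from $P$ without ever seeing the probabilities themselves.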
4. Reward Function ($R$)
The feedback signal that tells the agent how good a transition was. Common formulations include:
- $R(s, a, s')$: Reward depends on state, action, and next state
- $R(s, a)$: Reward depends on state and action
- $R(s)$: Reward depends only on state
Rewards answer: “How did that go?”
They encode the goal:
- Reaching the goal: +10
- Falling in a pit: -10
- Each step: -1 (encourages efficiency)
- Winning the game: +1
- Losing: -1
A typical GridWorld might have:
- $R(s) = -1$ for all non-terminal states (step penalty)
- $R(\text{goal}) = +10$ (reward for reaching the goal)
- $R(\text{pit}) = -10$ (penalty for falling in a pit)
The step penalty encourages finding the shortest path—otherwise, the agent might wander forever.
Reward design is crucial and tricky. The agent will maximize whatever reward you give it—which might not be what you intended. This is a core challenge in RL.
When we write the expected reward:

$$r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$

This averages over the randomness in transitions. If rewards depend on $s'$:

$$r(s, a) = \sum_{s'} P(s' \mid s, a)\, R(s, a, s')$$
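This averaging is a one-line computation. The sketch below reuses the icy-floor transition probabilities with made-up reward values; all names are illustrative:

```python
# Compute r(s, a) by averaging R(s, a, s') over P(s' | s, a).

P = {("s", "right"): {"s_right": 0.8, "s": 0.1, "s_down": 0.1}}
R = {
    ("s", "right", "s_right"): 10.0,  # reached the intended cell
    ("s", "right", "s"): -1.0,        # slipped in place
    ("s", "right", "s_down"): -1.0,   # slipped down
}

def expected_reward(s, a):
    """r(s, a) = sum over s' of P(s' | s, a) * R(s, a, s')."""
    return sum(p * R[(s, a, s2)] for s2, p in P[(s, a)].items())

# 0.8 * 10 + 0.1 * (-1) + 0.1 * (-1) = 7.8
```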
5. Discount Factor ($\gamma$)
A number $\gamma \in [0, 1]$ that determines how much we value future rewards relative to immediate ones. The discounted return is:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The discount factor answers: “How much do I care about the future?”
- $\gamma = 0$: Completely myopic—only care about immediate reward
- $\gamma = 0.9$: A reward 10 steps away is worth about 35% of an immediate reward ($0.9^{10} \approx 0.35$)
- $\gamma = 0.99$: A reward 100 steps away is worth about 37% of an immediate reward ($0.99^{100} \approx 0.37$)
- $\gamma = 1$: Future rewards matter just as much as immediate ones
Consider an agent that gets +1 reward every step forever. Without discounting:

$$G_t = 1 + 1 + 1 + \cdots = \infty$$

Every state would have infinite value! Discounting gives us:

$$G_t = 1 + \gamma + \gamma^2 + \cdots = \frac{1}{1 - \gamma}$$

With $\gamma = 0.9$, this is a finite value of 10.
The discount factor serves multiple purposes:
- Mathematical: Ensures finite returns for infinite-horizon problems
- Economic: Models uncertainty (future might not happen)
- Practical: Bounds how far ahead we need to plan
The geometric series gives us:

$$\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma} \quad \text{for } \gamma < 1$$

If rewards are bounded by $|R_t| \le R_{\max}$, then returns are bounded by $|G_t| \le \frac{R_{\max}}{1 - \gamma}$.
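The bound is easy to verify numerically. The sketch below sums a long but finite reward sequence of constant +1 rewards with $\gamma = 0.9$ and checks it against the limit $1/(1-\gamma) = 10$; the function name is illustrative:

```python
# Verify the geometric-series bound on discounted returns.

def discounted_return(rewards, gamma):
    """G = sum_k gamma^k * r_k for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

gamma = 0.9
partial = discounted_return([1.0] * 200, gamma)  # long finite horizon
bound = 1.0 / (1.0 - gamma)                      # limit of the infinite series

assert partial < bound            # finite sums stay under the bound
assert bound - partial < 1e-6     # and are already very close by 200 steps
```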
Episodic vs Continuing Tasks
- Episodic: Has a natural end (terminal state). Examples: games, robot reaching a goal
- Continuing: Goes on forever. Examples: temperature control, stock trading
For episodic tasks, we can use $\gamma = 1$ since returns are naturally finite. For continuing tasks, we need $\gamma < 1$ to ensure convergence.
A common choice is $\gamma = 0.99$. It’s close enough to 1 that the agent cares about the future, but bounded enough for stable learning.
Putting It Together
Here’s how we might represent a simple MDP in Python:
```python
class MDP:
    """A simple Markov Decision Process."""

    def __init__(self, states, actions, transitions, rewards, gamma=0.99):
        self.states = states            # List of state names
        self.actions = actions          # List of action names
        self.transitions = transitions  # P[s][a] = [(s', prob), ...]
        self.rewards = rewards          # Dict R[s] -> float, or callable R(s, a, s')
        self.gamma = gamma

    def get_transitions(self, state, action):
        """Return list of (next_state, probability) tuples."""
        return self.transitions[state][action]

    def get_reward(self, state, action=None, next_state=None):
        """Return the reward for a transition."""
        if callable(self.rewards):
            return self.rewards(state, action, next_state)
        return self.rewards.get(state, 0)


# Example: Simple 3-state MDP
mdp = MDP(
    states=['start', 'middle', 'goal'],
    actions=['left', 'right'],
    transitions={
        'start': {
            'right': [('middle', 1.0)],
            'left': [('start', 1.0)],  # Wall
        },
        'middle': {
            'right': [('goal', 0.8), ('start', 0.2)],  # Stochastic!
            'left': [('start', 1.0)],
        },
        'goal': {
            'right': [('goal', 1.0)],  # Terminal
            'left': [('goal', 1.0)],
        },
    },
    rewards={'goal': 10, 'start': -1, 'middle': -1},
    gamma=0.99,
)
```

Summary
Every MDP is defined by these five components: states $\mathcal{S}$, actions $\mathcal{A}$, transition dynamics $P$, rewards $R$, and discount factor $\gamma$, written as the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$.
Together, they provide a complete mathematical description of a sequential decision problem. In the next section, we’ll explore the crucial Markov property that makes this framework tractable.