Markov Decision Processes • Part 2 of 3
📝Draft

The MDP Components

States, actions, transitions, rewards, and discount factors

A Markov Decision Process is defined by five components, often written as a tuple:

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$

Let’s explore each one.

1. States ($\mathcal{S}$)

📖State Space

The set of all possible situations the agent can be in. We denote it $\mathcal{S}$ and individual states as $s \in \mathcal{S}$.

States answer the question: “Where am I?” or “What situation am I in?”

  • In a maze: which cell you’re in
  • In chess: the current board configuration
  • In a dialogue system: the conversation history so far
  • In robotics: joint angles, velocities, and sensor readings

States can be:

  • Discrete: Finite or countable set (cells in a grid, chess positions)
  • Continuous: Infinite, real-valued (robot arm angles, car velocity)
📌GridWorld States

In a 4×4 GridWorld, there are 16 possible states. We can represent each as coordinates:

$$\mathcal{S} = \{(0,0), (0,1), \ldots, (3,3)\}$$

Or equivalently as integers 0-15. The representation doesn’t matter as long as it uniquely identifies each situation.
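To make this concrete, here is a small sketch of both representations of the same 16 states (the `coord_to_index` helper is illustrative, not a standard API):

```python
# Two equivalent representations of the 4x4 GridWorld state space.
GRID_SIZE = 4

# As (row, col) coordinates:
coord_states = [(r, c) for r in range(GRID_SIZE) for c in range(GRID_SIZE)]

# As integers, via a simple row-major mapping:
def coord_to_index(r, c, size=GRID_SIZE):
    """Map a (row, col) coordinate to its integer state id."""
    return r * size + c

assert len(coord_states) == 16
assert coord_to_index(3, 3) == 15  # last cell <-> last integer id
```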

2. Actions ($\mathcal{A}$)

📖Action Space

The set of choices available to the agent. We denote it $\mathcal{A}$, and the action at time $t$ as $A_t$.

Actions answer: “What can I do?”

  • In a maze: Up, Down, Left, Right
  • In chess: All legal moves from current position
  • In robotics: Motor torques or velocity commands
  • In dialogue: What response to generate
ℹ️Note

In some MDPs, available actions depend on the current state. We write $\mathcal{A}(s)$ for the actions available in state $s$. In chess, legal moves depend on the board position.

3. Transition Dynamics ($P$)

📖Transition Function

The probability of reaching state $s'$ when taking action $a$ in state $s$:

$$P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$$

The transition function answers: “If I do this, where might I end up?”

Transitions can be:

  • Deterministic: One action always leads to the same next state
  • Stochastic: There’s randomness in where you end up
📌Slippery Ice

Imagine a robot on an icy floor. When it tries to move right:

  • 80% chance: it actually moves right
  • 10% chance: it slips and stays in place
  • 10% chance: it slips and moves down

This is captured by the transition probabilities:

  • $P(\text{right cell} \mid s, \text{move-right}) = 0.8$
  • $P(s \mid s, \text{move-right}) = 0.1$
  • $P(\text{down cell} \mid s, \text{move-right}) = 0.1$
Mathematical Details

For any state-action pair, the transition probabilities must form a valid distribution:

$$\sum_{s' \in \mathcal{S}} P(s' \mid s, a) = 1 \quad \text{for all } s, a$$

The transition function fully describes the environment’s dynamics. In model-free RL, we don’t know $P$; we must learn from experience. In model-based RL, we either know or learn $P$.
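Here is a quick sketch of how the slippery-ice transitions might be stored, plus a check of the normalization condition above (the nested-dict layout and state names are just one possible choice):

```python
# Transition table stored as P[s][a] = {s': prob}.
# Entries mirror the slippery-ice example; state names are illustrative.
P = {
    'cell': {
        'move-right': {'right cell': 0.8, 'cell': 0.1, 'down cell': 0.1},
    },
}

# Validity check: for every (s, a) pair, probabilities must sum to 1.
for s, actions in P.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```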

4. Reward Function ($R$)

📖Reward Function

The feedback signal that tells the agent how good a transition was. Common formulations include:

  • $R(s, a, s')$: Reward depends on state, action, and next state
  • $R(s, a)$: Reward depends on state and action
  • $R(s)$: Reward depends only on state

Rewards answer: “How did that go?”

They encode the goal:

  • Reaching the goal: +10
  • Falling in a pit: -10
  • Each step: -1 (encourages efficiency)
  • Winning the game: +1
  • Losing: -1
📌GridWorld Rewards

A typical GridWorld might have:

  • $R(s) = -1$ for all non-terminal states (step penalty)
  • $R(\text{goal}) = +10$ (reward for reaching the goal)
  • $R(\text{pit}) = -10$ (penalty for falling in a pit)

The step penalty encourages finding the shortest path—otherwise, the agent might wander forever.

Mathematical Details

When we write the expected reward:

$$R(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$

This averages over the randomness in transitions. If rewards depend on $s'$:

$$R(s, a) = \sum_{s'} P(s' \mid s, a) \, R(s, a, s')$$
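As a quick numeric sketch, here is that sum evaluated for the slippery-ice transitions (the reward values are made up for illustration):

```python
# Expected reward R(s, a) = sum over s' of P(s'|s,a) * R(s,a,s').
P_sa = {'right cell': 0.8, 'stay': 0.1, 'down cell': 0.1}       # P(s' | s, a)
R_sas = {'right cell': -1.0, 'stay': -1.0, 'down cell': -10.0}  # R(s, a, s')

expected_reward = sum(P_sa[s2] * R_sas[s2] for s2 in P_sa)
assert abs(expected_reward - (-1.9)) < 1e-9  # 0.8*(-1) + 0.1*(-1) + 0.1*(-10)
```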

5. Discount Factor ($\gamma$)

📖Discount Factor

A number $\gamma \in [0, 1]$ that determines how much we value future rewards relative to immediate ones. The discounted return is:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount factor answers: “How much do I care about the future?”

  • $\gamma = 0$: Completely myopic; only care about immediate reward
  • $\gamma = 0.9$: A reward 10 steps away is worth about 35% of an immediate reward
  • $\gamma = 0.99$: A reward 100 steps away is worth about 37% of an immediate reward
  • $\gamma = 1$: Future rewards matter just as much as immediate ones
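The 35% and 37% figures above are easy to verify:

```python
# gamma**k tells us what a reward k steps away is worth today.
assert round(0.9 ** 10, 2) == 0.35    # gamma = 0.9, reward 10 steps away
assert round(0.99 ** 100, 2) == 0.37  # gamma = 0.99, reward 100 steps away
```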
📌Why Discount?

Consider an agent that gets +1 reward every step forever. Without discounting:

$$G_t = 1 + 1 + 1 + \ldots = \infty$$

Every state would have infinite value! Discounting gives us:

$$G_t = 1 + \gamma + \gamma^2 + \ldots = \frac{1}{1 - \gamma}$$

With $\gamma = 0.9$, this is a finite value of 10.
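We can check this numerically by truncating the infinite sum (1000 terms is plenty, since the terms shrink geometrically):

```python
# Discounted return of +1 reward per step, truncated at 1000 terms.
gamma = 0.9
G = sum(gamma ** k for k in range(1000))
assert abs(G - 1 / (1 - gamma)) < 1e-9  # matches 1/(1-gamma) = 10
```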

Mathematical Details

The discount factor serves multiple purposes:

  1. Mathematical: Ensures finite returns for infinite-horizon problems
  2. Economic: Models uncertainty (future might not happen)
  3. Practical: Bounds how far ahead we need to plan

The geometric series gives us:

$$\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma} \quad \text{when } |\gamma| < 1$$

If rewards are bounded by $R_{\max}$, then returns are bounded by $\frac{R_{\max}}{1-\gamma}$.

Episodic vs Continuing Tasks

📖Task Types
  • Episodic: Has a natural end (terminal state). Examples: games, robot reaching a goal
  • Continuing: Goes on forever. Examples: temperature control, stock trading

For episodic tasks, we can use $\gamma = 1$ since returns are naturally finite. For continuing tasks, we need $\gamma < 1$ to ensure convergence.

💡Tip

A common choice is $\gamma = 0.99$. It’s close enough to 1 that the agent cares about the future, but bounded enough for stable learning.

Putting It Together

</>Implementation

Here’s how we might represent a simple MDP in Python:

class MDP:
    """A simple Markov Decision Process."""

    def __init__(self, states, actions, transitions, rewards, gamma=0.99):
        self.states = states          # List of state names
        self.actions = actions        # List of action names
        self.transitions = transitions  # P[s][a] = [(s', prob), ...]
        self.rewards = rewards        # Dict R[s] -> float, or callable R(s, a, s')
        self.gamma = gamma

    def get_transitions(self, state, action):
        """Return list of (next_state, probability) tuples."""
        return self.transitions[state][action]

    def get_reward(self, state, action=None, next_state=None):
        """Return the reward for a transition."""
        if callable(self.rewards):
            return self.rewards(state, action, next_state)
        return self.rewards.get(state, 0)


# Example: Simple 3-state MDP
mdp = MDP(
    states=['start', 'middle', 'goal'],
    actions=['left', 'right'],
    transitions={
        'start': {
            'right': [('middle', 1.0)],
            'left': [('start', 1.0)],  # Wall
        },
        'middle': {
            'right': [('goal', 0.8), ('start', 0.2)],  # Stochastic!
            'left': [('start', 1.0)],
        },
        'goal': {
            'right': [('goal', 1.0)],  # Terminal
            'left': [('goal', 1.0)],
        },
    },
    rewards={'goal': 10, 'start': -1, 'middle': -1},
    gamma=0.99
)
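One way to exercise an MDP like this is to roll out an episode by sampling successors from the transition lists. Here is a minimal sketch (the `sample_next_state` helper is illustrative; it repeats the example’s `transitions` table so it runs standalone):

```python
import random

# The same transition table as the example MDP above.
transitions = {
    'start':  {'right': [('middle', 1.0)], 'left': [('start', 1.0)]},
    'middle': {'right': [('goal', 0.8), ('start', 0.2)], 'left': [('start', 1.0)]},
    'goal':   {'right': [('goal', 1.0)], 'left': [('goal', 1.0)]},
}

def sample_next_state(state, action):
    """Sample s' ~ P(. | s, a) from a list of (next_state, prob) pairs."""
    next_states, probs = zip(*transitions[state][action])
    return random.choices(next_states, weights=probs)[0]

random.seed(0)
state, trajectory = 'start', ['start']
while state != 'goal':                         # roll out to the terminal state
    state = sample_next_state(state, 'right')  # always choose 'right'
    trajectory.append(state)
# trajectory runs from 'start' to 'goal', possibly slipping back en route
```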

Summary

Every MDP is defined by these five components:

  • States ($\mathcal{S}$): Where am I?
  • Actions ($\mathcal{A}$): What can I do?
  • Transitions ($P$): Where might I end up?
  • Rewards ($R$): How did that go?
  • Discount ($\gamma$): How much do I care about the future?

Together, they provide a complete mathematical description of a sequential decision problem. In the next section, we’ll explore the crucial Markov property that makes this framework tractable.