Markov Decision Processes • Part 2 of 3
📝Draft

Action Value Functions

How good is it to take an action in a state?

Action Value Functions

State values tell us how good it is to be somewhere. But to actually make decisions, we need to know how good each action is. That’s what action-value functions (Q-functions) provide.

The Q-function is perhaps the most important concept in reinforcement learning. It’s the foundation of Q-learning, DQN, and many other algorithms.

The Question Q-Values Answer

State values answer: “How good is this state?”

Action values answer a more directly useful question: “How good is this action in this state?”

📖Action-Value Function

The action-value function $Q^\pi(s,a)$ gives the expected return when starting in state $s$, taking action $a$, and then following policy $\pi$ thereafter:

$$Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Q-values let you compare actions directly. If you’re in a state and wondering “should I go left or right?”, you can just look at the Q-values:

  • $Q(s, \text{left}) = 5.2$
  • $Q(s, \text{right}) = 8.7$

Right is better! The Q-values give you an immediate answer without needing to simulate forward or know the MDP dynamics.

Think of Q-values as an action rating system. Each action gets a score based on where it leads and what follows.

Why Q-Values Matter for Control

ℹ️Note

Here’s the key insight: V-values tell you where to be, but Q-values tell you what to do.

To use V-values for action selection, you need to know where each action leads (the transition probabilities). To use Q-values, you just need to compare numbers.

This is why Q-learning and its variants are so popular: they learn Q-values directly, enabling action selection without a model of the environment.

📌V-values vs Q-values for Decision Making

Suppose you’re in state $s$ with two actions available: left and right.

Using V-values (requires knowing the MDP):

  • Left leads to state $s_1$ with probability 0.8, state $s_2$ with probability 0.2
  • Right leads to state $s_3$ with probability 1.0
  • Must compute: $Q(s, \text{left}) = R(s, \text{left}) + \gamma\,(0.8 \cdot V(s_1) + 0.2 \cdot V(s_2))$
  • Then compare with $Q(s, \text{right}) = R(s, \text{right}) + \gamma \cdot V(s_3)$

Using Q-values (no model needed):

  • Look up $Q(s, \text{left}) = 7.3$
  • Look up $Q(s, \text{right}) = 8.1$
  • Right is better. Done!

Q-values have “pre-computed” the effect of the transitions.
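To make the contrast concrete, here’s a minimal sketch of both routes in Python. The state values, transition probabilities, and rewards below are made up for illustration; they stand in for a real MDP:

```python
gamma = 0.9

# Route 1: V-values plus a model of the MDP.
V = {'s1': 5.0, 's2': 2.0, 's3': 7.5}               # hypothetical state values
transitions = {'left': [('s1', 0.8), ('s2', 0.2)],  # P(s'|s, a) for each action
               'right': [('s3', 1.0)]}
rewards = {'left': 0.5, 'right': 1.0}               # hypothetical R(s, a)

def q_from_model(action):
    """Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')."""
    return rewards[action] + gamma * sum(p * V[s2] for s2, p in transitions[action])

best_via_v = max(transitions, key=q_from_model)  # needs the full model to decide

# Route 2: stored Q-values -- compare two numbers, no model in sight.
Q = {'left': 7.3, 'right': 8.1}
best_via_q = max(Q, key=Q.get)
```

Both routes pick the same action; the difference is what you must know to run them.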

The V-Q Relationship

State values and action values are intimately connected. Understanding this relationship is crucial for RL.

Mathematical Details

From Q to V:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \cdot Q^\pi(s, a)$$

The state value is the weighted average of action values, weighted by the policy’s action probabilities. If the policy always takes action $a^*$, then $V(s) = Q(s, a^*)$.

From V to Q:

$$Q^\pi(s, a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \cdot V^\pi(s')$$

The action value is the immediate reward plus the discounted value of the expected next state. This requires knowing the transition probabilities $P(s'|s,a)$.

Think of it this way:

  • V-values compress your situation into one number
  • Q-values give you a number for each possible move from that situation
  • $V(s)$ is like your “average prospects” from state $s$
  • $Q(s,a)$ is your “prospects if you take this specific action”

If you know all the Q-values at a state, you can compute V by averaging. If you know V and the MDP dynamics, you can compute Q.
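A minimal sketch of the two conversions (the Q-values, policy, and transition below are invented for illustration):

```python
def v_from_q(q_s, pi_s):
    """V(s) = sum_a pi(a|s) * Q(s, a): average the Q-values under the policy."""
    return sum(pi_s[a] * q for a, q in q_s.items())

def q_from_v(reward, transitions, V, gamma=0.9):
    """Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s'): needs the model."""
    return reward + gamma * sum(p * V[s_next] for s_next, p in transitions)

q_s = {'left': 5.2, 'right': 8.7}    # hypothetical Q-values at one state
pi_s = {'left': 0.5, 'right': 0.5}   # a 50/50 policy
print(v_from_q(q_s, pi_s))           # 0.5*5.2 + 0.5*8.7 = 6.95

# One deterministic transition to a state worth 10:
print(q_from_v(0.0, [('s1', 1.0)], {'s1': 10.0}))   # 0 + 0.9*10 = 9.0
```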

Computing Q-Values: A Worked Example

📌Q-Value Calculation in GridWorld

Consider a simple 2x2 GridWorld:

 _____ _____
|     |     |
| S   | +5  |
|_____|_____|
|     |     |
| -1  | G   |
|_____|_____|

S = Start (top-left)
G = Goal with +10 reward (bottom-right)
+5 = Bonus cell
-1 = Penalty cell
Step cost: -0.1
γ = 0.9

Let’s compute Q-values for state S, assuming optimal play after the first move. From S, we can go right or down.

Q(S, right):

  • Immediate reward: $-0.1$ (step) $+ 5$ (bonus) $= 4.9$
  • Next state: top-right cell
  • From there, optimal play reaches goal in one step
  • $V(\text{top-right}) \approx -0.1 + 0.9 \times 10 = 8.9$
  • So: $Q(S, \text{right}) = 4.9 + 0.9 \times 8.9 = 12.91$

Q(S, down):

  • Immediate reward: $-0.1$ (step) $- 1$ (penalty) $= -1.1$
  • Next state: bottom-left cell
  • From there, optimal play reaches goal in one step
  • $V(\text{bottom-left}) \approx -0.1 + 0.9 \times 10 = 8.9$
  • So: $Q(S, \text{down}) = -1.1 + 0.9 \times 8.9 = 6.91$

Conclusion: Going right ($Q = 12.91$) is much better than going down ($Q = 6.91$). The bonus on the right path makes a big difference.
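The hand calculation above is short enough to replay in code. This sketch just mirrors the example’s numbers rather than solving the grid generally:

```python
gamma = 0.9

# Value of a cell one optimal step from the goal: step cost, then discounted +10
v_next = -0.1 + gamma * 10           # 8.9, same for top-right and bottom-left

r_right = -0.1 + 5                   # step cost plus the +5 bonus -> 4.9
q_right = r_right + gamma * v_next   # 4.9 + 0.9*8.9 = 12.91

r_down = -0.1 - 1                    # step cost plus the -1 penalty -> -1.1
q_down = r_down + gamma * v_next     # -1.1 + 0.9*8.9 = 6.91

print(q_right, q_down)
```

Since both actions lead to equally valuable cells here, the Q-gap is exactly the immediate-reward gap of 6.0.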

Q-Values for Different Policies

Like state values, Q-values depend on the policy being followed after the initial action.

📌Same Action, Different Q-Values

Consider state $s$ where action “right” leads to state $s'$, which offers a choice between a safe path and a risky path.

Under a cautious policy (always takes the safe path from $s'$):

  • $Q^{\text{cautious}}(s, \text{right}) = R + \gamma \cdot V^{\text{cautious}}(s') = 0 + 0.9 \times 5 = 4.5$

Under a risky policy (takes the risky path from $s'$, 50% chance of +20, 50% chance of -10):

  • $Q^{\text{risky}}(s, \text{right}) = R + \gamma \cdot V^{\text{risky}}(s') = 0 + 0.9 \times 5 = 4.5$

Wait, they’re the same! That’s because $V^{\text{risky}}(s') = 0.5 \times 20 + 0.5 \times (-10) = 5$ happens to equal $V^{\text{cautious}}(s')$.

But if the risky path had different probabilities, the Q-values would differ even for the same initial action.
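In code, with the example’s numbers (the 60/40 variant at the end is an invented perturbation to show how the Q-values diverge):

```python
gamma = 0.9

# The follow-up policy determines V(s'), and therefore Q(s, right):
v_cautious = 5.0                     # safe path: a certain +5
v_risky = 0.5 * 20 + 0.5 * (-10)     # also 5.0, by coincidence

q_cautious = 0 + gamma * v_cautious  # 4.5
q_risky = 0 + gamma * v_risky        # 4.5 -- identical, as in the example

# Nudge the risky path's odds and the same initial action scores differently:
v_risky_60 = 0.6 * 20 + 0.4 * (-10)  # 8.0
q_risky_60 = 0 + gamma * v_risky_60  # 7.2
```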

Q-Values and Optimal Action Selection

Here’s the beautiful result that makes Q-learning possible:

Mathematical Details

If you know $Q^*(s, a)$ (the optimal Q-values), finding the optimal action is trivial:

$$\pi^*(s) = \arg\max_a Q^*(s, a)$$

Just pick the action with the highest Q-value. No need to know transitions, no need to plan ahead. The Q-values have already captured everything about future rewards.

This is why so much of RL research focuses on learning good Q-value estimates.

Q-Values for Stochastic Environments

When the environment is stochastic, Q-values average over possible outcomes:

Mathematical Details

With stochastic transitions:

$$Q^\pi(s, a) = \sum_{s'} P(s'|s,a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$

Or equivalently, writing $R(s,a) = \sum_{s'} P(s'|s,a)\, R(s,a,s')$ for the expected immediate reward:

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s'|s,a)\, V^\pi(s')$$

The Q-value for an action is an expected value over where that action might lead.

📌Stochastic Q-Values

Action “attempt-right” in state $s$:

  • 80% chance: actually move right (reward: +1, next state value: 10)
  • 20% chance: slip and stay in place (reward: 0, next state value: 5)

With $\gamma = 0.9$:

$$Q(s, \text{attempt-right}) = 0.8 \times (1 + 0.9 \times 10) + 0.2 \times (0 + 0.9 \times 5) = 0.8 \times 10 + 0.2 \times 4.5 = 8.9$$

The Q-value captures the expected outcome, accounting for the uncertainty in the environment.
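This expectation is a one-liner in code; the outcome list below just encodes the example’s probabilities, rewards, and next-state values:

```python
gamma = 0.9

# (probability, reward, V of the resulting state) for each outcome of "attempt-right"
outcomes = [
    (0.8, 1.0, 10.0),  # moved right as intended
    (0.2, 0.0, 5.0),   # slipped and stayed put
]

# Q(s, a) = sum over outcomes of P * (R + gamma * V(s'))
q = sum(p * (r + gamma * v) for p, r, v in outcomes)
print(q)   # 0.8*10 + 0.2*4.5 = 8.9
```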

Implementation

</>Implementation

Here’s how to work with Q-values in Python:

import numpy as np
from collections import defaultdict

# Q-values as a nested dictionary: Q[state][action] = value
Q = defaultdict(lambda: defaultdict(float))

# Set some Q-values
Q['start']['left'] = 3.2
Q['start']['right'] = 7.8
Q['start']['up'] = 2.1

def get_best_action(Q, state):
    """Return the action with highest Q-value for a state."""
    actions = Q[state]
    if not actions:
        return None
    return max(actions, key=lambda a: actions[a])

def get_state_value(Q, state, policy=None):
    """
    Compute V(s) from Q-values.

    If policy is None, use greedy (max) policy.
    Otherwise, policy[state] should be a dict of {action: probability}.
    """
    actions = Q[state]
    if not actions:
        return 0.0

    if policy is None:
        # Greedy: V(s) = max_a Q(s,a)
        return max(actions.values())
    else:
        # Stochastic: V(s) = sum_a π(a|s) * Q(s,a)
        probs = policy.get(state, {})
        return sum(probs.get(a, 0) * q for a, q in actions.items())


# Example: Greedy action selection
best = get_best_action(Q, 'start')
print(f"Best action from 'start': {best}")  # right

# Example: State value under greedy policy
v = get_state_value(Q, 'start')
print(f"V(start) under greedy policy: {v}")  # 7.8

# Example: State value under uniform policy
uniform_policy = {
    'start': {'left': 1/3, 'right': 1/3, 'up': 1/3}
}
v_uniform = get_state_value(Q, 'start', uniform_policy)
print(f"V(start) under uniform policy: {v_uniform:.2f}")  # 4.37

Monte Carlo Q-Value Estimation

</>Implementation

Just as we can estimate V-values from episodes, we can estimate Q-values:

import numpy as np
from collections import defaultdict

def estimate_q_values_mc(episodes, gamma=0.99):
    """
    Estimate Q(s,a) using Monte Carlo from episodes.

    Each episode is a list of (state, action, reward) tuples.
    """
    returns = defaultdict(list)  # returns[(s,a)] = list of observed returns

    for episode in episodes:
        # Calculate returns for each (state, action) pair
        G = 0
        # Work backward through the episode
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = reward + gamma * G
            returns[(state, action)].append(G)

    # Average the returns for each (state, action)
    Q = defaultdict(dict)
    for (state, action), obs_returns in returns.items():
        Q[state][action] = np.mean(obs_returns)

    return Q


# Example episodes
episodes = [
    # Episode 1: start --right--> middle --right--> goal
    [('start', 'right', -1), ('middle', 'right', 10)],
    # Episode 2: start --down--> bottom --right--> goal
    [('start', 'down', -1), ('bottom', 'right', 10)],
    # Episode 3: start --right--> middle --right--> goal
    [('start', 'right', -1), ('middle', 'right', 10)],
]

Q = estimate_q_values_mc(episodes, gamma=0.9)

print("Estimated Q-values:")
for state in sorted(Q.keys()):
    for action, value in Q[state].items():
        print(f"  Q({state}, {action}) = {value:.2f}")

Output:

Estimated Q-values:
  Q(bottom, right) = 10.00
  Q(middle, right) = 10.00
  Q(start, down) = 8.00
  Q(start, right) = 8.00

Note: Q(start, right) = -1 + 0.9 * 10 = 8.0, which matches our estimate!

Q-Tables and Visualization

For small, discrete problems, Q-values are often stored in a Q-table:

</>Implementation
import numpy as np

class QTable:
    """A simple Q-table for discrete state and action spaces."""

    def __init__(self, n_states, n_actions, init_value=0.0):
        self.table = np.full((n_states, n_actions), init_value)

    def __getitem__(self, key):
        state, action = key
        return self.table[state, action]

    def __setitem__(self, key, value):
        state, action = key
        self.table[state, action] = value

    def get_best_action(self, state):
        """Return action with highest Q-value."""
        return np.argmax(self.table[state])

    def get_action_values(self, state):
        """Return Q-values for all actions in a state."""
        return self.table[state].copy()

    def visualize(self, state_names=None, action_names=None):
        """Print the Q-table nicely formatted."""
        n_states, n_actions = self.table.shape

        if state_names is None:
            state_names = [f"s{i}" for i in range(n_states)]
        if action_names is None:
            action_names = [f"a{i}" for i in range(n_actions)]

        # Header
        header = "State | " + " | ".join(f"{a:>8}" for a in action_names)
        print(header)
        print("-" * len(header))

        # Rows
        for i, state in enumerate(state_names):
            values = " | ".join(f"{self.table[i,j]:>8.2f}"
                              for j in range(n_actions))
            print(f"{state:>5} | {values}")


# Example: 4-state, 2-action Q-table
Q = QTable(4, 2)
Q[0, 0] = 5.2   # Q(state0, action0)
Q[0, 1] = 8.7   # Q(state0, action1)
Q[1, 0] = 3.1
Q[1, 1] = 3.5
Q[2, 0] = 6.0
Q[2, 1] = 5.8
Q[3, 0] = 9.9
Q[3, 1] = 10.0  # Goal state

Q.visualize(
    state_names=['Start', 'Left', 'Right', 'Goal'],
    action_names=['Left', 'Right']
)

Output:

State |     Left |    Right
-------------------------------
Start |     5.20 |     8.70
 Left |     3.10 |     3.50
Right |     6.00 |     5.80
 Goal |     9.90 |    10.00

The best action in each state is the one with the higher value (shown by comparing columns).

Common Patterns with Q-Values

💡Tip

Pattern 1: Greedy Policy Extraction

def greedy_policy(Q, state):
    return max(Q[state], key=lambda a: Q[state][a])

Pattern 2: Epsilon-Greedy Exploration

def epsilon_greedy(Q, state, epsilon=0.1):
    if np.random.random() < epsilon:
        return np.random.choice(list(Q[state].keys()))
    return max(Q[state], key=lambda a: Q[state][a])

Pattern 3: Softmax Action Selection

def softmax_policy(Q, state, temperature=1.0):
    actions = list(Q[state].keys())
    q_values = np.array([Q[state][a] for a in actions])
    # Subtract the max before exponentiating for numerical stability
    probs = np.exp((q_values - q_values.max()) / temperature)
    probs /= probs.sum()
    return np.random.choice(actions, p=probs)

Why Q-Values Are Central to RL

ℹ️Note

Q-values are at the heart of most RL algorithms because:

  1. Model-free control: You can act optimally knowing only Q-values, no MDP model needed
  2. Direct optimization: Learning Q-values directly gives you a policy
  3. Off-policy learning: You can learn about optimal actions while exploring suboptimally
  4. Sample efficiency: Each experience updates Q-values for specific actions

Algorithms like Q-learning, SARSA, DQN, and many others are fundamentally about learning good Q-value estimates.

Summary

  • Action-value function $Q^\pi(s, a)$ measures expected return from taking action $a$ in state $s$
  • Q-values enable direct action comparison without knowing the MDP
  • V-Q relationship: $V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)$ and $Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s')$
  • Q-values can be estimated from experience using Monte Carlo methods
  • For optimal action selection: just pick $\arg\max_a Q^*(s,a)$

We’ve seen how values work for a fixed policy. But what are the best possible values? That’s where optimal value functions come in.