Action-Value Functions
State values tell us how good it is to be somewhere. But to actually make decisions, we need to know how good each action is. That’s what action-value functions (Q-functions) provide.
The Q-function is perhaps the most important concept in reinforcement learning. It’s the foundation of Q-learning, DQN, and many other algorithms.
The Question Q-Values Answer
State values answer: “How good is this state?”
Action values answer a more directly useful question: “How good is this action in this state?”
The action-value function Q^π(s, a) gives the expected return when starting in state s, taking action a, and then following policy π thereafter:

Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ]
Q-values let you compare actions directly. If you’re in a state s and wondering “should I go left or right?”, you can just look at the Q-values; say Q(s, left) = 2.8 and Q(s, right) = 3.5.

Right is better! The Q-values give you an immediate answer without needing to simulate forward or know the MDP dynamics.
Think of Q-values as an action rating system. Each action gets a score based on where it leads and what follows.
Why Q-Values Matter for Control
Here’s the key insight: V-values tell you where to be, but Q-values tell you what to do.
To use V-values for action selection, you need to know where each action leads (the transition probabilities). To use Q-values, you just need to compare numbers.
This is why Q-learning and its variants are so popular: they learn Q-values directly, enabling action selection without a model of the environment.
Suppose you’re in state s with two actions available: left and right.
Using V-values (requires knowing the MDP):
- Left leads to state s₁ with probability 0.8, and to state s₂ with probability 0.2
- Right leads to state s₃ with probability 1.0
- Must compute: 0.8 · V(s₁) + 0.2 · V(s₂) for left
- Then compare with V(s₃) for right
Using Q-values (no model needed):
- Look up Q(s, left)
- Look up Q(s, right)
- Right is better. Done!
Q-values have “pre-computed” the effect of the transitions.
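To make the contrast concrete, here is a small sketch of both routes. Everything in it (the states s1–s3, the values, the transition table, and the Q numbers) is made up for illustration, and the immediate reward and discount are omitted for brevity; the point is the shape of the computation:

```python
# Model-based route: picking an action via V-values needs the
# transition model (here just E[V(next)]; reward and discount omitted).
V = {'s1': 4.0, 's2': -2.0, 's3': 3.5}
transitions = {
    'left':  [('s1', 0.8), ('s2', 0.2)],   # action -> [(next_state, prob), ...]
    'right': [('s3', 1.0)],
}

def expected_next_value(action):
    """E[V(s')] when taking `action`, using the known model."""
    return sum(p * V[s] for s, p in transitions[action])

model_based_choice = max(transitions, key=expected_next_value)

# Model-free route: the Q-values have "pre-computed" the transitions,
# so action selection is a direct lookup.
Q = {'left': 2.8, 'right': 3.5}
model_free_choice = max(Q, key=Q.get)

print(model_based_choice, model_free_choice)  # both pick 'right'
```

The two routes agree here, but only the Q-value route works without knowing `transitions`.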
The V-Q Relationship
State values and action values are intimately connected. Understanding this relationship is crucial for RL.
From Q to V:

V^π(s) = Σ_a π(a|s) · Q^π(s, a)

The state value is the weighted average of action values, weighted by the policy’s action probabilities. If the policy always takes action a, then V^π(s) = Q^π(s, a).
From V to Q:

Q^π(s, a) = Σ_{s'} P(s'|s, a) · [ R(s, a, s') + γ · V^π(s') ]

The action value is the expected immediate reward plus the discounted value of the next state. This direction requires knowing the transition probabilities P(s'|s, a).
Think of it this way:
- V-values compress your situation into one number
- Q-values give you a number for each possible move from that situation
- V^π(s) is like your “average prospects” from state s
- Q^π(s, a) is your “prospects if you take this specific action”
If you know all the Q-values at a state, you can compute V by averaging. If you know V and the MDP dynamics, you can compute Q.
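Both directions are one-liners once you have the pieces. The numbers below (policy probabilities, transition model, rewards, values) are illustrative assumptions:

```python
gamma = 0.9

# From Q to V: V(s) = sum_a pi(a|s) * Q(s, a)
Q_s = {'left': 2.0, 'right': 6.0}          # Q-values at state s
pi_s = {'left': 0.25, 'right': 0.75}       # policy pi(a|s)
V_s = sum(pi_s[a] * Q_s[a] for a in Q_s)   # 0.25*2 + 0.75*6 = 5.0

# From V to Q: Q(s, a) = sum_s' P(s'|s, a) * (R(s, a, s') + gamma * V(s'))
V = {'s1': 10.0, 's2': 4.0}                # next-state values
P_right = {'s1': 0.8, 's2': 0.2}           # P(s'|s, right)
R_right = {'s1': 1.0, 's2': 0.0}           # R(s, right, s')
Q_s_right = sum(P_right[sp] * (R_right[sp] + gamma * V[sp]) for sp in P_right)
# 0.8*(1 + 9.0) + 0.2*(0 + 3.6) = 8.72

print(V_s, Q_s_right)
```

Note that the Q-to-V direction needed only the policy, while the V-to-Q direction needed the full model.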
Computing Q-Values: A Worked Example
Consider a simple 2x2 GridWorld:
 _____ _____
|     |     |
|  S  | +5  |
|_____|_____|
|     |     |
| -1  |  G  |
|_____|_____|
S = Start (top-left)
G = Goal with +10 reward (bottom-right)
+5 = Bonus cell
-1 = Penalty cell
Step cost: -0.1
γ = 0.9

Let’s compute Q-values for state S, assuming optimal play after the first action. From S, we can go right or down.
Q(S, right):
- Immediate reward: -0.1 (step) + 5 (bonus) = 4.9
- Next state: top-right cell
- From there, optimal play reaches the goal in one step, earning -0.1 + 10 = 9.9
- So: Q(S, right) = 4.9 + 0.9 × 9.9 = 13.81
Q(S, down):
- Immediate reward: -0.1 (step) - 1 (penalty) = -1.1
- Next state: bottom-left cell
- From there, optimal play reaches the goal in one step, earning -0.1 + 10 = 9.9
- So: Q(S, down) = -1.1 + 0.9 × 9.9 = 7.81
Conclusion: Going right (Q = 13.81) is much better than going down (Q = 7.81). The bonus on the right path makes a big difference.
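These numbers are easy to double-check in a few lines, using the grid’s rewards (step cost -0.1, bonus +5, penalty -1, goal +10, γ = 0.9) and treating the next cell’s value as the one-step return of moving into the goal:

```python
# Rough check of the 2x2 GridWorld Q-values, assuming the policy plays
# optimally after the first action and the episode ends at the goal.
gamma = 0.9
step = -0.1

# Value of either intermediate cell: one more step into the +10 goal.
v_next = step + 10                       # -0.1 + 10 = 9.9

q_right = (step + 5) + gamma * v_next    # via the +5 bonus cell
q_down = (step - 1) + gamma * v_next     # via the -1 penalty cell

print(round(q_right, 2), round(q_down, 2))
```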
Q-Values for Different Policies
Like state values, Q-values depend on the policy being followed after the initial action.
Q^π(s, a) is the value of taking action a and then following π. Even if a is a bad action under π, we’re forced to take it first. The policy only controls what happens afterwards.
Consider a state s where action “right” leads to state s′, which offers a choice between a safe path and a risky path.

Under a cautious policy (always takes the safe path from s′, worth a certain +5), Q(s, right) is built on that +5.

Under a risky policy (takes the risky path from s′: 50% chance of +20, 50% chance of -10), the expected payoff is 0.5 × 20 + 0.5 × (−10) = +5.

Wait, they’re the same! That’s because the risky path’s expected payoff happens to equal the safe path’s certain +5.
But if the risky path had different probabilities, the Q-values would differ even for the same initial action.
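A two-line check makes the coincidence, and how fragile it is, explicit. The safe path’s certain payoff of +5 is an illustrative assumption:

```python
# Expected payoff of each continuation (illustrative numbers).
safe_value = 5.0                          # assumed certain payoff of the safe path
risky_value = 0.5 * 20 + 0.5 * (-10)      # 50/50 between +20 and -10 -> 5.0

# Nudge the probabilities and the equality breaks.
riskier_value = 0.4 * 20 + 0.6 * (-10)    # -> 2.0

print(safe_value == risky_value, riskier_value)
```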
Q-Values and Optimal Action Selection
Here’s the beautiful result that makes Q-learning possible:
If you know Q*(s, a) (the optimal Q-values), finding the optimal action is trivial:

π*(s) = argmax_a Q*(s, a)

Just pick the action with the highest Q-value. No need to know transitions, no need to plan ahead. The Q-values have already captured everything about future rewards.
This is why so much of RL research focuses on learning good Q-value estimates.
Q-Values for Stochastic Environments
When the environment is stochastic, Q-values average over possible outcomes:
With stochastic transitions:

Q^π(s, a) = Σ_{s'} P(s'|s, a) · [ R(s, a, s') + γ · V^π(s') ]

Or equivalently:

Q^π(s, a) = E[ R_{t+1} + γ · V^π(S_{t+1}) | S_t = s, A_t = a ]
The Q-value for an action is an expected value over where that action might lead.
Action “attempt-right” in state s:
- 80% chance: actually move right (reward: +1, next state value: 10)
- 20% chance: slip and stay in place (reward: 0, next state value: 5)
With γ = 0.9:

Q(s, attempt-right) = 0.8 × (1 + 0.9 × 10) + 0.2 × (0 + 0.9 × 5) = 8.0 + 0.9 = 8.9
The Q-value captures the expected outcome, accounting for the uncertainty in the environment.
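The same arithmetic, spelled out with the example’s numbers:

```python
# Q(s, attempt-right) as an expectation over the two outcomes.
gamma = 0.9
outcome_right = 1 + gamma * 10   # moved right: reward +1, next-state value 10
outcome_slip = 0 + gamma * 5     # slipped:     reward  0, next-state value  5
q = 0.8 * outcome_right + 0.2 * outcome_slip   # 0.8*10 + 0.2*4.5 = 8.9
print(q)
```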
Implementation
Here’s how to work with Q-values in Python:
import numpy as np
from collections import defaultdict
# Q-values as a nested dictionary: Q[state][action] = value
Q = defaultdict(lambda: defaultdict(float))
# Set some Q-values
Q['start']['left'] = 3.2
Q['start']['right'] = 7.8
Q['start']['up'] = 2.1
def get_best_action(Q, state):
"""Return the action with highest Q-value for a state."""
actions = Q[state]
if not actions:
return None
return max(actions, key=lambda a: actions[a])
def get_state_value(Q, state, policy=None):
"""
Compute V(s) from Q-values.
If policy is None, use greedy (max) policy.
Otherwise, policy[state] should be a dict of {action: probability}.
"""
actions = Q[state]
if not actions:
return 0.0
if policy is None:
# Greedy: V(s) = max_a Q(s,a)
return max(actions.values())
else:
# Stochastic: V(s) = sum_a π(a|s) * Q(s,a)
probs = policy.get(state, {})
return sum(probs.get(a, 0) * q for a, q in actions.items())
# Example: Greedy action selection
best = get_best_action(Q, 'start')
print(f"Best action from 'start': {best}") # right
# Example: State value under greedy policy
v = get_state_value(Q, 'start')
print(f"V(start) under greedy policy: {v}") # 7.8
# Example: State value under uniform policy
uniform_policy = {
'start': {'left': 1/3, 'right': 1/3, 'up': 1/3}
}
v_uniform = get_state_value(Q, 'start', uniform_policy)
print(f"V(start) under uniform policy: {v_uniform:.2f}")  # 4.37

Monte Carlo Q-Value Estimation
Just as we can estimate V-values from episodes, we can estimate Q-values:
import numpy as np
from collections import defaultdict
def estimate_q_values_mc(episodes, gamma=0.99):
"""
Estimate Q(s,a) using Monte Carlo from episodes.
Each episode is a list of (state, action, reward) tuples.
"""
returns = defaultdict(list) # returns[(s,a)] = list of observed returns
for episode in episodes:
# Calculate returns for each (state, action) pair
G = 0
# Work backward through the episode
for t in range(len(episode) - 1, -1, -1):
state, action, reward = episode[t]
G = reward + gamma * G
returns[(state, action)].append(G)
# Average the returns for each (state, action)
Q = defaultdict(dict)
for (state, action), obs_returns in returns.items():
Q[state][action] = np.mean(obs_returns)
return Q
# Example episodes
episodes = [
# Episode 1: start --right--> middle --right--> goal
[('start', 'right', -1), ('middle', 'right', 10)],
# Episode 2: start --down--> bottom --right--> goal
[('start', 'down', -1), ('bottom', 'right', 10)],
# Episode 3: start --right--> middle --right--> goal
[('start', 'right', -1), ('middle', 'right', 10)],
]
Q = estimate_q_values_mc(episodes, gamma=0.9)
print("Estimated Q-values:")
for state in sorted(Q.keys()):
for action, value in Q[state].items():
        print(f"  Q({state}, {action}) = {value:.2f}")

Output:
Estimated Q-values:
Q(bottom, right) = 10.00
Q(middle, right) = 10.00
Q(start, down) = 8.00
  Q(start, right) = 8.00

Note: Q(start, right) = -1 + 0.9 * 10 = 8.0, which matches our estimate!
Q-Tables and Visualization
For small, discrete problems, Q-values are often stored in a Q-table:
import numpy as np
class QTable:
"""A simple Q-table for discrete state and action spaces."""
def __init__(self, n_states, n_actions, init_value=0.0):
self.table = np.full((n_states, n_actions), init_value)
def __getitem__(self, key):
state, action = key
return self.table[state, action]
def __setitem__(self, key, value):
state, action = key
self.table[state, action] = value
def get_best_action(self, state):
"""Return action with highest Q-value."""
return np.argmax(self.table[state])
def get_action_values(self, state):
"""Return Q-values for all actions in a state."""
return self.table[state].copy()
def visualize(self, state_names=None, action_names=None):
"""Print the Q-table nicely formatted."""
n_states, n_actions = self.table.shape
if state_names is None:
state_names = [f"s{i}" for i in range(n_states)]
if action_names is None:
action_names = [f"a{i}" for i in range(n_actions)]
# Header
header = "State | " + " | ".join(f"{a:>8}" for a in action_names)
print(header)
print("-" * len(header))
# Rows
for i, state in enumerate(state_names):
values = " | ".join(f"{self.table[i,j]:>8.2f}"
for j in range(n_actions))
print(f"{state:>5} | {values}")
# Example: 4-state, 2-action Q-table
Q = QTable(4, 2)
Q[0, 0] = 5.2 # Q(state0, action0)
Q[0, 1] = 8.7 # Q(state0, action1)
Q[1, 0] = 3.1
Q[1, 1] = 3.5
Q[2, 0] = 6.0
Q[2, 1] = 5.8
Q[3, 0] = 9.9
Q[3, 1] = 10.0 # Goal state
Q.visualize(
state_names=['Start', 'Left', 'Right', 'Goal'],
action_names=['Left', 'Right']
)

Output:
State |     Left |    Right
---------------------------
Start |     5.20 |     8.70
 Left |     3.10 |     3.50
Right |     6.00 |     5.80
 Goal |     9.90 |    10.00

The best action in each state is the one with the higher value (shown by comparing columns).
Common Patterns with Q-Values
Pattern 1: Greedy Policy Extraction
def greedy_policy(Q, state):
    return max(Q[state], key=lambda a: Q[state][a])

Pattern 2: Epsilon-Greedy Exploration
def epsilon_greedy(Q, state, epsilon=0.1):
if np.random.random() < epsilon:
return np.random.choice(list(Q[state].keys()))
    return max(Q[state], key=lambda a: Q[state][a])

Pattern 3: Softmax Action Selection
def softmax_policy(Q, state, temperature=1.0):
actions = list(Q[state].keys())
q_values = np.array([Q[state][a] for a in actions])
    # Subtract the max before exponentiating to avoid overflow;
    # this shift leaves the softmax probabilities unchanged.
    probs = np.exp((q_values - q_values.max()) / temperature)
probs /= probs.sum()
    return np.random.choice(actions, p=probs)

Why Q-Values Are Central to RL
Q-values are at the heart of most RL algorithms because:
- Model-free control: You can act optimally knowing only Q-values, no MDP model needed
- Direct optimization: Learning Q-values directly gives you a policy
- Off-policy learning: You can learn about optimal actions while exploring suboptimally
- Sample efficiency: Each experience updates Q-values for specific actions
Algorithms like Q-learning, SARSA, DQN, and many others are fundamentally about learning good Q-value estimates.
Summary
- Action-value function Q^π(s, a) measures the expected return from taking action a in state s, then following π
- Q-values enable direct action comparison without knowing the MDP
- V-Q relationship: V^π(s) = Σ_a π(a|s) · Q^π(s, a) and Q^π(s, a) = Σ_{s'} P(s'|s, a) · [ R(s, a, s') + γ · V^π(s') ]
- Q-values can be estimated from experience using Monte Carlo methods
- For optimal action selection: just pick argmax_a Q*(s, a)
We’ve seen how values work for a fixed policy. But what are the best possible values? That’s where optimal value functions come in.