Value Functions
What You'll Learn
- Define state-value and action-value functions
- Compute values for simple MDPs by hand
- Explain the relationship between V(s) and Q(s,a)
- Define optimal value functions V* and Q*
- Explain why knowing optimal values lets us derive optimal policies
Now that we know how to describe sequential decision problems using MDPs, we need a way to measure success. How good is it to be in a particular state? How good is a particular action? Value functions answer these questions—and they’re the key to finding optimal behavior.
The Central Question
Consider a robot navigating a building, an agent playing a game, or an algorithm making trading decisions. At any moment, they face the question: “How am I doing?”
This seems simple, but answering it requires thinking about the future. A position that looks good now might lead to disaster later. A sacrifice today might pay off handsomely tomorrow.
Imagine playing chess. You look at the board and think: “Am I winning or losing?”
That intuitive assessment—collapsing all future possibilities into a single judgment—is exactly what value functions do. They tell you how “good” a position is, accounting for everything that might happen next.
Value function: a function that maps states (or state-action pairs) to the expected cumulative reward an agent can achieve from that point forward, following a particular policy.
Value functions compress the infinite complexity of possible futures into a single number. If you know $V^\pi(s)$, you know everything you need about the long-term prospects of being in state $s$. This compression is what makes planning tractable.
Two Types of Value Functions
We’ll study two related but distinct value functions:
State-Value Function
How good is it to be in this state?
Expected return when starting from state $s$ and following policy $\pi$
Action-Value Function
How good is it to take this action in this state?
Expected return when taking action $a$ in state $s$, then following $\pi$
The key difference: V-values tell you where to be, Q-values tell you what to do. This makes Q-values more directly useful for decision-making.
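To make the difference concrete, here is a minimal sketch (the action names and Q-values are made up, not from any real task): acting greedily from Q-values is a pure lookup, whereas acting from V-values would also require a model of where each action leads.

```python
# Made-up Q-values for a single state; the action names are hypothetical.
q_values = {"up": 1.2, "down": -0.5, "left": 0.3, "right": 2.1}

# Acting greedily from Q needs nothing beyond the table itself.
best_action = max(q_values, key=q_values.get)
print(best_action)  # -> "right"
```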
Values Depend on Policy
A crucial insight: the value of a state depends on the policy. The same state can have very different values under different policies.
In a simple navigation task, consider being 5 steps from a goal worth +10:
Under an optimal policy (always moves toward goal):
- Value is high: roughly the +10 goal reward, discounted over the five steps needed to reach it (minus any step penalties)
Under a random policy (wanders aimlessly):
- Value might be far lower, even negative (the agent accumulates step penalties while wandering)
Same state, vastly different values. The policy determines the value.
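A small sketch makes this concrete. The corridor below is an assumed version of the example (step reward -1, discount $\gamma = 0.9$; the text doesn't fix these numbers), evaluated with iterative policy evaluation for both policies:

```python
# Toy corridor: states 0..5, with state 5 a terminal goal worth +10.
# Step reward -1 and gamma = 0.9 are assumptions; the text doesn't fix them.
GAMMA, STEP_R, GOAL_R, GOAL = 0.9, -1.0, 10.0, 5

def next_state(s, a):                 # a = +1 (toward goal) or -1 (away); wall at 0
    return min(max(s + a, 0), GOAL)

def reward(s_next):
    return GOAL_R if s_next == GOAL else STEP_R

def evaluate(policy, sweeps=500):
    """Iterative policy evaluation: V(s) = sum_a pi(a|s) * [r + gamma * V(s')]."""
    V = [0.0] * (GOAL + 1)            # V[GOAL] stays 0 (terminal)
    for _ in range(sweeps):
        for s in range(GOAL):
            V[s] = sum(p * (reward(next_state(s, a)) + GAMMA * V[next_state(s, a)])
                       for a, p in policy(s).items())
    return V

optimal = lambda s: {+1: 1.0}                # always step toward the goal
random_walk = lambda s: {-1: 0.5, +1: 0.5}   # wander aimlessly

print(evaluate(optimal)[0])       # about 3.1: four -1 steps, then the discounted +10
print(evaluate(random_walk)[0])   # negative: penalties pile up while wandering
```

The exact gap between the two numbers depends on the assumed step penalty and discount, but the qualitative picture holds: the same start state is valuable under one policy and costly under the other.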
Why Values Matter
Value functions are predictions. They predict the expected cumulative reward. Good predictions enable good decisions.
If you know the value of every state you could end up in, you can evaluate any policy and make intelligent choices:
- Evaluation: Compute $V^\pi$ to see how good policy $\pi$ is
- Improvement: Use values to find better policies
- Control: Act to maximize long-term value
This is the foundation of most RL algorithms.
Chapter Overview
This chapter covers value functions in depth:
State Value Functions
The formal definition of $V^\pi(s)$, how values depend on policy, and how to compute them
Action Value Functions
The Q-function $Q^\pi(s,a)$, why it's more useful for control, and the V-Q relationship
Optimal Value Functions
The best possible values $V^*(s)$ and $Q^*(s,a)$, and how they lead to optimal policies
The Key Equations
State-value function: $V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$
Action-value function: $Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$
V-Q relationship: $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)$
Optimal values: $V^*(s) = \max_\pi V^\pi(s)$, $\quad Q^*(s,a) = \max_\pi Q^\pi(s,a)$
Optimal policy from Q*: $\pi^*(s) = \arg\max_a Q^*(s,a)$
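The last two relationships are easy to check numerically. Here is a minimal sketch with made-up Q-values and an arbitrary policy (none of these numbers come from a real MDP):

```python
import numpy as np

# Illustrative Q-values for 3 states x 2 actions, and a policy pi(a|s)
# given as a row-stochastic matrix. All numbers are arbitrary.
Q = np.array([[1.0, 2.0],
              [0.5, 0.0],
              [3.0, 2.5]])
pi = np.array([[0.5, 0.5],    # state 0: 50/50 between the two actions
               [0.9, 0.1],
               [0.2, 0.8]])

# V-Q relationship: V(s) = sum_a pi(a|s) * Q(s, a)
V = (pi * Q).sum(axis=1)
print(V)                      # [1.5, 0.45, 2.6]

# Acting from Q*: the greedy policy picks argmax_a Q*(s, a) in each state
greedy_actions = Q.argmax(axis=1)
print(greedy_actions)         # [1, 0, 0]
```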
Visual Preview: The Value Landscape
In a GridWorld, value functions create a “landscape” that reveals the structure of the problem:
Consider a 4x4 grid with a goal in the corner. Under an optimal policy:
Values (higher = better):
_____ _____ _____ _____
| | | | |
| 4.1 | 4.6 | 5.1 | 5.7 |
|_____|_____|_____|_____|
| | | | |
| 4.6 | X | 5.7 | 6.3 |
|_____|_____|_____|_____|
| | | | |
| 5.1 | 5.7 | 6.3 | 7.0 |
|_____|_____|_____|_____|
| | | | |
| 5.7 | 6.3 | 7.0 | 10 |
|_____|_____|_____|_____|
Goal: bottom-right cell

The values increase as you approach the goal, forming a gradient that “points” toward the reward. Following this gradient is essentially what an optimal policy does.
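If you want to generate a landscape like this yourself, the sketch below runs value iteration on a 4x4 grid. The reward structure (goal +10, step -1, $\gamma = 0.9$, a wall where the X is) is an assumption, since the figure doesn't specify it, so the numbers only qualitatively match; the increasing gradient toward the goal is the point.

```python
import numpy as np

# Value iteration on a 4x4 GridWorld. Goal reward +10, step reward -1,
# gamma = 0.9, and a wall at the "X" cell are assumptions for illustration.
N, GAMMA = 4, 0.9
GOAL, WALL = (3, 3), (1, 1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def move(s, a):
    """Deterministic move; bumping into the edge or the wall leaves you in place."""
    nxt = (s[0] + a[0], s[1] + a[1])
    if not (0 <= nxt[0] < N and 0 <= nxt[1] < N) or nxt == WALL:
        return s
    return nxt

V = np.zeros((N, N))
for _ in range(100):                           # enough sweeps to converge
    for s in np.ndindex(N, N):
        if s in (GOAL, WALL):                  # terminal goal and wall stay fixed
            continue
        V[s] = max((10.0 if move(s, a) == GOAL else -1.0) + GAMMA * V[move(s, a)]
                   for a in ACTIONS)

V[GOAL] = 10.0                                 # show the goal's reward in the printout
print(np.round(V, 1))                          # values rise toward the bottom-right corner
```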
What Comes Next
After understanding value functions, the natural question is: How do we compute them?
The answer lies in the Bellman equations, which express a beautiful recursive structure: the value of a state depends on the values of its successor states. This recursion enables efficient computation through dynamic programming, and forms the foundation of virtually all RL algorithms.
The progression of ideas in RL follows a clear path:
- MDPs define the problem
- Value functions measure quality
- Bellman equations enable computation
- Algorithms (DP, TD, Q-learning) solve for values
Key Takeaways
- State-value function $V^\pi(s)$: expected return from state $s$ under policy $\pi$
- Action-value function $Q^\pi(s,a)$: expected return from taking action $a$ in state $s$, then following $\pi$
- The two are related: $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)$
- Optimal values $V^*(s)$ and $Q^*(s,a)$ give the best possible performance
- Knowing $Q^*$ is enough to act optimally: just pick $\arg\max_a Q^*(s,a)$
- Value functions are the foundation of most RL algorithms