Chapter 12
📝Draft

Value Functions

Measuring how good states and actions are

Prerequisites:

Markov Decision Processes

What You'll Learn

  • Define state-value and action-value functions
  • Compute values for simple MDPs by hand
  • Explain the relationship between V(s) and Q(s,a)
  • Define optimal value functions V* and Q*
  • Explain why knowing optimal values lets us derive optimal policies

Now that we know how to describe sequential decision problems using MDPs, we need a way to measure success. How good is it to be in a particular state? How good is a particular action? Value functions answer these questions—and they’re the key to finding optimal behavior.

The Central Question

Consider a robot navigating a building, an agent playing a game, or an algorithm making trading decisions. At any moment, they face the question: “How am I doing?”

This seems simple, but answering it requires thinking about the future. A position that looks good now might lead to disaster later. A sacrifice today might pay off handsomely tomorrow.

Imagine playing chess. You look at the board and think: “Am I winning or losing?”

That intuitive assessment—collapsing all future possibilities into a single judgment—is exactly what value functions do. They tell you how “good” a position is, accounting for everything that might happen next.

📖Value Function

A function that maps states (or state-action pairs) to the expected cumulative reward an agent can achieve from that point forward, following a particular policy.

Value functions compress the infinite complexity of possible futures into a single number. If you know V*(s), you know everything you need about the long-term prospects of being in state s. This compression is what makes planning tractable.

Two Types of Value Functions

We’ll study two related but distinct value functions:

State-Value Function V^π(s)

How good is it to be in this state?

Expected return when starting from state s and following policy π

Action-Value Function Q^π(s,a)

How good is it to take this action in this state?

Expected return when taking action a in state s, then following π

The key difference: V-values tell you where to be, Q-values tell you what to do. This makes Q-values more directly useful for decision-making.

Values Depend on Policy

A crucial insight: the value of a state depends on the policy. The same state can have very different values under different policies.

📌Policy Matters

In a simple navigation task, consider being 5 steps from a goal worth +10:

Under an optimal policy (always moves toward goal):

  • Value is approximately +10 × 0.9^5 ≈ 5.9 (discounted goal reward)

Under a random policy (wanders aimlessly):

  • Value might be −20 (accumulates step penalties while wandering)

Same state, vastly different values. The policy determines the value.
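This contrast can be checked numerically. The sketch below uses an illustrative setup (a 6-state corridor, −1 per step, +10 on reaching the goal, γ = 0.9 — assumptions for this example, not the chapter's exact numbers) and runs iterative policy evaluation for two policies: always-toward-the-goal versus uniformly random.

```python
# Policy evaluation on a 6-state corridor: states 0..4, goal at state 5.
# Illustrative assumptions: -1 reward per step, +10 on entering the goal,
# discount gamma = 0.9. The goal is terminal (value 0 once reached).
GAMMA = 0.9
GOAL = 5

def step(s, a):
    """Deterministic move; a = +1 (right) or -1 (left); walls bounce back."""
    s2 = min(max(s + a, 0), GOAL)
    r = -1 + (10 if s2 == GOAL else 0)
    return s2, r

def evaluate(policy, iters=2000):
    """Iteratively solve V(s) = sum_a pi(a|s) * [r + gamma * V(s')]."""
    V = [0.0] * (GOAL + 1)
    for _ in range(iters):
        for s in range(GOAL):                  # goal is terminal, V(goal) = 0
            V[s] = sum(p * (r + GAMMA * V[s2])
                       for a, p in policy(s)
                       for s2, r in [step(s, a)])
    return V

optimal = lambda s: [(+1, 1.0)]                # always move toward the goal
random_pi = lambda s: [(+1, 0.5), (-1, 0.5)]   # wander aimlessly

v_opt, v_rand = evaluate(optimal), evaluate(random_pi)
print(f"optimal: V(0) = {v_opt[0]:+.2f}   random: V(0) = {v_rand[0]:+.2f}")
```

The same starting state comes out positive under the goal-directed policy and negative under the random one, because the random walk keeps paying step penalties while the discounted goal reward shrinks.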

Why Values Matter

ℹ️Note

Value functions are predictions. They predict the expected cumulative reward. Good predictions enable good decisions.

If you know the value of every state you could end up in, you can evaluate any policy and make intelligent choices:

  1. Evaluation: Compute V^π to see how good policy π is
  2. Improvement: Use values to find better policies
  3. Control: Act to maximize long-term value

This is the foundation of most RL algorithms.

Chapter Overview

This chapter covers value functions in depth:

The Key Equations

Mathematical Details

State-value function: V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \bigg| S_t = s\right]

Action-value function: Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]

V-Q relationship: V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q^\pi(s, a)

Optimal values: V^*(s) = \max_\pi V^\pi(s) \qquad Q^*(s, a) = \max_\pi Q^\pi(s, a)

Optimal policy from Q*: \pi^*(s) = \arg\max_a Q^*(s, a)
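These equations can be exercised end to end on a toy problem. The two-state MDP below is invented for illustration; the sketch computes Q^π for a random policy by fixed-point iteration, recovers V^π through the V-Q relationship, then runs value iteration on Q to get Q* and reads off the greedy policy π*.

```python
import itertools

# Toy 2-state, 2-action MDP (made up for illustration):
# P[s][a] = (next_state, reward). Staying in state 1 pays +2 forever.
P = {0: {0: (0, 0.0), 1: (1, 1.0)},
     1: {0: (0, 0.0), 1: (1, 2.0)}}
GAMMA = 0.9
STATES, ACTIONS = [0, 1], [0, 1]

def q_for_policy(pi, iters=1000):
    """Q^pi(s,a) = r + gamma * sum_a' pi(a'|s') * Q^pi(s',a')."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(iters):
        for s, a in itertools.product(STATES, ACTIONS):
            s2, r = P[s][a]
            Q[(s, a)] = r + GAMMA * sum(pi[s2][a2] * Q[(s2, a2)]
                                        for a2 in ACTIONS)
    return Q

pi = {s: {a: 0.5 for a in ACTIONS} for s in STATES}   # uniform random policy
Q = q_for_policy(pi)

# V-Q relationship: V^pi(s) = sum_a pi(a|s) * Q^pi(s,a)
V = {s: sum(pi[s][a] * Q[(s, a)] for a in ACTIONS) for s in STATES}

def q_star(iters=1000):
    """Value iteration on Q: Q*(s,a) = r + gamma * max_a' Q*(s',a')."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(iters):
        for s, a in itertools.product(STATES, ACTIONS):
            s2, r = P[s][a]
            Q[(s, a)] = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
    return Q

Qs = q_star()
pi_star = {s: max(ACTIONS, key=lambda a: Qs[(s, a)]) for s in STATES}
print("pi* =", pi_star)   # greedy in Q*: take action 1 everywhere
```

Here Q*(1,1) solves Q = 2 + 0.9·Q, giving 20, and the greedy policy picks action 1 in both states — knowing Q* really is enough to act optimally, with no extra planning step.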

Visual Preview: The Value Landscape

In a GridWorld, value functions create a “landscape” that reveals the structure of the problem:

📌GridWorld Values

Consider a 4x4 grid with a goal in the corner. Under an optimal policy:

Values (higher = better):
 _____ _____ _____ _____
|     |     |     |     |
| 4.1 | 4.6 | 5.1 | 5.7 |
|_____|_____|_____|_____|
|     |     |     |     |
| 4.6 |  X  | 5.7 | 6.3 |
|_____|_____|_____|_____|
|     |     |     |     |
| 5.1 | 5.7 | 6.3 | 7.0 |
|_____|_____|_____|_____|
|     |     |     |     |
| 5.7 | 6.3 | 7.0 | 10  |
|_____|_____|_____|_____|
                    Goal

The values increase as you approach the goal, forming a gradient that “points” toward the reward. Following this gradient is essentially what an optimal policy does.
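A few lines of value iteration produce this kind of landscape. The sketch below assumes γ = 0.9, deterministic moves, no obstacle, and a goal value pinned at +10 — so its exact numbers differ from the table above — but the same gradient toward the goal emerges.

```python
# Value iteration on a 4x4 GridWorld with the goal in the bottom-right
# corner. Assumed parameters for illustration: gamma = 0.9, deterministic
# moves, goal value pinned at +10.
N, GOAL, GAMMA = 4, (3, 3), 0.9

def neighbors(s):
    """In-bounds cells reachable by one up/down/left/right move."""
    r, c = s
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(nr, nc) for nr, nc in cand if 0 <= nr < N and 0 <= nc < N]

V = {(r, c): 0.0 for r in range(N) for c in range(N)}
for _ in range(200):
    for s in V:
        if s == GOAL:
            V[s] = 10.0                                      # goal value
        else:
            V[s] = max(GAMMA * V[s2] for s2 in neighbors(s)) # Bellman backup

for r in range(N):
    print("  ".join(f"{V[(r, c)]:5.2f}" for c in range(N)))
```

Each cell converges to 10·γ^d, where d is its Manhattan distance from the goal, so printing the grid shows the values rising smoothly toward the bottom-right corner.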

What Comes Next

After understanding value functions, the natural question is: How do we compute them?

The answer lies in the Bellman equations, which express a beautiful recursive structure: the value of a state depends on the values of its successor states. This recursion enables efficient computation through dynamic programming, and forms the foundation of virtually all RL algorithms.

💡Tip

The progression of ideas in RL follows a clear path:

  1. MDPs define the problem
  2. Value functions measure quality
  3. Bellman equations enable computation
  4. Algorithms (DP, TD, Q-learning) solve for values

Key Takeaways

  • State-value function V^π(s): expected return from state s under policy π
  • Action-value function Q^π(s,a): expected return from taking action a in state s
  • The two are related: V^π(s) = Σ_a π(a|s) Q^π(s,a)
  • Optimal values V* and Q* give the best possible performance
  • Knowing Q* is enough to act optimally: just pick arg max_a Q*(s,a)
  • Value functions are the foundation of most RL algorithms
Next Chapter: The Bellman Equations