Foundations • Part 3 of 4

Policies and Value Functions

How agents represent knowledge

Two Ways to Encode Knowledge

How does an RL agent remember what it has learned? There are two fundamental representations: policies and value functions.

🎯
Policy
π(s) or π(a|s)

What to do in each situation

“In state A, go right. In state B, go up.”

📊
Value Function
V(s) or Q(s, a)

How good each situation is

“State A is worth 10. State B is worth 3.”

The Policy: Your Agent’s Playbook

📖Policy

A policy π maps states to actions. It is the agent’s strategy: given the current situation, what should I do?

📌GPS Navigation as a Policy

Your GPS has a policy: given your current location (state), it tells you which turn to make (action). The policy might be:

  • At Main St. → turn right
  • At Oak Ave. → go straight
  • At destination → stop

Policies can be:

Deterministic
π(s) = a
Always take the same action in the same state
Stochastic
π(a|s) = P(action a in state s)
Probability distribution over actions, which allows exploration
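As a minimal sketch, both kinds of policy can be written as plain Python functions over a toy set of states (the state and action names are illustrative, borrowed from the GPS example above):

```python
import random

# Deterministic policy: a fixed state -> action mapping, pi(s) = a.
DETERMINISTIC = {
    "Main St.": "turn right",
    "Oak Ave.": "go straight",
}

def pi_deterministic(state):
    """Always returns the same action for the same state."""
    return DETERMINISTIC[state]

# Stochastic policy: a probability distribution over actions, pi(a|s).
STOCHASTIC = {
    "Main St.": {"turn right": 0.9, "go straight": 0.1},
    "Oak Ave.": {"go straight": 0.8, "turn left": 0.2},
}

def pi_stochastic(state):
    """Samples an action, so the agent occasionally tries alternatives."""
    actions = list(STOCHASTIC[state])
    weights = list(STOCHASTIC[state].values())
    return random.choices(actions, weights=weights)[0]
```

Calling `pi_deterministic("Main St.")` always yields `"turn right"`, while `pi_stochastic("Main St.")` yields `"turn right"` most of the time but occasionally `"go straight"`; that built-in randomness is what makes stochastic policies useful for exploration.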

The Value Function: Rating States and Actions

📖Value Function

A value function estimates how good it is to be in a state (or take an action). “Good” means the expected cumulative reward from that point forward.

📌Chess Position Evaluation

Strong chess players can look at a board and estimate who’s winning. This is a value function—rating positions based on expected outcome. A position with material advantage, good piece activity, and king safety has high value.

There are two flavors:

V(s)
State Value
How good is it to be in this state?
“This chess position is worth +2 pawns”
Q(s, a)
Action Value (Q-value)
How good is it to take action a in state s?
“Moving the knight here is worth +1.5 pawns”
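The two flavors are linked: under a stochastic policy, the state value is the policy-weighted average of the action values, V(s) = Σₐ π(a|s) Q(s, a). A toy sketch with made-up numbers (one fixed state, two candidate moves):

```python
# pi(a|s) for one fixed state s: which move the policy prefers.
pi = {"knight": 0.7, "bishop": 0.3}

# Q(s, a) for the same state, in "pawns" (made-up values).
Q = {"knight": 1.5, "bishop": 0.5}

# V(s) is the expectation of Q(s, a) over the policy's action choices.
V = sum(pi[a] * Q[a] for a in pi)  # ≈ 1.2
```

So a state is exactly as valuable as the actions the policy actually takes from it, weighted by how often it takes them.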
ℹ️Q-Values Are Especially Useful

If you know Q(s, a) for all actions, you can derive a policy: just pick the action with the highest Q-value. This is the foundation of Q-learning, which we’ll study in depth later.
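A minimal sketch of that derivation, using a hypothetical Q-table with made-up values:

```python
# Hypothetical Q-table: Q[s][a] for 2 states and 3 actions (values invented).
Q = [
    [1.0, 3.5, 0.2],   # state 0: action 1 has the highest Q-value
    [0.0, -1.0, 2.0],  # state 1: action 2 has the highest Q-value
]

def greedy_policy(Q, state):
    """Derive a policy from Q-values: pick the highest-valued action."""
    actions = range(len(Q[state]))
    return max(actions, key=lambda a: Q[state][a])
```

Here `greedy_policy(Q, 0)` returns action 1 and `greedy_policy(Q, 1)` returns action 2: the Q-table alone is enough to act, which is why many algorithms learn Q rather than the policy directly.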

The Connection: Policy ↔ Value

Policies and values are deeply connected:

Policy π
What to do
evaluate
improve
Value V or Q
How good it is

↻ This loop drives learning in many RL algorithms

Policy → Value

Given a policy, we can evaluate it by computing the value of each state under that policy.

Value → Policy

Given value estimates, we can improve the policy by choosing actions that lead to higher-value states.

This evaluate-improve cycle is at the heart of many RL algorithms. We’ll see it again and again.
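The cycle can be sketched in a few lines of Python on a made-up two-state MDP (the transitions, rewards, and discount factor are all invented for illustration; this is the classic policy-iteration pattern, not any one book algorithm):

```python
# Toy deterministic MDP: P[s][a] = (next_state, reward). Values invented.
P = {
    0: {0: (0, 0.0), 1: (1, 1.0)},   # state 0: "stay" or "move" to state 1
    1: {0: (1, 2.0), 1: (0, 0.0)},   # state 1: "stay" (reward 2) or go back
}
gamma = 0.9  # discount factor

def evaluate(policy, n_sweeps=200):
    """Policy -> Value: compute V for each state under a fixed policy."""
    V = {s: 0.0 for s in P}
    for _ in range(n_sweeps):
        for s in P:
            s2, r = P[s][policy[s]]
            V[s] = r + gamma * V[s2]
    return V

def improve(V):
    """Value -> Policy: act greedily with respect to the current V."""
    return {s: max(P[s], key=lambda a: P[s][a][1] + gamma * V[P[s][a][0]])
            for s in P}

policy = {0: 0, 1: 0}        # start from an arbitrary policy
for _ in range(10):          # the evaluate-improve loop
    V = evaluate(policy)
    policy = improve(V)
```

On this toy MDP the loop settles quickly: the improved policy moves from state 0 to state 1 and then stays there collecting the reward of 2, and the values stop changing once the policy is stable.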

💡Different Algorithms, Different Approaches
  • Value-based methods (like Q-learning): Learn values, derive policy from them
  • Policy-based methods (like REINFORCE): Learn policy directly, skip values
  • Actor-Critic methods: Learn both simultaneously

We’ll explore all three approaches in this book.