Two Ways to Encode Knowledge
How does an RL agent remember what it has learned? There are two fundamental representations: policies and value functions.
- A policy encodes what to do in each situation: “In state A, go right. In state B, go up.”
- A value function encodes how good each situation is: “State A is worth 10. State B is worth 3.”
The Policy: Your Agent’s Playbook
A policy maps states to actions. It’s the agent’s strategy—given the current situation, what should I do?
Your GPS has a policy: given your current location (state), it tells you which turn to make (action). The policy might be:
- At Main St. → turn right
- At Oak Ave. → go straight
- At destination → stop
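The GPS playbook above can be sketched as a plain lookup table. The state and action strings here are purely illustrative:

```python
# A deterministic policy is just a mapping: state -> action.
# States and actions are illustrative strings, not from any library.
policy = {
    "Main St.": "turn right",
    "Oak Ave.": "go straight",
    "destination": "stop",
}

def act(state):
    """Return the action the policy prescribes for this state."""
    return policy[state]

print(act("Main St."))  # turn right
```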
Policies can be:
- Deterministic: each state maps to exactly one action
- Stochastic: each state maps to a probability distribution over actions
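A stochastic policy can be sketched as a distribution over actions per state, with acting meaning sampling from that distribution. All states, actions, and probabilities below are invented for illustration:

```python
import random

# A stochastic policy maps each state to a probability distribution
# over actions. These states, actions, and probabilities are made up.
stochastic_policy = {
    "A": {"right": 0.9, "up": 0.1},
    "B": {"right": 0.2, "up": 0.8},
}

def sample_action(state):
    """Sample an action according to the policy's distribution."""
    actions = list(stochastic_policy[state])
    weights = list(stochastic_policy[state].values())
    return random.choices(actions, weights=weights)[0]

# In state "A" this usually returns "right", occasionally "up".
```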
The Value Function: Rating States and Actions
A value function estimates how good it is to be in a state (or take an action). “Good” means the expected cumulative reward from that point forward.
Strong chess players can look at a board and estimate who’s winning. This is a value function—rating positions based on expected outcome. A position with material advantage, good piece activity, and king safety has high value.
There are two flavors:
- The state-value function, V(s): the expected cumulative reward starting from state s
- The action-value function, Q(s, a): the expected cumulative reward starting from state s and taking action a
If you know Q(s, a) for all actions, you can derive a policy: just pick the action with the highest Q-value. This is the foundation of Q-learning, which we’ll study in depth later.
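Deriving a policy from Q-values is a one-liner per state. The Q-table below is invented for illustration:

```python
# Action-value estimates: (state, action) -> expected cumulative reward.
# These numbers are made up for illustration.
Q = {
    ("A", "right"): 10.0,
    ("A", "up"): 3.0,
    ("B", "right"): 1.0,
    ("B", "up"): 7.0,
}

def greedy_action(state):
    """Pick the action with the highest Q-value in this state."""
    candidates = [(a, q) for (s, a), q in Q.items() if s == state]
    return max(candidates, key=lambda pair: pair[1])[0]

print(greedy_action("A"))  # right
print(greedy_action("B"))  # up
```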
The Connection: Policy ↔ Value
Policies and values are deeply connected:
- Evaluate: given a policy, we can compute the value of each state under that policy.
- Improve: given value estimates, we can improve the policy by choosing actions that lead to higher-value states.
This evaluate-improve cycle is at the heart of many RL algorithms. We’ll see it again and again.
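The evaluate-improve cycle can be sketched as a tiny policy-iteration loop. The two-state MDP below, with its transitions and rewards, is entirely made up; the point is the alternation between evaluation and greedy improvement:

```python
# A minimal policy-iteration sketch on an invented MDP:
# states "A" and "B", a terminal "goal", illustrative rewards.
GAMMA = 0.9
# transitions[state][action] = (next_state, reward)
transitions = {
    "A": {"right": ("B", 0.0), "up": ("A", -1.0)},
    "B": {"right": ("goal", 10.0), "up": ("A", 0.0)},
}

def evaluate(policy, sweeps=100):
    """Estimate V(s) for a fixed policy by repeated sweeps."""
    V = {"A": 0.0, "B": 0.0, "goal": 0.0}
    for _ in range(sweeps):
        for s, a in policy.items():
            nxt, r = transitions[s][a]
            V[s] = r + GAMMA * V[nxt]
    return V

def improve(V):
    """Make the policy greedy with respect to the current values."""
    return {
        s: max(acts, key=lambda a: acts[a][1] + GAMMA * V[acts[a][0]])
        for s, acts in transitions.items()
    }

policy = {"A": "up", "B": "up"}  # start with a deliberately bad policy
for _ in range(3):               # the evaluate-improve loop
    V = evaluate(policy)
    policy = improve(V)

print(policy)  # {'A': 'right', 'B': 'right'}
```

After a few iterations the loop settles on the policy that heads to the goal, which is exactly the fixed point the evaluate-improve cycle seeks.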
- Value-based methods (like Q-learning): Learn values, derive policy from them
- Policy-based methods (like REINFORCE): Learn policy directly, skip values
- Actor-Critic methods: Learn both simultaneously
We’ll explore all three approaches in this book.