Two Ways to Encode Knowledge
How does an RL agent remember what it has learned? There are two fundamental representations: policies and value functions.
- A policy encodes what to do in each situation: “In state A, go right. In state B, go up.”
- A value function encodes how good each situation is: “State A is worth 10. State B is worth 3.”
The Policy: Your Agent’s Playbook
A policy maps states to actions. It’s the agent’s strategy—given the current situation, what should I do?
Your GPS has a policy: given your current location (state), it tells you which turn to make (action). The policy might be:
- At Main St. → turn right
- At Oak Ave. → go straight
- At destination → stop
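The GPS playbook above can be sketched as a plain lookup table. The state and action strings here are purely illustrative:

```python
# A deterministic policy is just a mapping: state -> action.
# States and actions are illustrative strings, not from any library.
policy = {
    "Main St.": "turn right",
    "Oak Ave.": "go straight",
    "destination": "stop",
}

def act(state):
    """Return the action the policy prescribes for this state."""
    return policy[state]

print(act("Main St."))  # turn right
```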
Policies can be:
- Deterministic: each state maps to exactly one action
- Stochastic: each state maps to a probability distribution over actions
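A stochastic policy can be sketched as a distribution over actions per state, with acting meaning sampling from that distribution. All states, actions, and probabilities below are invented for illustration:

```python
import random

# A stochastic policy maps each state to a probability distribution
# over actions. These states, actions, and probabilities are made up.
stochastic_policy = {
    "A": {"right": 0.9, "up": 0.1},
    "B": {"right": 0.2, "up": 0.8},
}

def sample_action(state):
    """Sample an action according to the policy's distribution."""
    actions = list(stochastic_policy[state])
    weights = list(stochastic_policy[state].values())
    return random.choices(actions, weights=weights)[0]

# In state "A" this usually returns "right", occasionally "up".
```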
The Value Function: Rating States and Actions
A value function estimates how good it is to be in a state (or take an action). “Good” means the expected cumulative reward from that point forward.
Strong chess players can look at a board and estimate who’s winning. This is a value function—rating positions based on expected outcome. A position with material advantage, good piece activity, and king safety has high value.
There are two flavors:
- The state-value function, V(s): the expected cumulative reward starting from state s
- The action-value function, Q(s, a): the expected cumulative reward starting from state s and taking action a
If you know Q(s, a) for all actions, you can derive a policy: just pick the action with the highest Q-value. This is the foundation of Q-learning, which we’ll study in depth later.
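Deriving a policy from Q-values is a one-liner per state. The Q-table below is invented for illustration:

```python
# Action-value estimates: (state, action) -> expected cumulative reward.
# These numbers are made up for illustration.
Q = {
    ("A", "right"): 10.0,
    ("A", "up"): 3.0,
    ("B", "right"): 1.0,
    ("B", "up"): 7.0,
}

def greedy_action(state):
    """Pick the action with the highest Q-value in this state."""
    candidates = [(a, q) for (s, a), q in Q.items() if s == state]
    return max(candidates, key=lambda pair: pair[1])[0]

print(greedy_action("A"))  # right
print(greedy_action("B"))  # up
```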
The Connection: Policy ↔ Value
Policies and values are deeply connected:
- Evaluate: given a policy, we can compute the value of each state under that policy.
- Improve: given value estimates, we can improve the policy by choosing actions that lead to higher-value states.
This evaluate-improve cycle is at the heart of many RL algorithms. We’ll see it again and again.
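The evaluate-improve cycle can be sketched as a tiny policy-iteration loop. The two-state MDP below, with its transitions and rewards, is entirely made up; the point is the alternation between evaluation and greedy improvement:

```python
# A minimal policy-iteration sketch on an invented MDP:
# states "A" and "B", a terminal "goal", illustrative rewards.
GAMMA = 0.9
# transitions[state][action] = (next_state, reward)
transitions = {
    "A": {"right": ("B", 0.0), "up": ("A", -1.0)},
    "B": {"right": ("goal", 10.0), "up": ("A", 0.0)},
}

def evaluate(policy, sweeps=100):
    """Estimate V(s) for a fixed policy by repeated sweeps."""
    V = {"A": 0.0, "B": 0.0, "goal": 0.0}
    for _ in range(sweeps):
        for s, a in policy.items():
            nxt, r = transitions[s][a]
            V[s] = r + GAMMA * V[nxt]
    return V

def improve(V):
    """Make the policy greedy with respect to the current values."""
    return {
        s: max(acts, key=lambda a: acts[a][1] + GAMMA * V[acts[a][0]])
        for s, acts in transitions.items()
    }

policy = {"A": "up", "B": "up"}  # start with a deliberately bad policy
for _ in range(3):               # the evaluate-improve loop
    V = evaluate(policy)
    policy = improve(V)

print(policy)  # {'A': 'right', 'B': 'right'}
```

After a few iterations the loop settles on the policy that heads to the goal, which is exactly the fixed point the evaluate-improve cycle seeks.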
- Value-based methods (like Q-learning): Learn values, derive policy from them
- Policy-based methods (like REINFORCE): Learn policy directly, skip values
- Actor-Critic methods: Learn both simultaneously
We’ll explore all three approaches in this book.