Foundations • Part 2 of 4

Rewards and Returns

Defining goals through reward signals

The Goal: Maximize Cumulative Reward

The agent’s objective isn’t to maximize immediate reward—it’s to maximize cumulative reward over time, also called the return.

📖Return

The return G_t is the total accumulated reward from time t onward. This is what the agent tries to maximize—not just the next reward, but all future rewards combined.

📌Chess: Why Immediate Reward Misleads

A move that captures a pawn might look good now (+1 point). But if it leads to losing your queen (-9 points), the cumulative reward is deeply negative.

RL agents must learn to delay gratification. Sometimes the best action sacrifices immediate reward to invest in larger future rewards.

Mathematical Details

Formally, the agent tries to maximize the expected return:

Simple Return (sum of all future rewards)
G_t = R_{t+1} + R_{t+2} + R_{t+3} + …
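As a minimal sketch, the simple return is just the sum of every reward the agent collects from time t onward (the function name and the tiny reward list are illustrative, not from the original):

```python
def simple_return(rewards):
    """G_t as the plain sum of future rewards R_{t+1}, R_{t+2}, ..."""
    return sum(rewards)

# Chess example from above: capture a pawn (+1), then lose the queen (-9)
print(simple_return([1, -9]))  # -8: cumulative reward is negative
```

This makes the chess point concrete: judged by immediate reward the capture looks good, but judged by the return it does not.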

Often we add discounting—valuing immediate rewards more than distant ones:

Discounted Return
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + …

where γ (gamma) is the discount factor, typically 0.9–0.999.

γ = 0 (completely myopic): only cares about immediate reward
γ = 0.99 (far-sighted): values future almost as much as present
γ = 1 (no discounting): all rewards weighted equally
ℹ️What About Exploration?

There’s another critical concept in RL: the exploration-exploitation tradeoff. Should the agent use what it knows (exploit) or try new things to learn more (explore)? This fundamental dilemma deserves its own section—we cover it next in Exploration vs Exploitation.