The Goal: Maximize Cumulative Reward
The agent’s objective isn’t to maximize immediate reward—it’s to maximize cumulative reward over time, also called the return.
The return, written G_t, is the total accumulated reward from time step t onward. This is what the agent tries to maximize—not just the next reward, but all future rewards combined.
A move that captures a pawn might look good now (+1 point). But if it leads to losing your queen (-9 points), the cumulative reward is deeply negative.
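The chess example in numbers (a minimal sketch; the +1/-9 values follow the conventional chess piece-counting scale used above):

```python
# Reward sequence for the pawn-capture line:
# +1 for capturing the pawn, then -9 for losing the queen.
rewards = [1, -9]

cumulative_reward = sum(rewards)
print(cumulative_reward)  # -8: deeply negative despite the tempting first move
```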
RL agents must learn to delay gratification. Sometimes the best immediate action is to invest in future rewards.
Formally, the agent tries to maximize the expected return:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T
Often we add discounting—valuing immediate rewards more than distant ones:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}

where γ (gamma) is the discount factor, typically 0.9–0.999.
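For a finite episode, the discounted return can be computed by folding the reward sequence backwards. A minimal sketch (the function name and example reward sequences are illustrative, not from any particular library):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    by accumulating from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # each step back multiplies future returns by gamma
    return g

# With gamma near 1, distant rewards count almost fully;
# with smaller gamma, they are heavily discounted.
print(discounted_return([1, -9], gamma=0.9))       # ≈ 1 + 0.9 * (-9) = -7.1
print(discounted_return([0, 0, 0, 10], gamma=0.5)) # 0.5**3 * 10 = 1.25
```

With gamma = 1 this reduces to the undiscounted sum, so the chess example above (+1 then -9) comes out to -8 either way you compute it.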
There’s another critical concept in RL: the exploration-exploitation tradeoff. Should the agent use what it knows (exploit) or try new things to learn more (explore)? This fundamental dilemma deserves its own section—we cover it next in Exploration vs Exploitation.