The Goal: Maximize Cumulative Reward
The agent’s objective isn’t to maximize immediate reward—it’s to maximize cumulative reward over time, also called the return.
The return, written G_t, is the total accumulated reward from time step t onward. This is what the agent tries to maximize—not just the next reward, but all future rewards combined.
A move that captures a pawn might look good now (+1 point). But if it leads to losing your queen (-9 points), the cumulative reward is deeply negative.
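The chess example in numbers (a minimal sketch; the +1/-9 values follow the conventional chess piece-counting scale used above):

```python
# Reward sequence for the pawn-capture line:
# +1 for capturing the pawn, then -9 for losing the queen.
rewards = [1, -9]

cumulative_reward = sum(rewards)
print(cumulative_reward)  # -8: deeply negative despite the tempting first move
```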
RL agents must learn to delay gratification. Sometimes the best immediate action is to invest in future rewards.
Formally, the agent tries to maximize the expected return:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T
Often we add discounting—valuing immediate rewards more than distant ones:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}

where γ (gamma) is the discount factor, typically 0.9–0.999.
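For a finite episode, the discounted return can be computed by folding the reward sequence backwards. A minimal sketch (the function name and example reward sequences are illustrative, not from any particular library):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    by accumulating from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # each step back multiplies future returns by gamma
    return g

# With gamma near 1, distant rewards count almost fully;
# with smaller gamma, they are heavily discounted.
print(discounted_return([1, -9], gamma=0.9))       # ≈ 1 + 0.9 * (-9) = -7.1
print(discounted_return([0, 0, 0, 10], gamma=0.5)) # 0.5**3 * 10 = 1.25
```

With gamma = 1 this reduces to the undiscounted sum, so the chess example above (+1 then -9) comes out to -8 either way you compute it.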
There’s another critical concept in RL: the exploration-exploitation tradeoff. Should the agent use what it knows (exploit) or try new things to learn more (explore)? This fundamental dilemma deserves its own section—we cover it next in Exploration vs Exploitation.