Why Context Matters
In multi-armed bandits, we searched for a single best arm—one action that’s optimal for everyone. But many real problems have a crucial missing ingredient: context. Different situations call for different decisions.
A news website shouldn’t show the same headline to everyone. A sports fan wants game scores; a tech enthusiast wants startup news. The best arm depends on who’s asking.
Welcome to contextual bandits—where we learn to personalize decisions.
The Limitation of One-Size-Fits-All
Consider running a news website. You have 5 headline slots and need to choose which article to feature. In the standard bandit framework:
- Run the site for a week
- Track which headlines get the most clicks
- Converge on the “best” headline
- Show it to everyone
But this ignores a massive opportunity. Different users have different interests:
| User Type | Best Article Category | Click Rate on Sports | Click Rate on Tech |
|---|---|---|---|
| Sports fan | Sports | 80% | 10% |
| Tech enthusiast | Tech | 20% | 70% |
| General reader | Either | 40% | 40% |
A standard bandit might conclude “Sports headlines get 50% clicks overall, Tech gets 40%—show Sports to everyone.”
But if you could identify the user type first, you could show Sports to sports fans (80% clicks) and Tech to tech enthusiasts (70% clicks). That’s much better than 50% for everyone!
Let’s quantify the opportunity cost:
Standard Bandit (one arm for all):
- Population: 50% sports fans, 50% tech enthusiasts
- Best overall: Sports (50% sports fans x 80% + 50% tech enthusiasts x 20% = 50% average CTR)
- Show Sports to everyone: 50% CTR
Contextual Bandit (personalized):
- Sports fans see Sports: 80% CTR
- Tech enthusiasts see Tech: 70% CTR
- Overall: 50% x 80% + 50% x 70% = 75% CTR
By using context, we improved CTR from 50% to 75%—a 50% relative improvement! This is the power of personalization.
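The arithmetic above is easy to sanity-check directly (values taken from the table):

```python
# Population mix from the example: half sports fans, half tech enthusiasts.
p_sports_fan, p_tech_fan = 0.5, 0.5

# Standard bandit: show Sports (the best single arm) to everyone.
standard_ctr = p_sports_fan * 0.80 + p_tech_fan * 0.20

# Contextual bandit: match each user type to its best category.
contextual_ctr = p_sports_fan * 0.80 + p_tech_fan * 0.70

relative_gain = (contextual_ctr - standard_ctr) / standard_ctr
print(f"Standard: {standard_ctr:.0%}, Contextual: {contextual_ctr:.0%}, "
      f"Relative gain: {relative_gain:.0%}")
```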
Introducing Context
Context: a vector of features that describes the current situation. Context is observed before making a decision and helps predict which action will be best.
Context is any information available before you choose an action:
User context (who is this?):
- Demographics (age, location)
- Past behavior (browsing history, purchase history)
- Device type (mobile, desktop)
- Time of visit
Situational context (what’s happening?):
- Time of day, day of week
- Current weather
- Trending topics
- Recent site activity
Item context (what are the options?):
- Article category, length, recency
- Product features, price, ratings
- Ad creative details
The key insight: context helps us predict how good each action will be for this specific situation.
The Contextual Bandit Framework
A contextual bandit problem consists of:
- Context space $\mathcal{X}$: The set of possible contexts
- Action space $\mathcal{A} = \{1, \dots, K\}$: The set of arms/actions
- Reward function: For each arm $a$, there's a function $f_a : \mathcal{X} \to \mathbb{R}$ that determines expected reward given context $x$
At each round $t = 1, \dots, T$:
- Environment generates context $x_t \in \mathcal{X}$
- Agent observes $x_t$ and selects action $a_t \in \mathcal{A}$
- Agent receives reward $r_t = f_{a_t}(x_t) + \epsilon_t$, where $\epsilon_t$ is noise
The goal is to maximize cumulative reward:
$$\sum_{t=1}^{T} r_t$$
Or equivalently, minimize regret relative to always picking the best arm for each context:
$$R(T) = \sum_{t=1}^{T} \left[ \max_{a} f_a(x_t) - f_{a_t}(x_t) \right]$$
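As a concrete sketch of this regret definition, here is how it could be computed when the true expected rewards are known, assuming a linear reward model with toy parameters (the function name and values are our own, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def regret(contexts, actions, theta):
    """Cumulative regret: best expected reward minus chosen expected reward,
    summed over rounds, under a linear model r = theta[a] @ x."""
    total = 0.0
    for x, a in zip(contexts, actions):
        expected = theta @ x              # expected reward of every arm in context x
        total += expected.max() - expected[a]
    return total

theta = rng.standard_normal((3, 4))       # 3 arms, 4 features (toy values)
contexts = rng.standard_normal((100, 4))
random_actions = rng.integers(0, 3, size=100)
optimal_actions = np.argmax(contexts @ theta.T, axis=1)

print(regret(contexts, random_actions, theta))   # positive for a random policy
print(regret(contexts, optimal_actions, theta))  # zero for the optimal policy
```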
The key difference from standard bandits:
- Standard bandit: Find the single best arm
- Contextual bandit: Find the best arm for each context
We're learning a policy $\pi : \mathcal{X} \to \mathcal{A}$ that maps contexts to actions, not just picking a single action.
Linear Reward Models
How do we model the relationship between context and reward? The simplest approach: assume rewards are linear in the features.
$$f_a(x) = \theta_a^\top x$$
where $\theta_a$ is a weight vector for arm $a$.
For example, suppose the context is $x = [\text{age}/100, \text{is\_mobile}, \text{hour}/24]$ (illustrative encodings):
- Sports arm might have $\theta_{\text{sports}} = [0.8, 0.4, 0.2]$
- Tech arm might have $\theta_{\text{tech}} = [1.2, 0.2, 0.1]$
Then a 25-year-old on mobile at noon has $x = [0.25, 1, 0.5]$, giving:
- Sports expected reward: $0.8 \cdot 0.25 + 0.4 \cdot 1 + 0.2 \cdot 0.5 = 0.70$
- Tech expected reward: $1.2 \cdot 0.25 + 0.2 \cdot 1 + 0.1 \cdot 0.5 = 0.55$
Sports is predicted to be better for this user.
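These dot-product predictions can be checked numerically (the feature encodings and weights here are illustrative assumptions, not values from a real system):

```python
import numpy as np

# Assumed encodings: [age / 100, is_mobile, hour / 24]
x = np.array([0.25, 1.0, 0.5])             # 25-year-old, on mobile, at noon

theta_sports = np.array([0.8, 0.4, 0.2])   # assumed weights for the Sports arm
theta_tech = np.array([1.2, 0.2, 0.1])     # assumed weights for the Tech arm

print(theta_sports @ x)  # 0.8*0.25 + 0.4*1.0 + 0.2*0.5 = 0.70
print(theta_tech @ x)    # 1.2*0.25 + 0.2*1.0 + 0.1*0.5 = 0.55
```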
Formally, we assume:
$$\mathbb{E}[r_t \mid x_t, a_t = a] = \theta_a^\top x_t$$
This is the linear reward model. Each arm $a$ has its own parameter vector $\theta_a$, and we learn these from data.
Given observations $(x_t, a_t, r_t)$, we can estimate $\theta_a$ using ridge regression on the data where arm $a$ was selected:
$$\hat{\theta}_a = (X_a^\top X_a + \lambda I)^{-1} X_a^\top y_a$$
where $X_a$ stacks the contexts from rounds in which arm $a$ was chosen and $y_a$ the corresponding rewards.
The regularization term $\lambda I$ prevents overfitting when we have few samples.
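Here is a minimal sketch of the per-arm ridge estimate on synthetic data (the helper name `ridge_estimate` and the parameter values are our own):

```python
import numpy as np

def ridge_estimate(X, y, lam=1.0):
    """Ridge estimate: solve (X^T X + lam*I) theta = X^T y
    using the contexts X and rewards y observed when this arm was pulled."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(1)
theta_true = np.array([0.5, -0.3, 0.8])              # unknown in practice
X = rng.standard_normal((200, 3))                    # contexts for this arm
y = X @ theta_true + 0.1 * rng.standard_normal(200)  # noisy linear rewards

theta_hat = ridge_estimate(X, y)
print(theta_hat)  # close to theta_true with 200 samples and low noise
```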
```python
import numpy as np

class ContextualBandit:
    """
    A contextual bandit environment with linear rewards.

    Each arm has a true weight vector. Rewards are linear
    in features plus Gaussian noise.
    """

    def __init__(self, n_arms: int, n_features: int, noise_std: float = 0.1):
        self.n_arms = n_arms
        self.n_features = n_features
        self.noise_std = noise_std
        # True weight vectors (unknown to the agent)
        self.theta = np.random.randn(n_arms, n_features)
        # Normalize so rewards are roughly in [0, 1]
        self.theta = self.theta / (n_features ** 0.5)

    def get_context(self) -> np.ndarray:
        """Generate a random context vector."""
        return np.random.randn(self.n_features)

    def get_reward(self, context: np.ndarray, arm: int) -> float:
        """Return noisy reward for selecting arm given context."""
        expected = context @ self.theta[arm]
        return expected + np.random.randn() * self.noise_std

    def get_optimal_arm(self, context: np.ndarray) -> int:
        """Return the best arm for this context (for evaluation)."""
        expected_rewards = context @ self.theta.T
        return int(np.argmax(expected_rewards))

    def get_optimal_reward(self, context: np.ndarray) -> float:
        """Return the expected reward of the optimal arm."""
        expected_rewards = context @ self.theta.T
        return float(np.max(expected_rewards))

# Example usage
bandit = ContextualBandit(n_arms=5, n_features=10)

# Simulate some interactions
for _ in range(10):
    context = bandit.get_context()
    optimal = bandit.get_optimal_arm(context)
    rewards = [bandit.get_reward(context, a) for a in range(5)]
    print(f"Optimal arm: {optimal}, Rewards: {[f'{r:.2f}' for r in rewards]}")
```

Context is NOT State
Don’t Confuse Context with State
A crucial distinction:
- Context is observed before the action and doesn’t change due to your action
- State (in full RL) changes as a result of your actions
In contextual bandits:
- A user arrives with context (their profile)
- You show them an article
- They click or don’t
- The user leaves—your action didn’t change anything about future users
In full RL (MDPs):
- You're in state $s_t$
- You take action $a_t$
- You transition to a new state $s_{t+1}$ that depends on your action
- Future rewards depend on where your action took you
Contextual bandits are “one-shot” decisions. If your recommendations affect future user preferences (filter bubbles), you need full RL.
Here’s the key test: Does your action affect the next context you see?
Contextual bandit (action doesn’t affect future):
- Showing an ad to a user
- Recommending a movie to a random visitor
- Selecting a news headline
- Choosing a treatment for independent patients
Full RL (action affects future):
- Playing a game (moves change the board)
- Robot navigation (position changes)
- Dialogue systems (conversation evolves)
- Long-term recommendations (preferences shift)
Why Not Separate Bandits?
You might wonder: “Why not just run a separate bandit for each possible context?”
The problem is context is usually continuous or high-dimensional. If your context is a 100-dimensional user embedding, you’ll never see the exact same context twice!
We need to generalize across contexts. If two users have similar features, their preferences are probably similar. This is where function approximation (like linear models) comes in—we learn smooth relationships between context and reward.
Suppose you have:
- 1 million possible users (contexts)
- 10 arms
- 10,000 total impressions
Separate bandits approach:
- Each user-arm combination needs its own estimate
- 10 million parameters to learn
- Most users seen only once—no learning possible!
Contextual bandit with 50-dimensional features:
- 10 arms x 50 features = 500 parameters total
- Learning generalizes across similar users
- Even users seen once benefit from information about similar users
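The parameter counts above are easy to verify:

```python
n_users, n_arms, n_features = 1_000_000, 10, 50

# Separate bandits: one estimate per (user, arm) pair.
separate_params = n_users * n_arms
# Linear contextual bandit: one weight vector per arm.
contextual_params = n_arms * n_features

print(separate_params)    # 10,000,000 parameters
print(contextual_params)  # 500 parameters
```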
The linear model lets us transfer knowledge: “Users who like sports also tend to like fitness articles” is captured in the learned weights.
Exploration in Context Space
Exploration becomes more nuanced with context. In standard bandits, we asked: “Should I try this arm more?”
In contextual bandits, we ask: “Should I try this arm for contexts like this one?”
An arm might be well-understood for some contexts but not others. If we’ve never seen a teenager on mobile asking for news, we’re uncertain about which headline they’d prefer—even if we’ve learned a lot about other demographics.
This leads to algorithms like LinUCB that explore based on uncertainty in predictions for the current context, not just global arm uncertainty.
In the linear model, our uncertainty about the expected reward depends on:
- How well we've estimated $\theta_a$: More samples = lower uncertainty
- How similar $x$ is to past contexts: If $x$ is unlike anything we've seen, we're uncertain
The covariance matrix $A_a^{-1}$ captures both. The uncertainty in our prediction for context $x$ scales as:
$$\sqrt{x^\top A_a^{-1} x}$$
where $A_a = \lambda I + \sum_{s \le t : a_s = a} x_s x_s^\top$.
High variance means high uncertainty—and that’s where we should explore.
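A small numerical sketch of this idea, with toy data and an assumed helper name (`uncertainty`): when past contexts are concentrated in one region, familiar directions of context space have low uncertainty while unexplored directions have high uncertainty.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam = 5, 1.0

# Contexts observed when this arm was pulled, clustered near the all-ones vector.
past_contexts = 1.0 + 0.1 * rng.standard_normal((100, d))
A = lam * np.eye(d) + past_contexts.T @ past_contexts

def uncertainty(x, A):
    """Confidence-interval width for context x: sqrt(x^T A^{-1} x)."""
    return float(np.sqrt(x @ np.linalg.solve(A, x)))

familiar = np.ones(d)                          # similar to past contexts
novel = np.array([1.0, -1.0, 1.0, -1.0, 1.0])  # direction rarely seen before

print(uncertainty(familiar, A))  # small: many past contexts point this way
print(uncertainty(novel, A))     # large: this direction is unexplored
```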
Summary: From Bandits to Contextual Bandits
| Aspect | Multi-Armed Bandits | Contextual Bandits |
|---|---|---|
| Goal | Find one best arm | Find best arm per context |
| Input | Nothing (same each round) | Context vector |
| Output | Action | Policy |
| Reward model | $\mu_a$ (constant per arm) | $f_a(x) = \theta_a^\top x$ (function of context) |
| Exploration | Per-arm uncertainty | Per-context uncertainty |
| Generalization | None needed | Across similar contexts |
The key insight: contextual bandits learn to personalize decisions by using features that predict which action will be best. This is the foundation of modern recommendation systems, ad selection, and many other applications.
In the next section, we’ll see LinUCB—the workhorse algorithm for contextual bandits that elegantly extends UCB to handle context.
Contextual bandits sit between simple bandits and full RL. They’re more powerful than bandits (they personalize) but simpler than RL (no sequential dependencies). This sweet spot makes them extremely practical for real-world applications.