Contextual Bandits
What You'll Learn
- Understand how contextual bandits extend the basic bandit problem
- Learn the role of context features in action selection
- Implement linear contextual bandits with ε-greedy exploration
- Recognize real-world applications (recommendations, ads, personalization)
- See the bridge between bandits and full reinforcement learning
When the Best Action Depends on Who's Asking
A movie recommendation system faces a unique challenge: the best movie to recommend depends on who's asking. A thriller fan and a comedy lover shouldn't get the same suggestion. A news website has the same problem: the article that engages one reader might bore another.
This is a contextual bandit: the right action depends on context. And it's everywhere: personalized ads, news feeds, treatment selection, product recommendations. The same exploration-exploitation tradeoff from multi-armed bandits applies, but now we must learn which action is best for which situation.
Think back to our multi-armed bandit with slot machines. Every time you pulled a lever, you got a reward from the same distribution. The best arm was the same for everyone.
Now imagine the casino has magic slot machines that pay out differently based on who's playing. A machine that loves red shirts pays more if you're wearing red. Another pays more to tall people. Suddenly, finding "the best arm" isn't enough; you need to find the best arm for you.
This is the contextual bandit problem:
- You observe context (features about the situation)
- You choose an action from a fixed set
- You receive a reward that depends on both context and action
- Your goal: learn which action is best for each context
The key difference from regular bandits: we're learning a policy, a mapping from context to action, not just finding a single best action.
The Contextual Bandit Problem
Let's formalize what we're solving.
At each round $t$:
- The environment presents a context $x_t$ (a vector of features)
- The agent chooses an action $a_t$ from a set of $K$ actions
- The agent receives a reward $r_t$ that depends on both $x_t$ and $a_t$
- The agent updates its knowledge to make better decisions
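To make the loop concrete, here is a minimal, self-contained sketch of this protocol. The linear reward model and the purely random action choice are illustrative placeholders; proper action-selection algorithms follow later in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rounds, n_actions, n_features = 5, 3, 4
true_weights = rng.normal(size=(n_actions, n_features))  # hidden from the agent

for t in range(n_rounds):
    context = rng.normal(size=n_features)    # 1. observe context x_t
    action = rng.integers(n_actions)         # 2. choose an action a_t (randomly, for now)
    reward = true_weights[action] @ context  # 3. reward depends on both x_t and a_t
    # 4. a real agent would update its reward estimates here
    print(f"round {t}: action={action}, reward={reward:.2f}")
```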
What is context? It's any information about the current situation that might affect which action is best:
- For movie recommendations: user age, viewing history, time of day
- For ad placement: user demographics, page content, device type
- For medical treatment: patient symptoms, medical history, genetics
What are actions? The choices we can make:
- For recommendations: which movie/article/product to show
- For ads: which ad to display
- For treatment: which medication to prescribe
What is reward? A signal of how good our choice was:
- For recommendations: did the user click? Watch the whole thing? Rate it highly?
- For ads: did the user click? Convert to a sale?
- For treatment: did the patient improve?
Mathematical Details
Formally, a contextual bandit is defined by:
- Context space $\mathcal{X}$: The set of possible contexts (often $\mathbb{R}^d$ for $d$ features)
- Action space $\mathcal{A} = \{1, \dots, K\}$: A finite set of $K$ actions
- Reward function: For each context-action pair $(x, a)$, there's a reward distribution with mean $\mu(x, a)$
The agent's goal is to learn a policy $\pi: \mathcal{X} \to \mathcal{A}$ that maximizes expected reward:
$$\max_{\pi} \; \mathbb{E}_{x}\big[\mu(x, \pi(x))\big]$$
Unlike supervised learning, we only observe the reward for the action we took, not what would have happened if we'd chosen differently. This is bandit feedback.
Why Not Just Use Supervised Learning?
A tempting thought: "I have context and want to predict the best action. That's classification!" But there's a crucial difference.
In supervised learning, you have labeled data: input-output pairs where someone tells you the correct answer. You could train a classifier on (user_features, best_movie) pairs.
In contextual bandits, you only learn the reward for the action you took. If you recommended Action Hero 3 and the user didn't click, you don't know if they would have clicked on Romantic Comedy 2 or Documentary About Penguins.
This is called partial feedback or bandit feedback:
- Supervised: You see the label for the correct answer
- Bandits: You see the reward for your answer only
This matters because:
- You can't just collect a dataset; you need to interact to learn
- Exploration is essential; you must try actions to learn about them
- The exploration-exploitation tradeoff from regular bandits still applies
A common mistake is treating contextual bandits as classification. If you train a classifier on historical data (context → action taken), you'll learn to imitate past behavior, including past mistakes. You won't discover actions that were never tried. This is called the exploitation trap.
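The difference in what gets recorded is easy to see in data. The toy log entries below are hypothetical, but they show why bandit data cannot simply be fed to a classifier: the counterfactual rewards are missing.

```python
# Supervised learning: each example comes with the correct label.
supervised_example = {
    "features": [0.3, 0.9, 0.2, 0.1],
    "label": "tech_article",          # someone tells us the right answer
}

# Bandit feedback: we only see the reward for the action we actually took.
bandit_log_entry = {
    "features": [0.3, 0.9, 0.2, 0.1],
    "action_taken": "sports_article",
    "reward": 0.0,                    # the user did not click
    # reward for "tech_article"? unknown; that article was never shown
}
```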
Learning from Context: Linear Models
How do we estimate which action is best for a given context? We need a model that predicts rewards based on context.
The simplest approach: assume rewards are a linear function of context features.
For each action $a$, we learn a weight vector $\theta_a$ such that the predicted reward is $\hat{r}(x, a) = \theta_a^\top x$.
If context has features [age=25, watches_action=0.8, watches_comedy=0.3], and we're predicting the reward for showing an action movie, our model computes the weighted sum $\theta_1 \cdot 25 + \theta_2 \cdot 0.8 + \theta_3 \cdot 0.3$, where $(\theta_1, \theta_2, \theta_3)$ is that movie's weight vector.
Different actions have different weight vectors, so the same context can yield different predicted rewards for different actions.
Mathematical Details
For each action $a$, we model the expected reward as:
$$\hat{r}(x, a) = \theta_a^\top x$$
where:
- $x$ is the context feature vector
- $\theta_a$ is the learned weight vector for action $a$
- $\hat{r}(x, a)$ is our estimate of the expected reward
To choose an action, we use ε-greedy. With probability $1 - \varepsilon$, select $a_t = \arg\max_a \theta_a^\top x_t$. With probability $\varepsilon$, select a random action.
To update our model, we use online least squares. After observing $(x_t, a_t, r_t)$:
$$\theta_{a_t} \leftarrow \theta_{a_t} + \alpha\,(r_t - \theta_{a_t}^\top x_t)\,x_t$$
This nudges $\theta_{a_t}$ in the direction that would have predicted the observed reward better.
A Concrete Example: News Article Recommendation
You run a news website. You have 5 articles to recommend. Each user has features:
- age: normalized between 0 and 1
- tech_interest: how much they read tech articles (0-1)
- politics_interest: how much they read politics (0-1)
- sports_interest: how much they read sports (0-1)
Your articles are:
- "New iPhone Released" (tech)
- "Election Updates" (politics)
- "Championship Game Results" (sports)
- "Celebrity Gossip" (entertainment)
- "Stock Market Analysis" (finance)
A user arrives with features [age=0.3, tech=0.9, politics=0.2, sports=0.1]. Your linear models predict:
- Article 1: a high predicted reward (the high tech interest matches)
- Article 2: a low predicted reward (low politics interest)
- Article 3: a low predicted reward (low sports interest)
- …
With ε-greedy, you'd usually show Article 1, but sometimes explore others to refine your model.
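To see where predictions like these come from, here is a tiny worked example of the dot products. The weight vectors below are made up for illustration, not learned from data.

```python
import numpy as np

# Feature order: [age, tech_interest, politics_interest, sports_interest]
user = np.array([0.3, 0.9, 0.2, 0.1])

# Hypothetical learned weights for two of the articles
theta_tech     = np.array([0.1, 0.9, 0.0, 0.0])  # "New iPhone Released"
theta_politics = np.array([0.2, 0.0, 0.8, 0.1])  # "Election Updates"

print("Tech article:    ", theta_tech @ user)      # 0.03 + 0.81        = 0.84
print("Politics article:", theta_politics @ user)  # 0.06 + 0.16 + 0.01 = 0.23
```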
Implementation: Linear Contextual Bandit
Let's implement a complete linear contextual bandit.
Implementation
import numpy as np

class LinearContextualBandit:
    """
    Linear contextual bandit with ε-greedy exploration.

    Each action has its own linear model predicting rewards from context.
    """

    def __init__(self, n_actions, n_features, epsilon=0.1, alpha=0.1):
        self.n_actions = n_actions
        self.n_features = n_features
        self.epsilon = epsilon
        self.alpha = alpha
        # One weight vector per action, initialized to zeros
        self.theta = np.zeros((n_actions, n_features))

    def predict(self, context):
        """Predict expected reward for each action given context."""
        # Shape: (n_actions,) - one prediction per action
        return self.theta @ context

    def select_action(self, context):
        """Select action using ε-greedy policy."""
        if np.random.random() < self.epsilon:
            # Explore: random action
            return np.random.randint(self.n_actions)
        else:
            # Exploit: best predicted action
            predictions = self.predict(context)
            return np.argmax(predictions)

    def update(self, context, action, reward):
        """Update model for the chosen action based on observed reward."""
        # Prediction error for this action
        prediction = self.theta[action] @ context
        error = reward - prediction
        # Gradient descent update
        self.theta[action] += self.alpha * error * context

Simulating a Personalization Problem
Implementation
Let's test our bandit on a simulated recommendation problem:

class PersonalizationEnvironment:
    """
    Simulated environment where optimal action depends on user features.

    True reward = theta_true[action] @ context + noise
    """

    def __init__(self, n_actions, n_features, noise_std=0.1):
        self.n_actions = n_actions
        self.n_features = n_features
        self.noise_std = noise_std
        # Hidden true weights (the bandit tries to learn these)
        self.theta_true = np.random.randn(n_actions, n_features)

    def get_context(self):
        """Generate a random context (user features)."""
        return np.random.randn(self.n_features)

    def get_reward(self, context, action):
        """Return noisy reward for action in context."""
        expected_reward = self.theta_true[action] @ context
        return expected_reward + np.random.randn() * self.noise_std

    def optimal_action(self, context):
        """Return the truly best action for this context."""
        expected_rewards = self.theta_true @ context
        return np.argmax(expected_rewards)

def run_experiment(n_rounds=1000, n_actions=5, n_features=4):
    """Run a contextual bandit experiment."""
    env = PersonalizationEnvironment(n_actions, n_features)
    bandit = LinearContextualBandit(n_actions, n_features, epsilon=0.1)

    rewards = []
    optimal_actions = []

    for t in range(n_rounds):
        # Get context for this round
        context = env.get_context()
        # Select action
        action = bandit.select_action(context)
        # Get reward
        reward = env.get_reward(context, action)
        # Update model
        bandit.update(context, action, reward)
        # Track metrics
        rewards.append(reward)
        optimal_actions.append(action == env.optimal_action(context))

    return rewards, optimal_actions

# Run and analyze
rewards, optimal = run_experiment()
print(f"Final 100 rounds - Avg reward: {np.mean(rewards[-100:]):.3f}")
print(f"Final 100 rounds - Optimal action rate: {np.mean(optimal[-100:]):.1%}")

Watch what happens as the experiment progresses:
- Early on, the bandit explores randomly and makes mistakes
- As it gathers data, it learns which action works best for which contexts
- Eventually, it converges to selecting near-optimal actions
The exploration parameter ε controls this tradeoff. High ε means more exploration but lower performance. Low ε means faster exploitation but risk of missing good actions.
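One way to see this tradeoff is to sweep ε and compare late-stage performance. The sketch below assumes the `PersonalizationEnvironment` and `LinearContextualBandit` classes defined above are in scope; the specific ε values are arbitrary.

```python
# Sweep the exploration rate: high ε keeps paying an exploration tax forever,
# while very low ε risks locking in on a poorly estimated action.
for epsilon in [0.01, 0.1, 0.3]:
    np.random.seed(0)  # same hidden weights for each setting, for a fairer comparison
    env = PersonalizationEnvironment(n_actions=5, n_features=4)
    bandit = LinearContextualBandit(n_actions=5, n_features=4, epsilon=epsilon)
    rewards = []
    for t in range(2000):
        context = env.get_context()
        action = bandit.select_action(context)
        reward = env.get_reward(context, action)
        bandit.update(context, action, reward)
        rewards.append(reward)
    print(f"epsilon={epsilon}: avg reward over last 500 rounds = {np.mean(rewards[-500:]):.3f}")
```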
The Value of Context: Personalization Wins
Why bother with context? Let's compare contextual bandits against regular bandits that ignore context.
Consider two strategies:
- Context-blind bandit: Ignores user features, learns one best action for everyone
- Contextual bandit: Uses user features, learns different best actions for different users
When different users genuinely prefer different things, the contextual bandit dominates. It can recommend action movies to action fans and comedies to comedy fans, while the context-blind bandit must pick one genre for everyone.
The advantage is largest when:
- User preferences vary widely
- Context features are predictive of preferences
- The action space is diverse
Implementation
def compare_contextual_vs_blind(n_rounds=2000):
    """Compare contextual bandit against context-blind bandit."""
    n_actions, n_features = 5, 4

    # Same environment for both
    env = PersonalizationEnvironment(n_actions, n_features)

    # Contextual bandit
    contextual = LinearContextualBandit(n_actions, n_features)

    # Context-blind bandit (standard epsilon-greedy)
    blind_estimates = np.zeros(n_actions)
    blind_counts = np.zeros(n_actions) + 1e-6  # Avoid division by zero

    contextual_rewards = []
    blind_rewards = []

    for t in range(n_rounds):
        context = env.get_context()

        # Contextual bandit decision
        ctx_action = contextual.select_action(context)
        ctx_reward = env.get_reward(context, ctx_action)
        contextual.update(context, ctx_action, ctx_reward)
        contextual_rewards.append(ctx_reward)

        # Context-blind bandit decision (ε-greedy ignoring context)
        if np.random.random() < 0.1:
            blind_action = np.random.randint(n_actions)
        else:
            blind_action = np.argmax(blind_estimates)
        blind_reward = env.get_reward(context, blind_action)

        # Update context-blind estimates (average reward per action)
        blind_counts[blind_action] += 1
        blind_estimates[blind_action] += (
            (blind_reward - blind_estimates[blind_action]) / blind_counts[blind_action]
        )
        blind_rewards.append(blind_reward)

    return contextual_rewards, blind_rewards

# Compare performance
ctx_rewards, blind_rewards = compare_contextual_vs_blind()

# Average reward over the last 500 rounds
print(f"Contextual bandit avg reward: {np.mean(ctx_rewards[-500:]):.3f}")
print(f"Context-blind bandit avg reward: {np.mean(blind_rewards[-500:]):.3f}")

LinUCB: Smarter Exploration
Our ε-greedy approach explores randomly. But can we be smarter, exploring where we're most uncertain?
Recall UCB from multi-armed bandits: it added an "uncertainty bonus" to each action, favoring actions we're less sure about.
LinUCB extends this idea to contextual bandits. Instead of just picking the action with the highest predicted reward, it picks the action with the highest upper confidence bound:
$$\text{UCB}(x_t, a) = \hat{\theta}_a^\top x_t + \alpha \sqrt{x_t^\top A_a^{-1} x_t}$$
(predicted reward plus an uncertainty bonus; the symbols are defined in the Mathematical Details below).
The uncertainty bonus is larger when:
- Weâve seen fewer examples of this action
- The current context is unlike contexts weâve seen before for this action
This means LinUCB automatically explores in unfamiliar situations while exploiting in familiar ones.
Mathematical Details
LinUCB maintains, for each action $a$:
- $A_a$: A matrix tracking the "information" gathered about action $a$ (initialized to the identity)
- $b_a$: A vector tracking the rewards we've observed
The weight estimate is:
$$\hat{\theta}_a = A_a^{-1} b_a$$
The uncertainty for context $x$ and action $a$ is:
$$\alpha \sqrt{x^\top A_a^{-1} x}$$
The update after observing $(x_t, a_t, r_t)$:
$$A_{a_t} \leftarrow A_{a_t} + x_t x_t^\top, \qquad b_{a_t} \leftarrow b_{a_t} + r_t\, x_t$$
The parameter $\alpha$ controls how much we favor exploration. Higher $\alpha$ means more exploration.
Implementation
class LinUCB:
    """
    LinUCB algorithm for contextual bandits.

    Uses upper confidence bounds for principled exploration.
    """

    def __init__(self, n_actions, n_features, alpha=1.0):
        self.n_actions = n_actions
        self.n_features = n_features
        self.alpha = alpha
        # For each action: A matrix and b vector
        self.A = [np.eye(n_features) for _ in range(n_actions)]
        self.b = [np.zeros(n_features) for _ in range(n_actions)]

    def select_action(self, context):
        """Select action with highest UCB."""
        ucbs = []
        for a in range(self.n_actions):
            # Compute theta estimate
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            # Predicted reward
            pred = theta @ context
            # Uncertainty bonus
            uncertainty = self.alpha * np.sqrt(context @ A_inv @ context)
            ucbs.append(pred + uncertainty)
        return np.argmax(ucbs)

    def update(self, context, action, reward):
        """Update model for chosen action."""
        self.A[action] += np.outer(context, context)
        self.b[action] += reward * context

LinUCB typically outperforms ε-greedy, especially early in learning when uncertainty is high. However, it's more computationally expensive due to matrix inversions. For large feature spaces, consider using Sherman-Morrison updates or diagonal approximations.
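As a sketch of that optimization: instead of storing $A_a$ and inverting it on every selection, store $A_a^{-1}$ directly and refresh it with the Sherman-Morrison formula after each update. The class below is illustrative (the name `LinUCBSherman` is introduced here) and assumes the same interface as the `LinUCB` class above.

```python
import numpy as np

class LinUCBSherman:
    """LinUCB variant that maintains A^{-1} incrementally via Sherman-Morrison,
    avoiding a full matrix inversion for every action on every round."""

    def __init__(self, n_actions, n_features, alpha=1.0):
        self.alpha = alpha
        self.A_inv = [np.eye(n_features) for _ in range(n_actions)]
        self.b = [np.zeros(n_features) for _ in range(n_actions)]

    def select_action(self, context):
        # Same UCB rule as before, but A^{-1} is already available
        ucbs = [
            (A_inv @ b) @ context + self.alpha * np.sqrt(context @ A_inv @ context)
            for A_inv, b in zip(self.A_inv, self.b)
        ]
        return int(np.argmax(ucbs))

    def update(self, context, action, reward):
        A_inv = self.A_inv[action]
        Ax = A_inv @ context
        # (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x)
        self.A_inv[action] = A_inv - np.outer(Ax, Ax) / (1.0 + context @ Ax)
        self.b[action] += reward * context
```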
Real-World Applications
Contextual bandits power many real systems. Here are some prominent examples.
Personalized Recommendations
Netflix, Spotify, YouTube: These platforms face a contextual bandit problem every time you open the app.
- Context: Your viewing history, time of day, device, current mood (inferred)
- Actions: Which content to show in the top slot
- Reward: Did you watch/listen? For how long? Did you rate it highly?
They can't show you everything, so they must balance:
- Exploitation: Show content similar to what you've liked
- Exploration: Try new genres to discover hidden preferences
Online Advertising
Google, Facebook, Amazon: Ad placement is the canonical contextual bandit application.
- Context: User demographics, browsing history, page content
- Actions: Which ad to display
- Reward: Click-through, conversion, revenue
The stakes are high: better contextual bandits mean billions in additional revenue. Companies invest heavily in this area.
Clinical Trials and Treatment Selection
Medical applications: Which treatment works best for which patient?
- Context: Patient features (age, symptoms, genetics, history)
- Actions: Available treatments
- Reward: Treatment outcome (improvement, side effects)
This is sometimes called precision medicine. Instead of finding the best treatment on average, we find the best treatment for this patient.
Exploration is particularly delicate here; we don't want to give suboptimal treatments just to gather data. Techniques like Thompson Sampling can help by quantifying uncertainty.
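One common form is linear Thompson Sampling, sketched below. It reuses the same $A_a$ and $b_a$ statistics as LinUCB but explores by sampling a plausible weight vector per action rather than adding an explicit bonus. The class name and the `noise_scale` parameter are assumptions made here for illustration.

```python
import numpy as np

class LinearThompsonSampling:
    """Sketch: sample a weight vector per action from a Gaussian centered on the
    least-squares estimate, then act greedily on the sampled predicted rewards."""

    def __init__(self, n_actions, n_features, noise_scale=0.5):
        self.noise_scale = noise_scale
        self.A = [np.eye(n_features) for _ in range(n_actions)]
        self.b = [np.zeros(n_features) for _ in range(n_actions)]

    def select_action(self, context):
        sampled = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta_mean = A_inv @ b
            # The covariance shrinks as data accumulates, so exploration fades naturally
            theta_sample = np.random.multivariate_normal(theta_mean, self.noise_scale**2 * A_inv)
            sampled.append(theta_sample @ context)
        return int(np.argmax(sampled))

    def update(self, context, action, reward):
        self.A[action] += np.outer(context, context)
        self.b[action] += reward * context
```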
In high-stakes applications like medicine, we often use offline evaluation techniques to estimate policy performance from historical data before deploying new policies. This is an active research area called off-policy evaluation.
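The simplest such estimator is inverse propensity scoring (IPS). The sketch below assumes the historical log recorded the probability with which the logging policy chose each action; the function name and data layout are illustrative.

```python
import numpy as np

def ips_estimate(logged_data, new_policy):
    """Estimate the average reward a new (deterministic) policy would earn,
    using only data collected by a different logging policy.

    logged_data: iterable of (context, action, reward, logging_prob) tuples,
                 where logging_prob is P(logging policy picks `action` | context).
    new_policy:  function mapping a context to the action it would choose.
    """
    values = []
    for context, action, reward, logging_prob in logged_data:
        if new_policy(context) == action:
            # Matching actions are up-weighted by how rarely the logger chose them
            values.append(reward / logging_prob)
        else:
            values.append(0.0)
    return float(np.mean(values))
```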
Contextual Bandits vs. Full RL
We've progressed from multi-armed bandits (no context) to contextual bandits (context matters). What's next?
The key assumption in contextual bandits: your action doesn't affect future contexts.
When you recommend a movie, the next user's features don't depend on your recommendation. Each interaction is independent.
But what if actions do affect the future?
- A chess move changes the board state
- A robotâs step changes its position
- A treatment affects the patientâs future health
This is full reinforcement learning: actions affect states, which affect future rewards, which requires planning ahead.
| Problem | Context/State | Actions Affect Future? |
|---|---|---|
| Multi-armed bandit | None | N/A |
| Contextual bandit | Yes, observed | No |
| Full RL (MDP) | Yes, observed | Yes |
The techniques we've learned (value estimation, the exploration-exploitation tradeoff, balancing uncertainty) all carry forward into full RL. Contextual bandits are a stepping stone.
Summary
Key Takeaways
- Contextual bandits extend multi-armed bandits by conditioning on observable context features
- We learn a policy (context → action mapping), not just a single best action
- Linear models provide a simple way to predict rewards from context: $\hat{r}(x, a) = \theta_a^\top x$
- Exploration strategies like ε-greedy and LinUCB balance learning and earning
- Applications are everywhere: recommendations, ads, medical treatment, personalization
- The key simplification vs. full RL: actions don't affect future contexts
Exercises
Conceptual Questions
- Why can't we use supervised learning directly for contextual bandits? Explain the difference between full feedback (supervised) and bandit feedback.
- When would a contextual bandit outperform a context-blind bandit? Describe a scenario where ignoring context leads to poor performance.
- What's the key difference between contextual bandits and full RL? Why does this difference matter for algorithm design?
Coding Challenges
- Implement a movie recommendation simulation: Create an environment with 3 user types (action fans, comedy fans, drama fans) and 5 movies. Show that a contextual bandit learns to recommend appropriately to each type.
- Compare ε-greedy and LinUCB: Run both algorithms on the same environment. Plot learning curves. When does LinUCB's advantage show most clearly?
- Feature engineering experiment: Start with a contextual bandit using 2 features. Add 3 irrelevant noise features. How does performance change? What does this tell you about feature selection?
Open-Ended Exploration
- Design your own contextual bandit application: Think of a decision problem you encounter regularly where context matters. Define:
- What context features would you use?
- What actions are available?
- How would you measure reward?
- What makes exploration tricky in your setting?