# Quantization in Practice

Time to get hands-on. We’ll quantize a real LLM and measure the tradeoffs.

## The Practical Workflow

### Quick Start: bitsandbytes

The fastest path to quantization. One flag, and your model loads in lower precision.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

MODEL = "Qwen/Qwen2.5-0.5B"

# Float16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

# 8-bit - just add one flag
model_8bit = AutoModelForCausalLM.from_pretrained(
    MODEL, load_in_8bit=True, device_map="auto"
)

# 4-bit with NF4 (best quality)
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=config, device_map="auto"
)
```

### GPTQ for Production
For best 4-bit quality, use GPTQ with calibration data.
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Prepare calibration data (use domain-representative text)
calibration_data = [
    tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    for text in your_sample_texts[:128]
]

# Configure GPTQ
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,  # Activation-order quantization; helps with quality
)

# Quantize
model = AutoGPTQForCausalLM.from_pretrained(MODEL, quantize_config=config)
model.quantize(calibration_data)
model.save_quantized("./model-gptq-4bit")
```

## Measuring Quality
Quantization can fail silently. A model might look fine on average but fail on specific inputs.
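One concrete reason failures are input-dependent: absmax quantization sets its scale from the largest magnitude in a tensor, so a single outlier coarsens the grid for every other value. A toy NumPy sketch (illustrative only, not tied to any library above):

```python
import numpy as np

def absmax_quantize(x, bits=8):
    # Symmetric absmax quantization: the largest |value| maps to the int range edge.
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.round(x / scale).astype(np.int8)
    return q * scale  # dequantized values

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=1024)  # typical weight magnitudes

err_plain = np.abs(absmax_quantize(w) - w).mean()

w_outlier = w.copy()
w_outlier[0] = 2.0  # one large outlier stretches the scale
err_outlier = np.abs(absmax_quantize(w_outlier) - w_outlier)[1:].mean()

print(f"mean abs error, no outlier:   {err_plain:.2e}")
print(f"mean abs error, with outlier: {err_outlier:.2e}")
```

The per-element error on the non-outlier values grows by more than an order of magnitude, even though the tensor "looks fine" on average. This is the kind of effect that shows up only on specific inputs.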
### Perplexity Test
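Perplexity is the exponential of the average per-token loss. The evaluation below weights each sample's loss by its token count before averaging, so long and short samples count fairly; a toy check of that bookkeeping (numbers invented for illustration):

```python
import math

# (mean token loss, token count) per sample, as collected in the eval loop
samples = [(2.0, 10), (4.0, 30)]

total_loss = sum(loss * n for loss, n in samples)
total_tokens = sum(n for _, n in samples)

weighted = math.exp(total_loss / total_tokens)               # exp(3.5)
naive = math.exp(sum(l for l, _ in samples) / len(samples))  # exp(3.0): ignores lengths

print(f"token-weighted perplexity: {weighted:.2f}")
print(f"unweighted (biased):       {naive:.2f}")
```

The unweighted average understates perplexity here because the shorter, easier sample gets the same vote as the longer, harder one.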
```python
import numpy as np
import torch
from datasets import load_dataset

def evaluate_perplexity(model, tokenizer, max_samples=100):
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    total_loss, total_tokens = 0, 0
    for sample in dataset.select(range(max_samples)):
        if not sample["text"].strip():
            continue  # wikitext contains empty lines; skip them
        inputs = tokenizer(sample["text"], return_tensors="pt", truncation=True)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            loss = model(**inputs, labels=inputs["input_ids"]).loss
        total_loss += loss.item() * inputs["input_ids"].shape[1]
        total_tokens += inputs["input_ids"].shape[1]
    return np.exp(total_loss / total_tokens)

# Compare
for name, model in [("FP16", model_fp16), ("8-bit", model_8bit), ("4-bit", model_4bit)]:
    ppl = evaluate_perplexity(model, tokenizer)
    print(f"{name}: {ppl:.2f}")
```

### Speed Benchmarks
```python
import time
import torch

def benchmark_speed(model, tokenizer, prompt, n_tokens=100, n_runs=5):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warmup
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=10)

    # Benchmark
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    torch.cuda.synchronize()
    total_time = time.time() - start

    tokens_per_sec = (n_tokens * n_runs) / total_time
    return tokens_per_sec
```

## For RL Policies
RL has special needs:
- Policy network: Speed matters most (many rollouts)
- Value network: Accuracy matters most (training signal)
Solution: Quantize the policy more aggressively than the critic.
```python
import torch
import torch.nn as nn

class QuantizedRLPolicy(nn.Module):
    def __init__(self, policy_net, value_net):
        super().__init__()
        # Quantize policy for speed (dynamic int8 on Linear layers; CPU inference)
        self.policy = torch.quantization.quantize_dynamic(
            policy_net, {nn.Linear}, dtype=torch.qint8
        )
        # Keep value at higher precision for a clean training signal
        self.value = value_net.half()

    def forward(self, state):
        action_logits = self.policy(state)
        value = self.value(state.half())
        return action_logits, value
```

## Complete Notebook
Run the full workflow yourself:
The notebook includes:
✓ Load Qwen-0.5B in fp16, 8-bit, 4-bit
✓ Memory comparison
✓ Perplexity evaluation
✓ Speed benchmarks
✓ GPTQ quantization
✓ Runs on free Colab T4
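The memory comparison itself needs no notebook: `transformers` models expose `get_memory_footprint()`, or you can sum parameter storage directly. A minimal sketch of the latter (the commented usage assumes the Quick Start variables are loaded):

```python
import torch.nn as nn

def footprint_gb(model: nn.Module) -> float:
    # Bytes held by parameters and buffers, in GB.
    # Mirrors what get_memory_footprint() reports for dense models.
    total = sum(p.numel() * p.element_size() for p in model.parameters())
    total += sum(b.numel() * b.element_size() for b in model.buffers())
    return total / 1e9

# With the Quick Start models loaded:
# for name, m in [("fp16", model_fp16), ("8-bit", model_8bit), ("4-bit", model_4bit)]:
#     print(f"{name}: {footprint_gb(m):.2f} GB")
```

Note that quantized layers store packed integer weights, so their `element_size()` is smaller, which is exactly the saving you are measuring.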
## Quick Reference

- Start with `float16` → nearly lossless baseline
- Try `int8` first → usually good enough
- Use GPTQ/AWQ for `int4` → better than bitsandbytes
- Always measure → perplexity, task accuracy, speed
- Test edge cases → unusual inputs reveal problems
- For RL → `int8` policy, `fp16` value

## Next Up
Ready to test your understanding? The next section has a summary of key concepts plus hands-on coding exercises.
Continue to Summary & Exercises.