Quantization
What You'll Learn
- Understand why quantization is essential for deploying large models
- See how reducing precision affects neural network weights
- Learn the memory-accuracy tradeoff through interactive exploration
A 7-billion parameter model needs 28 GB just to store its weights: 7 × 10⁹ parameters × 4 bytes each in float32.
Most GPUs have 8-24 GB. Quantization shrinks models by storing each parameter in fewer bits.
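The arithmetic above generalizes to any bit width. A quick back-of-the-envelope sketch (weights only, ignoring activations and KV cache):

```python
# Weight-storage footprint of a 7B-parameter model at different precisions.
PARAMS = 7e9  # 7 billion parameters

for bits in (32, 16, 8, 4):
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB (decimal)
    print(f"{bits:>2}-bit: {gb:5.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```

At 4 bits, the same 7B model fits comfortably on a consumer GPU.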
Try It First
What if we used fewer bits? Drag the slider to see how precision affects the values:
What you just saw: Neural network weights are numbers (like 0.0342 or -0.1567). When we reduce precision from 32 bits to 8 bits, these values “snap” to a grid of allowed values.
The magic? With 256 levels (8-bit), the error is tiny. Most models barely notice the difference.
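The "snapping" can be reproduced in a few lines. This toy sketch round-trips some example weights through symmetric int8 quantization and prints the error (the variable names are illustrative, not from any particular library):

```python
# Round-trip a few float weights through symmetric int8 quantization.
weights = [0.0342, -0.1567, 0.2891, -0.0073]

max_abs = max(abs(w) for w in weights)
scale = max_abs / 127  # symmetric int8 uses the range [-127, 127]

for w in weights:
    q = round(w / scale)   # snap to one of the allowed integer levels
    w_hat = q * scale      # dequantize back to float
    print(f"{w:+.4f} -> q={q:+4d} -> {w_hat:+.4f}  (err {abs(w - w_hat):.5f})")
```

The round-trip error is bounded by half the grid spacing (`scale / 2`), which is why 8-bit models usually behave almost identically to their float32 originals.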
The Core Tradeoff
Quantization means mapping continuous (or high-precision) values to a smaller set of discrete values. In neural networks, that means converting 32-bit floating-point weights to 16-bit, 8-bit, or even 4-bit representations.
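That mapping is usually written as x_q = clamp(round(x / scale) + zero_point, qmin, qmax). A minimal sketch of this uniform affine scheme, assuming a pre-chosen scale and zero-point (the function names here are illustrative):

```python
# Uniform affine quantization: map a float to an int8 code and back.
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Note: Python's round() ties to even; real quantizers pick a rounding mode.
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp into the int8 range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

scale, zp = 0.01, 0
q = quantize(0.0342, scale, zp)
print(q, dequantize(q, scale, zp))  # the reconstruction is close to 0.0342
```

The scale controls the grid spacing (the tradeoff knob), and the zero-point lets the grid represent asymmetric value ranges exactly.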
Why This Matters for RL
RL policies often run inside tight real-time control loops on limited hardware. Quantization shrinks a policy enough to fit those memory and latency budgets, which is why it comes up repeatedly when deploying trained agents.
Chapter Roadmap
Why Quantization Matters
Memory problem, speed gains, and when it works
Number Representations
float32, float16, int8: how computers store numbers
Quantization Methods
PTQ, QAT, GPTQ, and AWQ explained
Hands-On Practice
Quantize a real LLM in our Jupyter notebook
New to quantization? Begin with Why Quantization Matters to understand the motivation.
Want to jump straight to code? Head to Hands-On Practice.
Key Takeaways
- Quantization reduces model memory by using fewer bits per parameter
- 8-bit quantization typically loses less than 1% accuracy
- Modern methods (GPTQ, AWQ) make 4-bit quantization practical for LLMs
- Essential for deploying RL policies in real-time