Chapter 401
Editor reviewed · Last reviewed: Jan 2, 2025

Quantization

Reducing model size and speeding up inference by using lower-precision numbers

What You'll Learn

  • Understand why quantization is essential for deploying large models
  • See how reducing precision affects neural network weights
  • Learn the memory-accuracy tradeoff through interactive exploration

A 7-billion-parameter model needs 28 GB just to store its weights:

7B params × 32 bits ÷ 8 bits/byte = 28 GB

(float32 = 32 bits = 4 bytes per number)

Most GPUs have 8-24 GB. Quantization shrinks models by storing each parameter in fewer bits.
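The memory arithmetic above can be sketched in a few lines of Python. This is a back-of-the-envelope estimate for weight storage only; activations, the KV cache, and optimizer state all add more:

```python
# Weight-storage cost of a 7B-parameter model at different precisions.
# Counts weights only; real deployments also need memory for activations.
PARAMS = 7_000_000_000

for fmt, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    gb = PARAMS * bits / 8 / 1e9   # bits -> bytes -> gigabytes
    print(f"{fmt:8s} {gb:5.1f} GB")
# float32 28.0 GB, float16 14.0 GB, int8 7.0 GB, int4 3.5 GB
```

At int4, the weights of the same 7B model take only 3.5 GB, small enough to fit on a common 8 GB consumer GPU.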

Try It First

What if we used fewer bits? Drag the slider to see how precision affects the values:

[Interactive: Quantization Explorer. Drag a precision slider (2-bit, 16-bit, or 32-bit baseline) and watch weight values snap to the quantization grid; the widget reports the compression ratio, the number of quantization levels, and the average error. At 32 bits: 1× compression, 2³² = 4,294,967,296 levels, 0.0% average error.]

📦 Full precision (baseline): 32-bit float is the standard training format. No compression, no error, but it uses the most memory.

What you just saw: Neural network weights are numbers (like 0.0342 or -0.1567). When we reduce precision from 32 bits to 8 bits, these values “snap” to a grid of allowed values.

The magic? With 256 levels (8-bit), the error is tiny. Most models barely notice the difference.
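The grid-snapping you just saw can be sketched with a toy uniform symmetric quantizer. This scheme is an illustrative assumption for this sketch; production methods such as GPTQ and AWQ use per-group scales and calibration data, but the core idea is the same:

```python
import numpy as np

def quantize(weights, bits):
    """Snap each weight to the nearest of 2**bits evenly spaced levels
    spanning [-max|w|, +max|w|] (uniform symmetric quantization)."""
    levels = 2 ** bits
    scale = np.abs(weights).max() / (levels / 2 - 1)  # grid step size
    q = np.round(weights / scale)                     # snap to integer grid
    q = np.clip(q, -(levels // 2), levels // 2 - 1)   # stay in integer range
    return q * scale                                  # dequantize for comparison

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=10_000).astype(np.float32)  # toy weight tensor

for bits in (8, 4, 2):
    err = np.abs(quantize(w, bits) - w).mean()
    print(f"{bits}-bit: {2**bits:3d} levels, mean abs error {err:.5f}")
```

Each extra bit doubles the number of grid levels and roughly halves the average error, which is why 8-bit (256 levels) is nearly lossless while 2-bit is not.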

The Core Tradeoff

  Format    Bits    Size vs. float32
  float32   32      1× (baseline)
  float16   16      2× smaller
  int8       8      4× smaller
  int4       4      8× smaller
📖 Quantization

Mapping continuous (or high-precision) values to a smaller set of discrete values. In neural networks: converting 32-bit weights to 16-bit, 8-bit, or 4-bit.

Why This Matters for RL

  • 🤖 Real-time Control: robots need millisecond decisions
  • 🎮 Game AI: 60 FPS gameplay requires fast inference
  • 💬 RLHF Models: quantized LLMs train and serve faster
  • 📱 Edge Deployment: drones and cars can’t carry data centers

Chapter Roadmap

ℹ️ Start Here

New to quantization? Begin with Why Quantization Matters to understand the motivation.

Want to jump straight to code? Head to Quantization in Practice.


Key Takeaways

  • Quantization reduces model memory by using fewer bits per parameter
  • 8-bit quantization typically loses less than 1% accuracy
  • Modern methods (GPTQ, AWQ) make 4-bit quantization practical for LLMs
  • Quantization is essential for deploying RL policies in real time