Quantization
What You'll Learn
- Understand why quantization is essential for deploying large models
- See how reducing precision affects neural network weights
- Learn the memory-accuracy tradeoff through interactive exploration
A 7-billion parameter model needs 28 GB just to store its weights: 7 × 10⁹ parameters × 4 bytes each in float32.
Most GPUs have 8-24 GB. Quantization shrinks models by storing each parameter in fewer bits.
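The arithmetic above generalizes to any bit width. A quick back-of-the-envelope sketch (weights only, ignoring activations and KV cache):

```python
# Weight-storage footprint of a 7B-parameter model at different precisions.
PARAMS = 7e9  # 7 billion parameters

for bits in (32, 16, 8, 4):
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB (decimal)
    print(f"{bits:>2}-bit: {gb:5.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```

At 4 bits, the same 7B model fits comfortably on a consumer GPU.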
Try It First
What if we used fewer bits? Drag the slider to see how precision affects the values:
What you just saw: Neural network weights are numbers (like 0.0342 or -0.1567). When we reduce precision from 32 bits to 8 bits, these values “snap” to a grid of allowed values.
The magic? With 256 levels (8-bit), the error is tiny. Most models barely notice the difference.
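The "snapping" can be reproduced in a few lines. This toy sketch round-trips some example weights through symmetric int8 quantization and prints the error (the variable names are illustrative, not from any particular library):

```python
# Round-trip a few float weights through symmetric int8 quantization.
weights = [0.0342, -0.1567, 0.2891, -0.0073]

max_abs = max(abs(w) for w in weights)
scale = max_abs / 127  # symmetric int8 uses the range [-127, 127]

for w in weights:
    q = round(w / scale)   # snap to one of the allowed integer levels
    w_hat = q * scale      # dequantize back to float
    print(f"{w:+.4f} -> q={q:+4d} -> {w_hat:+.4f}  (err {abs(w - w_hat):.5f})")
```

The round-trip error is bounded by half the grid spacing (`scale / 2`), which is why 8-bit models usually behave almost identically to their float32 originals.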
The Core Tradeoff
Quantization means mapping continuous (or high-precision) values to a smaller set of discrete values. In neural networks, that means converting 32-bit floating-point weights to 16-bit, 8-bit, or even 4-bit representations.
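That mapping is usually written as x_q = clamp(round(x / scale) + zero_point, qmin, qmax). A minimal sketch of this uniform affine scheme, assuming a pre-chosen scale and zero-point (the function names here are illustrative):

```python
# Uniform affine quantization: map a float to an int8 code and back.
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Note: Python's round() ties to even; real quantizers pick a rounding mode.
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp into the int8 range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

scale, zp = 0.01, 0
q = quantize(0.0342, scale, zp)
print(q, dequantize(q, scale, zp))  # the reconstruction is close to 0.0342
```

The scale controls the grid spacing (the tradeoff knob), and the zero-point lets the grid represent asymmetric value ranges exactly.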
Why This Matters for RL
RL policies often run inside tight real-time control loops on limited hardware. Quantization shrinks a policy enough to fit those memory and latency budgets, which is why it comes up repeatedly when deploying trained agents.
Chapter Roadmap
Why Quantization Matters
Memory problem, speed gains, and when it works
Number Representations
float32, float16, int8: how computers store numbers
Quantization Methods
PTQ, QAT, GPTQ, and AWQ explained
Hands-On Practice
Quantize a real LLM in our Jupyter notebook
New to quantization? Begin with Why Quantization Matters to understand the motivation.
Want to jump straight to code? Head to Hands-On Practice.
Key Takeaways
- Quantization reduces model memory by using fewer bits per parameter
- 8-bit quantization typically loses less than 1% accuracy
- Modern methods (GPTQ, AWQ) make 4-bit quantization practical for LLMs
- Essential for deploying RL policies in real-time