ML Concepts • Part 5 of 5

Summary & Exercises

Key takeaways and hands-on practice


Key Takeaways

1. Quantization trades precision for efficiency
Using fewer bits per number reduces memory and speeds up inference, with minimal accuracy loss in well-designed models.

2. Different formats for different purposes
float16 offers precision for inference, bfloat16 handles training's large values, and int8/int4 maximize deployment efficiency.

3. PTQ vs QAT: calibration vs training
Post-Training Quantization is fast but less accurate. Quantization-Aware Training takes longer but produces better results, especially for aggressive compression.

4. Scale and zero-point are the core parameters
Quantization maps continuous values to discrete integers using a scale factor (and optionally a zero-point for asymmetric quantization).

5. Per-channel beats per-tensor
Using separate scales for each channel preserves more information when weight magnitudes vary across channels.
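Takeaways 1 and 4 can be sketched in a few lines of NumPy. This is a minimal illustration of symmetric per-tensor int8 quantization (the function names `quantize_symmetric` and `dequantize` are ours, not from any library):

```python
import numpy as np

def quantize_symmetric(w, num_bits=8):
    """Map float weights to signed integers using a single scale (zero-point = 0)."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    scale = np.max(np.abs(w)) / qmax          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=1000).astype(np.float32)
q, scale = quantize_symmetric(w)
error = np.mean(np.abs(w - dequantize(q, scale)))
print(f"scale={scale:.6f}, mean abs reconstruction error={error:.6f}")
```

Storing `q` (1 byte/value) plus one `scale` replaces 4 bytes/value of float32; the reconstruction error is bounded by roughly half a quantization step.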

Quick Quiz

Test your understanding with these conceptual questions.

1. Why does bfloat16 work better than float16 for training?

bfloat16 has 8 exponent bits (same as float32), giving it a much larger range (±3.4×10³⁸ vs ±65,504). During training, gradients and intermediate activations can spike to large values. float16 would overflow to inf, breaking training, while bfloat16 handles these values safely.
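The float16 overflow is easy to demonstrate in NumPy (NumPy has no native bfloat16, so only the float16 side is shown; the bfloat16 behavior is as described above):

```python
import numpy as np

# float16 tops out at 65,504, so a training-sized spike overflows to inf.
spike = np.float32(70000.0)
with np.errstate(over="ignore"):
    as_fp16 = spike.astype(np.float16)

print(np.finfo(np.float16).max)  # 65504.0
print(as_fp16)                   # inf -- a gradient this size would break training
```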

2. A model has 7 billion parameters. How much memory does it need in float32 vs int4?

float32: 7B × 4 bytes = 28 GB
int4: 7B × 0.5 bytes = 3.5 GB

That’s an 8× reduction, making the difference between needing a data center GPU and running on a laptop.
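The arithmetic generalizes to any format: memory is just parameter count times bytes per parameter.

```python
# Back-of-envelope memory footprint: parameters x bytes per parameter.
params = 7_000_000_000
bytes_per = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}
for fmt, b in bytes_per.items():
    print(f"{fmt:>8}: {params * b / 1e9:.1f} GB")
# float32: 28.0 GB ... int4: 3.5 GB
```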

3. When would you use asymmetric quantization instead of symmetric?

Use asymmetric quantization for activations after ReLU, which are always non-negative. Symmetric quantization would waste half the range (the negative side). Asymmetric adds a zero-point offset to shift the range, using all available levels efficiently.

Use symmetric for weights, which are typically centered around zero.
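The wasted range is measurable. A quick NumPy comparison on simulated ReLU activations (helper name `quant_error` is illustrative):

```python
import numpy as np

def quant_error(x, scale, zero_point, qmin, qmax):
    """Mean round-trip error for an affine (scale, zero-point) integer mapping."""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return np.mean(np.abs(x - (q - zero_point) * scale))

rng = np.random.default_rng(1)
acts = np.maximum(rng.normal(0, 1, 10000), 0)  # ReLU output: non-negative

# Symmetric: zero-point = 0, range [-127, 127]; the negative half goes unused.
sym_scale = np.max(np.abs(acts)) / 127
sym_err = quant_error(acts, sym_scale, 0, -127, 127)

# Asymmetric: zero-point shifts [0, max] onto all 256 levels of [-128, 127].
asym_scale = (acts.max() - acts.min()) / 255
zp = -128 - round(acts.min() / asym_scale)
asym_err = quant_error(acts, asym_scale, zp, -128, 127)

print(f"symmetric error: {sym_err:.5f}, asymmetric error: {asym_err:.5f}")
```

Because asymmetric quantization uses nearly twice as many levels over the same range, its error is roughly half.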

4. Why is inference typically memory-bound rather than compute-bound?

GPU compute cores have tiny caches (a few MB), while models have billions of parameters (tens of GB). For each forward pass, all weights must be loaded from memory to the compute units. The GPU spends most of its time waiting for data transfer, not computing. Smaller weights (via quantization) = faster loading = more tokens per second.

5. What’s the difference between PTQ and QAT? When would you choose each?

PTQ (Post-Training Quantization): Quantize a pre-trained model using calibration data. Fast (minutes), no training needed, but may lose accuracy.

QAT (Quantization-Aware Training): Train with simulated quantization so weights learn to be quantization-friendly. Takes longer but produces better results.

Choose PTQ when: you need quick results, have limited compute, or are using int8 (mild compression).
Choose QAT when: accuracy is critical, you’re using int4 or lower, or PTQ results are unacceptable.
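QAT's "simulated quantization" is often implemented as a fake-quant round trip inserted into the forward pass, while the stored weights stay in float for gradient updates (gradients flow through the rounding via a straight-through estimator). A minimal NumPy sketch of the forward half, with an illustrative function name:

```python
import numpy as np

def fake_quant(w, num_bits=8):
    """Quantize then immediately dequantize: the forward pass sees the
    quantized values, but w itself remains float for training."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

w = np.random.default_rng(2).normal(0, 0.1, 1000).astype(np.float32)
w_q = fake_quant(w)
print(np.unique(w_q).size, "distinct values out of", w.size)
```

Training against `w_q` lets the network adapt its weights to the discrete grid it will actually use at inference time.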

Coding Exercises

Practice implementing quantization from scratch with these hands-on exercises.

Exercise 1: Symmetric Quantization
Implement int8 quantization from scratch using NumPy. Calculate scale factors and measure reconstruction error.
Exercise 2: Per-Channel vs Per-Tensor
Compare quantization error when using one scale for all weights vs one scale per channel.
Exercise 3: Visualize Error vs Bits
Plot how quantization error changes from 2-bit to 8-bit. See the exponential improvement.
Exercise 4: PyTorch Quantization
Use torch.quantization.quantize_dynamic to quantize a model and measure size/speed improvements.

What’s Next?

ℹ️ Note

You now understand how to make neural networks smaller and faster through quantization. These techniques are essential for:

  • Deploying RL policies on edge devices (robots, drones, embedded systems)
  • Scaling rollout collection by running more parallel environments
  • Reducing inference costs in production systems

To go deeper, explore:

  • GPTQ and AWQ for LLM-specific quantization
  • Mixed-precision training with automatic loss scaling
  • Hardware-specific optimizations (TensorRT, ONNX Runtime)