Summary & Exercises
Key Takeaways
float16 offers precision for inference, bfloat16 handles training's large values, and int8/int4 maximize deployment efficiency.

Quick Quiz
Test your understanding with these conceptual questions.
Why does bfloat16 work better than float16 for training?
bfloat16 has 8 exponent bits (same as float32), giving it a much larger range (±3.4×10³⁸ vs ±65,504). During training, gradients and intermediate activations can spike to large values. float16 would overflow to inf, breaking training, while bfloat16 handles these values safely.
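This is easy to see numerically. The sketch below uses NumPy only, emulating bfloat16 by truncating a float32 to its top 16 bits (real bfloat16 hardware rounds to nearest instead of truncating, so this is a coarse emulation):

```python
import numpy as np

x = np.array([70000.0], dtype=np.float32)  # above float16's max of 65504

f16 = x.astype(np.float16)  # overflows to inf

# Crude bfloat16 emulation: keep only the top 16 bits of the float32.
# bfloat16 shares float32's 8 exponent bits, so the magnitude survives;
# only mantissa precision is lost.
bf16 = (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

print(f16)   # [inf]
print(bf16)  # finite, within ~1% of 70000
```

The float16 value is unusable (any arithmetic on it propagates inf/NaN), while the bfloat16 value is merely imprecise, which gradient descent tolerates well.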
How much memory does a 7B-parameter model need in float32 vs int4?
float32: 7B × 4 bytes = 28 GB
int4: 7B × 0.5 bytes = 3.5 GB
That’s an 8× reduction, making the difference between needing a data center GPU and running on a laptop.
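The arithmetic generalizes to any precision; a quick check over the formats covered in this chapter (the 7B parameter count is just the example from the question):

```python
params = 7e9  # 7B-parameter model

# memory = parameter count x bytes per parameter
for fmt, bytes_per_param in [("float32", 4), ("float16", 2),
                             ("int8", 1), ("int4", 0.5)]:
    print(f"{fmt}: {params * bytes_per_param / 1e9:.1f} GB")
```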
When should you use asymmetric quantization instead of symmetric quantization?
Use asymmetric quantization for activations after ReLU, which are always non-negative. Symmetric quantization would waste half the range (the negative side). Asymmetric adds a zero-point offset to shift the range, using all available levels efficiently.
Use symmetric for weights, which are typically centered around zero.
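The effect is easy to measure. This sketch compares both schemes on synthetic post-ReLU activations (random data; the scale/zero-point formulas are the standard int8 ones, not tied to any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=10_000), 0.0)  # post-ReLU: all >= 0

# Symmetric int8: map [-max|x|, max|x|] onto [-127, 127]
s_sym = np.abs(acts).max() / 127
deq_sym = np.clip(np.round(acts / s_sym), -127, 127) * s_sym

# Asymmetric int8: scale + zero-point map [min, max] onto [0, 255]
s_asym = (acts.max() - acts.min()) / 255
zp = np.round(-acts.min() / s_asym)
q_asym = np.clip(np.round(acts / s_asym) + zp, 0, 255)
deq_asym = (q_asym - zp) * s_asym

mse_sym = np.mean((acts - deq_sym) ** 2)
mse_asym = np.mean((acts - deq_asym) ** 2)
print(f"symmetric MSE:  {mse_sym:.2e}")
print(f"asymmetric MSE: {mse_asym:.2e}")
```

Because the data never goes negative, symmetric quantization effectively uses only 128 of its 255 levels, so the asymmetric error comes out roughly 4x smaller (half the step size, squared).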
Why is LLM inference memory-bound rather than compute-bound?
GPU compute cores have tiny caches (a few MB), while models have billions of parameters (tens of GB). For each forward pass, all weights must be loaded from memory to the compute units. The GPU spends most of its time waiting for data transfer, not computing. Smaller weights (via quantization) = faster loading = more tokens per second.
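A back-of-envelope sketch of the ceiling this imposes. The bandwidth figure is an assumption (roughly an A100-class GPU); real throughput also depends on batching and KV-cache traffic:

```python
bandwidth_gb_s = 2000  # assumed ~2 TB/s HBM bandwidth

# At batch size 1, every decoded token must stream all weights once,
# so bandwidth / model size bounds tokens per second.
for fmt, weights_gb in [("float16 (7B)", 14.0), ("int4 (7B)", 3.5)]:
    print(f"{fmt}: <= {bandwidth_gb_s / weights_gb:.0f} tokens/s")
```

The 4x smaller int4 weights translate directly into a 4x higher bandwidth-limited token rate, which is exactly the "smaller weights = faster loading" point above.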
What is the difference between PTQ and QAT, and when should you choose each?
PTQ (Post-Training Quantization): Quantize a pre-trained model using calibration data. Fast (minutes), no training needed, but may lose accuracy.
QAT (Quantization-Aware Training): Train with simulated quantization so weights learn to be quantization-friendly. Takes longer but produces better results.
Choose PTQ when: you need quick results, have limited compute, or are using int8 (mild compression).
Choose QAT when: accuracy is critical, you’re using int4 or lower, or PTQ results are unacceptable.
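To make the PTQ side concrete, here is a minimal calibration sketch in NumPy (`calibrate_scale` is a hypothetical helper written for this example, not a library API; it uses the simple max-abs criterion, whereas production PTQ often uses percentile or entropy-based range selection):

```python
import numpy as np

def calibrate_scale(calib_batches):
    """Scan calibration data for the largest |activation|; derive an int8 scale."""
    max_abs = max(np.abs(b).max() for b in calib_batches)
    return max_abs / 127  # symmetric: map [-max_abs, max_abs] to [-127, 127]

rng = np.random.default_rng(0)
calib = [rng.normal(size=256).astype(np.float32) for _ in range(8)]

scale = calibrate_scale(calib)
q = np.clip(np.round(calib[0] / scale), -127, 127).astype(np.int8)
err = np.abs(calib[0] - q.astype(np.float32) * scale).max()
print(f"max reconstruction error: {err:.4f} (half a step is {scale / 2:.4f})")
```

This is the whole reason PTQ takes minutes: it only has to observe ranges, not update weights. QAT instead inserts this round-and-clip step into the forward pass during training so the weights adapt to it.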
Coding Exercises
Practice implementing quantization from scratch with these hands-on exercises.
- Implement int8 quantization from scratch using NumPy. Calculate scale factors and measure reconstruction error.
- Use torch.quantization.quantize_dynamic to quantize a model and measure size/speed improvements.

What's Next?
You now understand how to make neural networks smaller and faster through quantization. These techniques are essential for:
- Deploying RL policies on edge devices (robots, drones, embedded systems)
- Scaling rollout collection by running more parallel environments
- Reducing inference costs in production systems
To go deeper, explore:
- GPTQ and AWQ for LLM-specific quantization
- Mixed-precision training with automatic loss scaling
- Hardware-specific optimizations (TensorRT, ONNX Runtime)