Why Quantization Matters
Neural networks get bigger every year: the largest Llama 3.1 model has 405 billion parameters. Storing and running models at this scale requires serious hardware, unless you quantize.
The Memory Problem
The math is simple: Every parameter takes up space. float32 uses 4 bytes per number.
A 7B model = 7 billion × 4 bytes = 28 GB.
But int4 uses only 0.5 bytes per number. Same model = 7 billion × 0.5 bytes = 3.5 GB.
That’s the difference between “needs a data center” and “runs on a laptop.”
Memory footprint: M = (N × b) / 8 bytes, where N is the parameter count and b is the number of bits per parameter.
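The formula can be sanity-checked in a few lines of Python (the 7B model and bit widths are just the examples from above):

```python
def memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory footprint M = N * b / 8 bytes, converted to GB."""
    return n_params * bits_per_param / 8 / 1e9

# A 7B-parameter model at various precisions:
for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>7}: {memory_gb(7e9, bits):4.1f} GB")
```

Running this reproduces the numbers above: 28 GB in float32 down to 3.5 GB in int4.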
The Speed Problem
Memory size isn’t just about fitting—it’s about speed.
Bandwidth Limited
For each token, the GPU loads all weights from memory.
Smaller weights = faster loading = more tokens/sec.
GPU compute cores have tiny caches (a few MB). A 7B model is 28GB in float32—the weights simply don’t fit. They must be streamed from memory for every forward pass. This is called being memory-bound: the GPU spends most of its time waiting for data, not computing.
70B model in float32 (280 GB), on a GPU with 2000 GB/s memory bandwidth:
- Load time: 280 GB ÷ 2000 GB/s = 140 ms
- Result: ~7 tokens/second
Same model in int4 (35 GB):
- Load time: 35 GB ÷ 2000 GB/s = 17.5 ms
- Result: ~57 tokens/second
8× faster from quantization alone!
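This back-of-the-envelope math can be sketched in a couple of lines, assuming the same 2000 GB/s bandwidth figure used above (real decoding also pays attention and KV-cache costs, so treat these as rough upper bounds):

```python
def tokens_per_second(model_gb: float, bandwidth_gb_s: float = 2000.0) -> float:
    # Memory-bound decoding: every token streams all weights once,
    # so throughput is roughly bandwidth / model size.
    return bandwidth_gb_s / model_gb

print(f"70B float32: {tokens_per_second(280):.0f} tok/s")  # ~7
print(f"70B int4:    {tokens_per_second(35):.0f} tok/s")   # ~57
```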
Why It Works
Here’s the surprising part: neural networks barely notice.
```python
import torch
import torch.nn as nn

# Quick experiment: how much does int8 hurt?
torch.manual_seed(42)
layer = nn.Linear(512, 512)
x = torch.randn(32, 512)

with torch.no_grad():
    y_original = layer(x)

# Simple symmetric quantization
def quantize_int8(t):
    scale = t.abs().max() / 127
    return torch.round(t / scale).clamp(-128, 127) * scale

layer.weight.data = quantize_int8(layer.weight.data)

with torch.no_grad():
    y_quantized = layer(x)

error = (y_original - y_quantized).abs().mean() / y_original.abs().mean()
print(f"Relative error: {error:.2%}")  # Typically < 1%
```

When Does It Hurt?
Quantization is lossy, and some models and layers are more sensitive than others. Always measure accuracy after quantization.
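One classic failure mode is outlier weights: symmetric per-tensor quantization picks its scale from the largest absolute value, so a single extreme weight stretches the grid and crushes the resolution available to everything else. A minimal sketch with synthetic data, using the same quantization scheme as the experiment above:

```python
import random

def quantize_int8(values):
    """Symmetric per-tensor int8 quantization (same scheme as above)."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) * scale for v in values]

def rel_error(orig, quant):
    return sum(abs(a - b) for a, b in zip(orig, quant)) / sum(abs(a) for a in orig)

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(1000)]
outliers = weights[:-1] + [100.0]  # one extreme value stretches the scale

print(f"no outlier:   {rel_error(weights, quantize_int8(weights)):.2%}")
print(f"with outlier: {rel_error(outliers, quantize_int8(outliers)):.2%}")
```

With well-behaved weights the relative error stays around 1%; the single outlier pushes it more than an order of magnitude higher. This is exactly why modern schemes quantize per-channel or per-group instead of per-tensor.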
Quick Reference
RL systems have distinct phases with different precision needs:
- Training neural networks: Use `float16` or `bfloat16`. You need gradient precision, but full `float32` is overkill. Mixed-precision training gives you a 2× speedup with minimal accuracy loss.
- Collecting rollouts: Use `int8` for policy inference. When running thousands of parallel environments, inference speed matters more than perfect precision. The policy just needs to pick good actions.
- Edge deployment: Use `int4` when deploying to robots, drones, or embedded devices. Memory and power are constrained, and real-time response matters more than optimal actions.
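As a rough sketch of what the first two choices look like in PyTorch (a hypothetical toy policy network; `torch.autocast` handles mixed-precision math, and `quantize_dynamic` handles int8 inference; int4 typically needs a dedicated library such as bitsandbytes or a GPTQ/AWQ implementation):

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(64, 4))  # hypothetical toy policy network

# Training: run the forward pass in bfloat16 while weights stay float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = policy(torch.randn(8, 64))
print(logits.dtype)  # bfloat16 activations, float32 master weights

# Rollouts: dynamic int8 quantization of linear layers for fast CPU inference.
policy_int8 = torch.ao.quantization.quantize_dynamic(
    policy, {nn.Linear}, dtype=torch.qint8
)
actions = policy_int8(torch.randn(8, 64)).argmax(dim=-1)
print(actions.shape)
```

On GPU, the same `torch.autocast` pattern applies with `device_type="cuda"`; the int8 path there usually goes through an inference engine rather than dynamic quantization.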
Next Up
Now that you understand why quantization works, let’s see how numbers are stored at the bit level.
Continue to Number Representations.