Quantization Methods
You know the “what” of quantization. Now let’s cover the “how”—the methods for actually quantizing a model.
The Two Approaches
Post-Training Quantization (PTQ)
PTQ is simple: take a trained model and convert its weights to lower precision. No retraining required.
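The core operation can be sketched in a few lines of PyTorch. This is a simplified symmetric, per-tensor int8 scheme; the function names are illustrative, not a library API:

```python
import torch

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-tensor int8 quantization via round-to-nearest."""
    scale = weight.abs().max() / 127.0  # map the largest |w| to 127
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate float tensor for computation."""
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Round-to-nearest bounds the error by half a quantization step
assert (w - w_hat).abs().max() <= scale / 2 + 1e-6
```

Real toolchains add per-channel scales, zero-points for asymmetric ranges, and calibrated clipping, but the round-and-rescale core is the same.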
Static vs Dynamic
Dynamic PTQ computes activation scales on the fly at inference time; static PTQ pre-computes them from a calibration dataset. Dynamic is the easiest starting point:
import torch
import torch.nn as nn

model = YourModel()

# Dynamic PTQ - just one line!
model_quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},  # which layer types to quantize
    dtype=torch.qint8
)
Quantization-Aware Training (QAT)
QAT inserts “fake quantization” during training. The model learns to be robust to quantization noise.
Forward: apply quantization. Backward: pretend it didn’t happen.
Forward: $\hat{w} = Q(w)$, where $Q$ is quantize-then-dequantize.
Backward: $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{w}}$ (straight-through).
This is mathematically “wrong” but works remarkably well in practice.
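The straight-through trick can be sketched with PyTorch's autograd. This is a minimal fake-quantization function; the class name and the fixed int8 grid are illustrative:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantize-dequantize on the forward pass; identity gradient on backward."""

    @staticmethod
    def forward(ctx, w, scale):
        # Round onto the int8 grid, then map back to float
        q = torch.clamp(torch.round(w / scale), -127, 127)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pretend quantization was the identity
        return grad_output, None

w = torch.randn(8, requires_grad=True)
w_hat = FakeQuantSTE.apply(w, torch.tensor(0.1))
w_hat.sum().backward()
# Gradients flow through as if no rounding happened
assert torch.allclose(w.grad, torch.ones_like(w))
```

During QAT, every weight (and often activation) passes through a function like this, so the optimizer sees quantized values in the loss but still receives usable gradients.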
Use QAT when:
- PTQ gives unacceptable accuracy loss
- You need aggressive quantization (int4, int2)
- You have compute budget for training/fine-tuning
Modern LLM Methods
For large language models, specialized methods have emerged:
How GPTQ Works
GPTQ’s insight: weight errors can cancel out. If one weight is rounded up, another can be rounded down to compensate.
Instead of independently rounding each weight, GPTQ:
- Quantizes weights one at a time
- After each quantization, adjusts remaining weights to minimize output error
- Repeats until all weights are quantized
The result: much better accuracy than naive rounding.
GPTQ minimizes the layer-wise reconstruction error:

$\arg\min_{\hat{W}} \lVert WX - \hat{W}X \rVert_2^2$

Using the Hessian $H = 2XX^\top$, it updates the remaining weights after quantizing weight $w_q$:

$\delta = -\frac{w_q - \mathrm{quant}(w_q)}{[H^{-1}]_{qq}} \cdot (H^{-1})_{:,q}$
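The update above can be sketched for a single weight row. This is a heavily simplified version of the inner loop, without the Cholesky factorization and blocking tricks the real implementation uses for speed; `quantize` here is plain round-to-nearest on a fixed grid:

```python
import torch

def gptq_row(w, X, scale, damp=0.01):
    """Quantize one weight row sequentially, compensating remaining weights.

    w:     (n,) one row of the weight matrix
    X:     (n, m) calibration inputs for this layer
    scale: quantization step for round-to-nearest
    """
    n = w.numel()
    H = 2 * X @ X.T
    H += damp * torch.mean(torch.diag(H)) * torch.eye(n)  # dampening for stability
    Hinv = torch.inverse(H)
    w = w.clone()
    q = torch.empty_like(w)
    for i in range(n):
        q[i] = torch.round(w[i] / scale) * scale       # quantize one weight
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i + 1:] -= err * Hinv[i, i + 1:]             # adjust remaining weights
    return q

torch.manual_seed(0)
w = torch.randn(16)
X = torch.randn(16, 64)
q = gptq_row(w, X, scale=0.25)
naive = torch.round(w / 0.25) * 0.25
# Compare layer-output error against naive rounding;
# the compensated version is typically lower
err_gptq = torch.norm(w @ X - q @ X)
err_naive = torch.norm(w @ X - naive @ X)
```

The real method processes weights in blocks and reuses a Cholesky factor of $H^{-1}$, which is what makes it tractable for billion-parameter models.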
How AWQ Works
AWQ’s insight: not all weights matter equally. Weights that multiply consistently large input activations are “salient”—quantizing them hurts more.
AWQ protects salient weights by:
- Identifying which input channels cause large activations
- Scaling up weights for those channels (more quantization levels)
- Scaling down the corresponding activations to compensate
# AWQ concept (simplified)
def compute_saliency(weight, activations):
    """Which input channels matter most?"""
    act_scale = activations.abs().mean(dim=0)   # per-channel activation magnitude
    weight_scale = weight.abs().mean(dim=0)     # per-channel weight magnitude
    return act_scale * weight_scale

saliency = compute_saliency(layer.weight, sample_activations)

# Scale up important channels before quantizing
scales = (saliency / saliency.max()).sqrt()
scaled_weight = layer.weight * scales

# Now quantize - important channels have more precision
# (at inference, the corresponding activations are divided by the same scales)
quantized = quantize(scaled_weight)
Decision Flowchart
(Flowchart: choosing between int8 and int4 targets; the comparison table below summarizes the options.)
Quick Comparison
| Method | Speed | Bits | Accuracy | Best For |
|---|---|---|---|---|
| Dynamic PTQ | Fast | 8 | Good | Quick start |
| Static PTQ | Fast | 8 | Good | Production |
| GPTQ | Medium | 4 | Great | LLMs |
| AWQ | Medium | 4 | Great | LLMs |
| QAT | Slow | 2-8 | Best | Extreme compression |
Quantization can fail silently. Your model might look fine on average but fail on specific inputs.
Always benchmark after quantizing:
- Run standard evaluation metrics
- Test edge cases
- Check specific failure modes
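A minimal sanity check along these lines compares a model's outputs before and after quantization over a batch of inputs (the toy model and tolerance here are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
model.eval()

quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Compare outputs on a batch of inputs, not just one "average" case
x = torch.randn(128, 64)
with torch.no_grad():
    ref, out = model(x), quantized(x)

mean_err = (ref - out).abs().mean().item()
max_err = (ref - out).abs().max().item()
print(f"mean abs error: {mean_err:.4f}, worst-case: {max_err:.4f}")
# A small mean error can hide a large worst-case error - check both
```

For a real model, replace the random batch with held-out task data and compare task metrics, not just raw output deltas.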
Next Up
Theory complete! Let’s put it into practice. The next section walks through quantizing a real LLM.
Continue to Quantization in Practice.