ML Concepts • Part 3 of 5

Quantization Methods

Post-training quantization vs quantization-aware training


You know the “what” of quantization. Now let’s cover the “how”—the methods for actually quantizing a model.

The Two Approaches

📦
Post-Training (PTQ)
Quantize after training is complete
Fast—no retraining
Works for huge models
May lose some accuracy
🔧
Quantization-Aware (QAT)
Train with quantization in the loop
Best accuracy
Works for extreme quantization
Requires training compute

Post-Training Quantization (PTQ)

PTQ is simple: take a trained model and convert its weights to lower precision.

🧠 Trained Model (float32) → ⚙️ PTQ (convert weights) → 🚀 Quantized Model (int8/int4)
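To make "convert weights" concrete, here is a minimal sketch of symmetric absmax quantization, the simplest PTQ scheme: map the range [−max|w|, max|w|] onto the int8 range [−127, 127]. The function names are illustrative, not a library API.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric absmax quantization: map [-max|w|, max|w|] to [-127, 127]."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
# Round-trip error is bounded by half a quantization step (scale / 2)
error = (w - dequantize(q, scale)).abs().max()
```

The only stored state per tensor is the int8 values plus one float scale, which is where the 4x memory saving over float32 comes from.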

Static vs Dynamic

Static PTQ
Fixed scales from calibration data
• Run sample data to find ranges
• Scales fixed at inference time
• Faster inference
Dynamic PTQ
Scales computed per input
• No calibration needed
• Adapts to each input
• Slightly slower
Implementation
import torch
import torch.nn as nn

model = YourModel()

# Dynamic PTQ - just one line!
model_quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},  # Which layers to quantize
    dtype=torch.qint8
)
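Static PTQ needs a little more ceremony because of the calibration step. Below is a sketch using PyTorch's eager-mode API (`torch.quantization.prepare`/`convert`); the toy model and calibration data are stand-ins for your own.

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    """Toy model with the Quant/DeQuant stubs eager-mode static PTQ requires."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(16, 8)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)       # float -> int8 at the model boundary
        x = self.relu(self.fc(x))
        return self.dequant(x)  # int8 -> float on the way out

model = SmallNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")

prepared = torch.quantization.prepare(model)      # insert observers
for _ in range(8):                                # calibration: record ranges
    prepared(torch.randn(4, 16))
quantized = torch.quantization.convert(prepared)  # freeze scales, swap int8 ops
```

The calibration loop is the key difference from dynamic PTQ: observers watch representative inputs, and `convert` freezes the observed ranges into fixed scales.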

Quantization-Aware Training (QAT)

QAT inserts “fake quantization” during training. The model learns to be robust to quantization noise.

The Straight-Through Estimator Trick
Forward Pass
y = quantize(x)
Simulate quantization error
Backward Pass
∂L/∂x = ∂L/∂y
Gradient flows through unchanged

Forward: apply quantization. Backward: pretend it didn’t happen.

Mathematical Details

Forward: y = Q(x), where Q is quantize-then-dequantize

Backward: ∂L/∂x = ∂L/∂y (straight-through)

This is mathematically “wrong” but works remarkably well in practice.
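The straight-through estimator fits naturally into a custom `torch.autograd.Function`: quantize in `forward`, pass the gradient through untouched in `backward`. A minimal sketch (the scale value is arbitrary):

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize-dequantize in forward; straight-through gradient in backward."""
    @staticmethod
    def forward(ctx, x, scale):
        # Simulate int8 quantization error in the forward pass
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Pretend quantization was the identity: gradient flows through unchanged
        return grad_out, None

x = torch.randn(5, requires_grad=True)
y = FakeQuant.apply(x, torch.tensor(0.1))
y.sum().backward()
# x.grad is all ones: round() alone would have zero gradient everywhere,
# but the STE lets the loss signal reach the weights anyway.
```

Without the STE, `round` has zero gradient almost everywhere and training would stall; QAT frameworks insert modules like this around weights and activations during fine-tuning.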

💡When to Use QAT
  • PTQ gives unacceptable accuracy loss
  • You need aggressive quantization (int4, int2)
  • You have compute budget for training/fine-tuning

Modern LLM Methods

For large language models, specialized methods have emerged:

GPTQ
Optimal Brain Quantization for LLMs
Layer-wise optimization
Quantizes one layer at a time
Error compensation
Adjusts remaining weights to cancel error
Calibration data
Uses ~128 samples
AWQ
Activation-Aware Weight Quantization
Protects salient weights
Important weights get more precision
Channel scaling
Scales up important channels before quantizing
Activation-based
Importance judged by activation magnitude

How GPTQ Works

GPTQ’s insight: weight errors can cancel out. If one weight is rounded up, another can be rounded down to compensate.

Instead of independently rounding each weight, GPTQ:

  1. Quantizes weights one at a time
  2. After each quantization, adjusts remaining weights to minimize output error
  3. Repeats until all weights are quantized

The result: much better accuracy than naive rounding.
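The loop above can be sketched for a single weight row. This is a heavily simplified illustration of the GPTQ idea (no blocking, no Cholesky tricks, naive Hessian inverse), not the real implementation; `gptq_row` and the damping constant are hypothetical.

```python
import torch

def gptq_row(w, X, levels):
    """Quantize one weight row sequentially, compensating remaining weights."""
    H = X.T @ X + 1e-4 * torch.eye(X.shape[1])  # layer Hessian (damped)
    Hinv = torch.inverse(H)
    w = w.clone()
    q = torch.zeros_like(w)
    for i in range(len(w)):
        q[i] = levels[(levels - w[i]).abs().argmin()]  # round to nearest level
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i:] -= err * Hinv[i:, i]  # adjust later weights to cancel the error
    return q

torch.manual_seed(0)
X = torch.randn(64, 8)              # calibration activations
w = torch.randn(8)                  # one weight row
levels = torch.linspace(-2, 2, 16)  # a coarse 4-bit grid

q_naive = levels[(levels[None, :] - w[:, None]).abs().argmin(dim=1)]
q_gptq = gptq_row(w, X, levels)
err_naive = ((X @ w - X @ q_naive) ** 2).sum()
err_gptq = ((X @ w - X @ q_gptq) ** 2).sum()
```

On random data like this, the compensated version typically produces a noticeably smaller output error ‖WX − QX‖ than rounding each weight independently.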

GPTQ error compensation: quantize w₁ → adjust w₂, w₃ to compensate → move on to the next weight.
Mathematical Details

GPTQ minimizes layer-wise error:

min_Q ‖WX − QX‖²_F

Using the Hessian H = XᵀX, it updates the remaining weights after quantizing w_q:

δw = −(w_q − quant(w_q)) / [H⁻¹]_qq · (H⁻¹)_:,q

How AWQ Works

AWQ’s insight: not all weights matter equally. Weights that consistently produce large activations are “salient”—quantizing them hurts more.

AWQ protects salient weights by:

  1. Identifying which input channels cause large activations
  2. Scaling up weights for those channels (more quantization levels)
  3. Scaling down the corresponding activations to compensate
Implementation
# AWQ concept (simplified)
def compute_saliency(weight, activations):
    """Which channels matter most?"""
    act_scale = activations.abs().mean(dim=0)
    weight_scale = weight.abs().mean(dim=0)
    return act_scale * weight_scale

saliency = compute_saliency(layer.weight, sample_activations)

# Scale up important channels before quantizing
scales = (saliency / saliency.max()).sqrt()
scaled_weight = layer.weight * scales

# Now quantize - important channels have more precision
quantized = quantize(scaled_weight)
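Why is this scaling "free"? Before quantization it is an exact identity: scaling weight columns up and the matching input channels down by the same factor leaves the output unchanged, so any accuracy difference comes purely from where the rounding lands. A quick numerical check (values here are arbitrary):

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 16)    # [out_features, in_features]
x = torch.randn(16)
s = torch.rand(16) + 0.5  # per-input-channel scales (illustrative values)

y_ref = W @ x
# Scale weight columns up, compensate by scaling the input channels down:
y_scaled = (W * s) @ (x / s)
# In float these match exactly (up to rounding); after quantization, the
# scaled-up "salient" columns span more quantization levels and lose less.
```

In practice the inverse scaling on activations is folded into the preceding layer or normalization, so inference pays no extra cost.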

Decision Flowchart

Which Method Should You Use?
🎯
Need int8?
→ Dynamic PTQ (one line of code)
🤖
Quantizing an LLM to int4?
→ GPTQ or AWQ (most common)
📉
PTQ accuracy unacceptable?
→ Try AWQ, then QAT if needed
Need extreme compression (int2)?
→ QAT is your only option

Quick Comparison

Method        Speed    Bits   Accuracy   Best For
Dynamic PTQ   Fast     8      Good       Quick start
Static PTQ    Fast     8      Good       Production
GPTQ          Medium   4      Great      LLMs
AWQ           Medium   4      Great      LLMs
QAT           Slow     2-8    Best       Extreme compression

Next Up

ℹ️Note

Theory complete! Let’s put it into practice. The next section walks through quantizing a real LLM.

Continue to Quantization in Practice.