ML Concepts • Part 3 of 5

Quantization Methods

Post-training quantization vs quantization-aware training


You know the “what” of quantization. Now let’s cover the “how”—the methods for actually quantizing a model.

The Two Approaches

📦
Post-Training (PTQ)
Quantize after training is complete
Fast—no retraining
Works for huge models
May lose some accuracy
🔧
Quantization-Aware (QAT)
Train with quantization in the loop
Best accuracy
Works for extreme quantization
Requires training compute

Post-Training Quantization (PTQ)

PTQ is simple: take a trained model and convert its weights to lower precision.

🧠 Trained Model (float32) → ⚙️ PTQ (convert weights) → 🚀 Quantized Model (int8/int4)
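To make "convert weights" concrete, here is a minimal sketch of symmetric absmax quantization, the simplest PTQ scheme: map the range [−max|w|, max|w|] onto the int8 range [−127, 127]. The function names are illustrative, not a library API.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric absmax quantization: map [-max|w|, max|w|] to [-127, 127]."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
# Round-trip error is bounded by half a quantization step (scale / 2)
error = (w - dequantize(q, scale)).abs().max()
```

The only stored state per tensor is the int8 values plus one float scale, which is where the 4x memory saving over float32 comes from.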

Static vs Dynamic

Static PTQ
Fixed scales from calibration data
• Run sample data to find ranges
• Scales fixed at inference time
• Faster inference
Dynamic PTQ
Scales computed per input
• No calibration needed
• Adapts to each input
• Slightly slower
Implementation
import torch
import torch.nn as nn

model = YourModel()

# Dynamic PTQ - just one line!
model_quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},  # Which layers to quantize
    dtype=torch.qint8
)
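Static PTQ needs a little more ceremony because of the calibration step. Below is a sketch using PyTorch's eager-mode API (`torch.quantization.prepare`/`convert`); the toy model and calibration data are stand-ins for your own.

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    """Toy model with the Quant/DeQuant stubs eager-mode static PTQ requires."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(16, 8)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)       # float -> int8 at the model boundary
        x = self.relu(self.fc(x))
        return self.dequant(x)  # int8 -> float on the way out

model = SmallNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")

prepared = torch.quantization.prepare(model)      # insert observers
for _ in range(8):                                # calibration: record ranges
    prepared(torch.randn(4, 16))
quantized = torch.quantization.convert(prepared)  # freeze scales, swap int8 ops
```

The calibration loop is the key difference from dynamic PTQ: observers watch representative inputs, and `convert` freezes the observed ranges into fixed scales.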

Quantization-Aware Training (QAT)

QAT inserts “fake quantization” during training. The model learns to be robust to quantization noise.

The Straight-Through Estimator Trick
Forward Pass
y = quantize(x)
Simulate quantization error
Backward Pass
∂L/∂x = ∂L/∂y
Gradient flows through unchanged

Forward: apply quantization. Backward: pretend it didn’t happen.

Mathematical Details

Forward: y = Q(x), where Q is quantize-then-dequantize

Backward: ∂L/∂x = ∂L/∂y (straight-through)

This is mathematically “wrong” but works remarkably well in practice.
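The straight-through estimator fits naturally into a custom `torch.autograd.Function`: quantize in `forward`, pass the gradient through untouched in `backward`. A minimal sketch (the scale value is arbitrary):

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize-dequantize in forward; straight-through gradient in backward."""
    @staticmethod
    def forward(ctx, x, scale):
        # Simulate int8 quantization error in the forward pass
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Pretend quantization was the identity: gradient flows through unchanged
        return grad_out, None

x = torch.randn(5, requires_grad=True)
y = FakeQuant.apply(x, torch.tensor(0.1))
y.sum().backward()
# x.grad is all ones: round() alone would have zero gradient everywhere,
# but the STE lets the loss signal reach the weights anyway.
```

Without the STE, `round` has zero gradient almost everywhere and training would stall; QAT frameworks insert modules like this around weights and activations during fine-tuning.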

💡When to Use QAT
  • PTQ gives unacceptable accuracy loss
  • You need aggressive quantization (int4, int2)
  • You have compute budget for training/fine-tuning

Modern LLM Methods

For large language models, specialized methods have emerged:

GPTQ
Optimal Brain Quantization for LLMs
Layer-wise optimization
Quantizes one layer at a time
Error compensation
Adjusts remaining weights to cancel error
Calibration data
Uses ~128 samples
AWQ
Activation-Aware Weight Quantization
Protects salient weights
Important weights get more precision
Channel scaling
Scales up important channels before quantizing
Activation-based
Importance judged by activation magnitude

How GPTQ Works

GPTQ’s insight: weight errors can cancel out. If one weight is rounded up, another can be rounded down to compensate.

Instead of independently rounding each weight, GPTQ:

  1. Quantizes weights one at a time
  2. After each quantization, adjusts remaining weights to minimize output error
  3. Repeats until all weights are quantized

The result: much better accuracy than naive rounding.
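The loop above can be sketched for a single weight row. This is a heavily simplified illustration of the GPTQ idea (no blocking, no Cholesky tricks, naive Hessian inverse), not the real implementation; `gptq_row` and the damping constant are hypothetical.

```python
import torch

def gptq_row(w, X, levels):
    """Quantize one weight row sequentially, compensating remaining weights."""
    H = X.T @ X + 1e-4 * torch.eye(X.shape[1])  # layer Hessian (damped)
    Hinv = torch.inverse(H)
    w = w.clone()
    q = torch.zeros_like(w)
    for i in range(len(w)):
        q[i] = levels[(levels - w[i]).abs().argmin()]  # round to nearest level
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i:] -= err * Hinv[i:, i]  # adjust later weights to cancel the error
    return q

torch.manual_seed(0)
X = torch.randn(64, 8)              # calibration activations
w = torch.randn(8)                  # one weight row
levels = torch.linspace(-2, 2, 16)  # a coarse 4-bit grid

q_naive = levels[(levels[None, :] - w[:, None]).abs().argmin(dim=1)]
q_gptq = gptq_row(w, X, levels)
err_naive = ((X @ w - X @ q_naive) ** 2).sum()
err_gptq = ((X @ w - X @ q_gptq) ** 2).sum()
```

On random data like this, the compensated version typically produces a noticeably smaller output error ‖WX − QX‖ than rounding each weight independently.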

GPTQ error compensation: quantize w₁ → adjust w₂, w₃ to compensate → move on to the next weight.
Mathematical Details

GPTQ minimizes layer-wise error:

min_Q ‖WX − QX‖²_F

Using the Hessian H = XᵀX, it updates the remaining weights after quantizing w_q:

δw = −(w_q − quant(w_q)) / [H⁻¹]_qq · (H⁻¹)_:,q

How AWQ Works

AWQ’s insight: not all weights matter equally. Weights that consistently produce large activations are “salient”—quantizing them hurts more.

AWQ protects salient weights by:

  1. Identifying which input channels cause large activations
  2. Scaling up weights for those channels (more quantization levels)
  3. Scaling down the corresponding activations to compensate
Implementation
# AWQ concept (simplified)
def compute_saliency(weight, activations):
    """Which channels matter most?"""
    act_scale = activations.abs().mean(dim=0)
    weight_scale = weight.abs().mean(dim=0)
    return act_scale * weight_scale

saliency = compute_saliency(layer.weight, sample_activations)

# Scale up important channels before quantizing
scales = (saliency / saliency.max()).sqrt()
scaled_weight = layer.weight * scales

# Now quantize - important channels have more precision
quantized = quantize(scaled_weight)
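Why is this scaling "free"? Before quantization it is an exact identity: scaling weight columns up and the matching input channels down by the same factor leaves the output unchanged, so any accuracy difference comes purely from where the rounding lands. A quick numerical check (values here are arbitrary):

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 16)    # [out_features, in_features]
x = torch.randn(16)
s = torch.rand(16) + 0.5  # per-input-channel scales (illustrative values)

y_ref = W @ x
# Scale weight columns up, compensate by scaling the input channels down:
y_scaled = (W * s) @ (x / s)
# In float these match exactly (up to rounding); after quantization, the
# scaled-up "salient" columns span more quantization levels and lose less.
```

In practice the inverse scaling on activations is folded into the preceding layer or normalization, so inference pays no extra cost.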

Decision Flowchart

Which Method Should You Use?
🎯
Need int8?
→ Dynamic PTQ (one line of code)
🤖
Quantizing an LLM to int4?
→ GPTQ or AWQ (most common)
📉
PTQ accuracy unacceptable?
→ Try AWQ, then QAT if needed
Need extreme compression (int2)?
→ QAT is your only option

Quick Comparison

Method        Speed    Bits   Accuracy   Best For
Dynamic PTQ   Fast     8      Good       Quick start
Static PTQ    Fast     8      Good       Production
GPTQ          Medium   4      Great      LLMs
AWQ           Medium   4      Great      LLMs
QAT           Slow     2-8    Best       Extreme compression

Next Up

ℹ️Note

Theory complete! Let’s put it into practice. The next section walks through quantizing a real LLM.

Continue to Quantization in Practice.