Number Representations
To understand quantization, you need to understand how computers store numbers. Let’s explore the formats you’ll encounter.
Explore Float Formats
Each floating-point format divides its bits between an exponent (which sets the range) and a mantissa (which sets the precision):
- float16: More mantissa bits = better precision for small values
- bfloat16: More exponent bits = larger range, used for training
- float32: Full precision baseline
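To make the layout concrete, here is a small sketch (variable names are ours) that pulls apart the bits of a float16 value: 1 sign bit, 5 exponent bits (bias 15), and 10 mantissa bits.

```python
import numpy as np

# Reinterpret the 16 bits of a float16 as a plain integer
bits = int(np.array(1.5, dtype=np.float16).view(np.uint16))

sign = bits >> 15                # 1 sign bit
exponent = (bits >> 10) & 0x1F   # 5 exponent bits (biased by 15)
mantissa = bits & 0x3FF          # 10 mantissa bits

# Reconstruct the value: (-1)^sign * (1 + mantissa/1024) * 2^(exponent - 15)
value = (-1) ** sign * (1 + mantissa / 1024) * 2.0 ** (exponent - 15)
print(sign, exponent, mantissa, value)  # 0 15 512 1.5
```

The same decoding works for any normal float16 value; subnormals and special values (inf, NaN) use reserved exponent patterns and are omitted here.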
Float16 vs BFloat16
The two 16-bit formats make different tradeoffs:
float16 has more precision (10 mantissa bits vs bfloat16’s 7), so it produces more accurate results. Use it for inference, where you don’t need to handle large gradient values.
bfloat16 has the same range as float32 (8 exponent bits), so it won’t overflow during training when gradients and intermediate activations can spike to large values. Google designed it specifically for TPU training, but it’s now widely supported on GPUs too.
```python
import torch

x = torch.tensor(100000.0)

# float16: OVERFLOW! float16's max value is 65504, so .half() returns inf
x.half()      # → tensor(inf, dtype=torch.float16)

# bfloat16: works fine (with precision loss)
x.bfloat16()  # → tensor(99840., dtype=torch.bfloat16)
```

Training involves large intermediate values (gradients, activations). bfloat16’s wider range prevents overflow, while float16 would produce inf and break training.
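One way to see why bfloat16 keeps float32’s range: it is essentially float32 with the low 16 mantissa bits dropped. A minimal numpy sketch of the conversion (round-to-nearest-even; the function name is ours):

```python
import numpy as np

def to_bfloat16(x):
    """Round a float32 value to bfloat16 by keeping only the top 16 bits,
    with round-to-nearest-even handled via an integer bias before truncation."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)   # ties-to-even bit
    rounded = bits + np.uint32(0x7FFF) + lsb
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

print(to_bfloat16(100000.0))  # 99840.0 — survives, unlike float16
print(to_bfloat16(0.1))       # ~0.100098 — only ~8 bits of mantissa precision
```

Because the exponent field is untouched, anything representable in float32 survives the conversion; only mantissa precision is lost.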
Integer Quantization
Integer quantization is simpler: we pick a scale factor, divide every value by it, and round. The result is an integer.
For int8: we map the range [-max, +max] to [-127, +127].
scale = max_value / 127

For b-bit symmetric quantization:

scale = max(|x|) / (2^(b-1) - 1)
q = clip(round(x / scale), -(2^(b-1) - 1), 2^(b-1) - 1)

To recover the approximate value:

x̂ ≈ q × scale
```python
import numpy as np

def quantize_symmetric(x, bits=8):
    qmax = 2**(bits-1) - 1  # 127 for int8
    scale = np.max(np.abs(x)) / qmax
    q = np.round(x / scale).clip(-qmax, qmax).astype(np.int8)
    return q, scale

# Example
weights = np.array([0.247, -0.103, 0.089, -0.156])
q, scale = quantize_symmetric(weights)
print(f"Quantized: {q}")      # [127, -53, 46, -80]
print(f"Scale: {scale:.5f}")  # 0.00194
```

Symmetric vs Asymmetric
- Weights: Usually symmetric (centered near zero)
- Activations: Often asymmetric (ReLU outputs are positive)
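As a sketch of the asymmetric (affine) case — assuming a uint8 target and a nonzero value range; function and variable names are ours:

```python
import numpy as np

def quantize_asymmetric(x, bits=8):
    """Affine quantization: map [min(x), max(x)] onto [0, 2^bits - 1].
    A zero_point records which integer represents the real value 0.0."""
    qmax = 2**bits - 1                             # 255 for uint8
    scale = (x.max() - x.min()) / qmax             # assumes x.max() > x.min()
    zero_point = int(np.round(-x.min() / scale))   # integer mapped to 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

# Values spanning an asymmetric range around zero
acts = np.array([-1.0, 0.0, 2.0])
q, scale, zp = quantize_asymmetric(acts)
print(q, zp)                                # q = [0, 85, 255], zero_point = 85
print((q.astype(np.float64) - zp) * scale)  # ≈ [-1.0, 0.0, 2.0]
```

The zero_point shifts the integer grid so that the full [0, 255] range covers an off-center value distribution — exactly the situation with ReLU outputs, which are all non-negative.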
Per-Tensor vs Per-Channel
Should you use one scale for the whole layer, or one scale per channel?
Different channels can have very different weight magnitudes. Per-channel quantization adapts to each.
Per-channel: each row gets its own scale → much better precision
```python
def per_channel_quantize(weights, bits=8):
    """Per-channel quantization: one scale per output channel (row)."""
    qmax = 2**(bits-1) - 1
    # One scale per row (output channel)
    scales = np.max(np.abs(weights), axis=1) / qmax
    # Quantize each channel with its own scale
    q = np.round(weights / scales[:, None]).clip(-qmax, qmax).astype(np.int8)
    return q, scales

# Comparison: four channels with wildly different magnitudes
weights = np.random.randn(4, 64) * np.array([[0.1], [1.0], [10.0], [0.01]])

q_t, s_t = quantize_symmetric(weights)    # one scale for the whole tensor
q_c, s_c = per_channel_quantize(weights)  # one scale per channel

mse_per_tensor = np.mean((weights - q_t * s_t)**2)
mse_per_channel = np.mean((weights - q_c * s_c[:, None])**2)
print(f"Per-tensor MSE: {mse_per_tensor:.6f}")
print(f"Per-channel MSE: {mse_per_channel:.6f}")
# Per-channel is typically 10-100x better!
```
Next Up
Now that you understand number formats, let’s see the methods for quantizing models: PTQ, QAT, GPTQ, and AWQ.
Continue to Quantization Methods.