Number Representations
To understand quantization, you need to understand how computers store numbers. Let’s explore the formats you’ll encounter.
Explore Float Formats
Each floating-point format divides its bits between an exponent (which sets the range) and a mantissa (which sets the precision):
- float16: More mantissa bits = better precision for small values
- bfloat16: More exponent bits = larger range, used for training
- float32: Full precision baseline
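To make the layout concrete, here is a small sketch (variable names are ours) that pulls apart the bits of a float16 value: 1 sign bit, 5 exponent bits (bias 15), and 10 mantissa bits.

```python
import numpy as np

# Reinterpret the 16 bits of a float16 as a plain integer
bits = int(np.array(1.5, dtype=np.float16).view(np.uint16))

sign = bits >> 15                # 1 sign bit
exponent = (bits >> 10) & 0x1F   # 5 exponent bits (biased by 15)
mantissa = bits & 0x3FF          # 10 mantissa bits

# Reconstruct the value: (-1)^sign * (1 + mantissa/1024) * 2^(exponent - 15)
value = (-1) ** sign * (1 + mantissa / 1024) * 2.0 ** (exponent - 15)
print(sign, exponent, mantissa, value)  # 0 15 512 1.5
```

The same decoding works for any normal float16 value; subnormals and special values (inf, NaN) use reserved exponent patterns and are omitted here.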
Float16 vs BFloat16
The two 16-bit formats make different tradeoffs:
float16 has more precision (10 mantissa bits vs bfloat16’s 7), so it produces more accurate results. Use it for inference, where you don’t need to handle large gradient values.
bfloat16 has the same range as float32 (8 exponent bits), so it won’t overflow during training when gradients and intermediate activations can spike to large values. Google designed it specifically for TPU training, but it’s now widely supported on GPUs too.
```python
import torch

x = torch.tensor(100000.0)

# float16: OVERFLOW! float16's max value is 65504, so .half() returns inf
x.half()      # → tensor(inf, dtype=torch.float16)

# bfloat16: works fine (with precision loss)
x.bfloat16()  # → tensor(99840., dtype=torch.bfloat16)
```

Training involves large intermediate values (gradients, activations). bfloat16’s wider range prevents overflow, while float16 would produce inf and break training.
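One way to see why bfloat16 keeps float32’s range: it is essentially float32 with the low 16 mantissa bits dropped. A minimal numpy sketch of the conversion (round-to-nearest-even; the function name is ours):

```python
import numpy as np

def to_bfloat16(x):
    """Round a float32 value to bfloat16 by keeping only the top 16 bits,
    with round-to-nearest-even handled via an integer bias before truncation."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)   # ties-to-even bit
    rounded = bits + np.uint32(0x7FFF) + lsb
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

print(to_bfloat16(100000.0))  # 99840.0 — survives, unlike float16
print(to_bfloat16(0.1))       # ~0.100098 — only ~8 bits of mantissa precision
```

Because the exponent field is untouched, anything representable in float32 survives the conversion; only mantissa precision is lost.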
Integer Quantization
Integer quantization is simpler: we pick a scale factor, divide every value by it, and round. The result is an integer.
For int8: we map the range [-max, +max] to [-127, +127].
scale = max_value / 127

For b-bit symmetric quantization:

scale = max(|x|) / (2^(b-1) - 1)
q = clip(round(x / scale), -(2^(b-1) - 1), 2^(b-1) - 1)

To recover the approximate value:

x̂ ≈ q × scale
```python
import numpy as np

def quantize_symmetric(x, bits=8):
    qmax = 2**(bits-1) - 1  # 127 for int8
    scale = np.max(np.abs(x)) / qmax
    q = np.round(x / scale).clip(-qmax, qmax).astype(np.int8)
    return q, scale

# Example
weights = np.array([0.247, -0.103, 0.089, -0.156])
q, scale = quantize_symmetric(weights)
print(f"Quantized: {q}")      # [127, -53, 46, -80]
print(f"Scale: {scale:.5f}")  # 0.00194
```

Symmetric vs Asymmetric
- Weights: Usually symmetric (centered near zero)
- Activations: Often asymmetric (ReLU outputs are positive)
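As a sketch of the asymmetric (affine) case — assuming a uint8 target and a nonzero value range; function and variable names are ours:

```python
import numpy as np

def quantize_asymmetric(x, bits=8):
    """Affine quantization: map [min(x), max(x)] onto [0, 2^bits - 1].
    A zero_point records which integer represents the real value 0.0."""
    qmax = 2**bits - 1                             # 255 for uint8
    scale = (x.max() - x.min()) / qmax             # assumes x.max() > x.min()
    zero_point = int(np.round(-x.min() / scale))   # integer mapped to 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

# Values spanning an asymmetric range around zero
acts = np.array([-1.0, 0.0, 2.0])
q, scale, zp = quantize_asymmetric(acts)
print(q, zp)                                # q = [0, 85, 255], zero_point = 85
print((q.astype(np.float64) - zp) * scale)  # ≈ [-1.0, 0.0, 2.0]
```

The zero_point shifts the integer grid so that the full [0, 255] range covers an off-center value distribution — exactly the situation with ReLU outputs, which are all non-negative.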
Per-Tensor vs Per-Channel
Should you use one scale for the whole layer, or one scale per channel?
Different channels can have very different weight magnitudes. Per-channel quantization adapts to each.
Per-channel: each row gets its own scale → much better precision
```python
def per_channel_quantize(weights, bits=8):
    """Per-channel quantization: one scale per output channel (row)."""
    qmax = 2**(bits-1) - 1
    # One scale per row (output channel)
    scales = np.max(np.abs(weights), axis=1) / qmax
    # Quantize each channel with its own scale
    q = np.round(weights / scales[:, None]).clip(-qmax, qmax).astype(np.int8)
    return q, scales

# Comparison: four channels with wildly different magnitudes
weights = np.random.randn(4, 64) * np.array([[0.1], [1.0], [10.0], [0.01]])

q_t, s_t = quantize_symmetric(weights)    # one scale for the whole tensor
q_c, s_c = per_channel_quantize(weights)  # one scale per channel

mse_per_tensor = np.mean((weights - q_t * s_t)**2)
mse_per_channel = np.mean((weights - q_c * s_c[:, None])**2)
print(f"Per-tensor MSE: {mse_per_tensor:.6f}")
print(f"Per-channel MSE: {mse_per_channel:.6f}")
# Per-channel is typically 10-100x better!
```
Next Up
Now that you understand number formats, let’s see the methods for quantizing models: PTQ, QAT, GPTQ, and AWQ.
Continue to Quantization Methods.