ML Concepts • Part 2 of 5

Number Representations

From float32 to int8: how computers store numbers


To understand quantization, you need to understand how computers store numbers. Let’s explore the formats you’ll encounter.

Explore Float Formats

The worked example below shows how the value 3.14 is stored in each format. Every float format splits its bits into a sign, a biased exponent, and a mantissa, and reconstructs the value as:

value = (−1)^sign × 1.mantissa₂ × 2^(exponent − bias)

float32 (32 bits): Sign (1) · Exponent (8) · Mantissa (23)
  Sign: 0 → +1
  Exponent: 10000000₂ = 128; 128 − 127 = 1
  Mantissa: 1.1001...₂ = 1.570000
  Result: +1 × 1.5700 × 2¹ = 3.14000
  Range: ±3e+38 · Precision: ~7 decimal digits

float16 (16 bits): Sign (1) · Exponent (5) · Mantissa (10)
  Sign: 0 → +1
  Exponent: 10000₂ = 16; 16 − 15 = 1
  Mantissa: 1.1001...₂ = 1.569336
  Result: +1 × 1.5693 × 2¹ = 3.13867
  Range: ±7e+4 · Precision: ~4 decimal digits

bfloat16 (16 bits): Sign (1) · Exponent (8) · Mantissa (7)
  Sign: 0 → +1
  Exponent: 10000000₂ = 128; 128 − 127 = 1
  Mantissa: 1.1001...₂ = 1.562500
  Result: +1 × 1.5625 × 2¹ = 3.12500
  Range: ±3e+38 · Precision: ~3 decimal digits

See "How does floating point work?" below for a detailed explanation.
How does floating point work?
The idea: Scientific notation in binary
Just like we write 6.02 × 10²³ in decimal, computers store numbers as 1.xxxx × 2ⁿ in binary.
Sign bit
Is it positive or negative?
0 = positive, 1 = negative
Exponent
How big/small? (the power of 2)
Stored with a bias so we can represent tiny numbers too
Mantissa
The precise digits after "1."
The leading 1 is implicit (free bit!)
Why subtract a bias from the exponent?
We need both large numbers (2¹⁰⁰) and tiny numbers (2⁻¹⁰⁰). But binary integers are always positive. The solution: store exponent + bias, then subtract the bias when decoding.
Example (float32, bias=127): To store 2⁻¹⁰, we save 127 + (−10) = 117. When reading, we compute 117 − 127 = −10.
Why is there an "implicit 1" in the mantissa?
In normalized binary, the first digit is always 1 (just like 6.02 × 10²³ always starts with a non-zero digit). Since it's always 1, we don't need to store it—we get an extra bit of precision for free!
The stored mantissa bits represent the fractional part after "1." — so 1.1011... means 1 + 1/2 + 0/4 + 1/8 + 1/16...
Key difference:
  • float16: More mantissa bits = better precision for small values
  • bfloat16: More exponent bits = larger range, used for training
  • float32: Full precision baseline
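The decoding steps above can be sketched in a few lines of Python: unpack the raw bits of a float32 and rebuild the value from its sign, biased exponent, and implicit-1 mantissa. (`decode_float32` is a helper written for this illustration, not a standard library function.)

```python
import struct

def decode_float32(x):
    """Split a float32 into sign, exponent, and mantissa fields, then rebuild it."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32 bits as an integer
    sign = bits >> 31                    # 1 bit: 0 = positive, 1 = negative
    exponent = (bits >> 23) & 0xFF       # 8 bits, stored with bias 127
    mantissa = bits & 0x7FFFFF           # 23 bits: the fraction after the implicit "1."
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
    return sign, exponent - 127, 1 + mantissa / 2**23, value

sign, exp, mant, value = decode_float32(3.14)
print(sign, exp, mant)  # 0, 1, 1.57000005... — matches the worked example above
```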

Float16 vs BFloat16

The two 16-bit formats make different tradeoffs:

float16: S | EEEEE | MMMMMMMMMM
  5 exponent bits → smaller range
  10 mantissa bits → more precision
  • Max: ~65,504
bfloat16: S | EEEEEEEE | MMMMMMM
  8 exponent bits → same range as float32
  7 mantissa bits → less precision
  • Max: ~3.4×10³⁸
💡 When to Use Which?

float16 has more precision (10 mantissa bits vs 7), so it produces more accurate results. Use it for inference where you don’t need to handle large gradient values.

bfloat16 has the same range as float32 (8 exponent bits), so it won’t overflow during training when gradients and intermediate activations can spike to large values. Google designed it specifically for TPU training, but it’s now widely supported on GPUs too.

📌 Why Range Matters for Training
import torch

x = torch.tensor(100000.0)

# float16: OVERFLOW! Returns inf
# .half() converts to float16
x.half()  # → tensor(inf, dtype=torch.float16)

# bfloat16: works fine (with precision loss)
x.bfloat16()  # → tensor(99840., dtype=torch.bfloat16)

Training involves large intermediate values (gradients, activations). bfloat16’s wider range prevents overflow, while float16 would produce inf and break training.
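NumPy has no bfloat16 dtype, but its float16 shows the same range-vs-precision tradeoff without any GPU libraries (a minimal sketch, independent of the PyTorch example above):

```python
import numpy as np

# float16 overflows just past its maximum of ~65,504
big = np.float16(70000.0)    # out of range → inf

# within range, float16 merely rounds: ~3-4 good decimal digits survive
x = np.float16(3.14159)      # stored as ~3.1406

print(np.isinf(big), float(x))
```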

Integer Quantization

Integer quantization is simpler: we pick a scale factor, divide every value by it, and round to the nearest integer.

For int8: we map the range [-max, +max] to [-127, +127].

Symmetric Quantization

0.247 (float32) → ÷ scale (0.00195) → round → 127 (int8)

scale = max_value / 127

Mathematical Details

For b-bit symmetric quantization:

q = round(x / s),  where  s = max(|x|) / (2^(b−1) − 1)

To recover the approximate value:

x̂ = q × s

Implementation
import numpy as np

def quantize_symmetric(x, bits=8):
    qmax = 2**(bits-1) - 1  # 127 for int8
    scale = np.max(np.abs(x)) / qmax
    q = np.round(x / scale).clip(-qmax, qmax).astype(np.int8)
    return q, scale

# Example
weights = np.array([0.247, -0.103, 0.089, -0.156])
q, scale = quantize_symmetric(weights)
print(f"Quantized: {q}")  # [127, -53, 46, -80]
print(f"Scale: {scale:.5f}")  # 0.00194

Symmetric vs Asymmetric

Symmetric (−127 ↔ 0 ↔ +127)
  • Zero maps to zero exactly
  • Good for weights (centered around 0)
  • Simpler computation
Asymmetric (0 ↔ zero_pt ↔ 255)
  • Adds a “zero point” offset
  • Good for ReLU activations (all positive)
  • Uses full range efficiently
💡 Rule of Thumb
  • Weights: Usually symmetric (centered near zero)
  • Activations: Often asymmetric (ReLU outputs are positive)
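To make the “zero point” concrete, here is a sketch of asymmetric quantization to uint8 (`quantize_asymmetric` is an illustrative helper written for this example, not a library function):

```python
import numpy as np

def quantize_asymmetric(x, bits=8):
    """Map [min(x), max(x)] onto [0, 2^bits - 1] with a zero-point offset."""
    qmax = 2**bits - 1                           # 255 for uint8
    scale = (x.max() - x.min()) / qmax
    zero_point = int(round(-x.min() / scale))    # the integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

# ReLU-style activations: all non-negative, so symmetric int8 would waste half its range
acts = np.array([0.0, 0.5, 1.2, 2.4])
q, scale, zp = quantize_asymmetric(acts)

# Dequantize: x_hat = (q - zero_point) * scale
x_hat = (q.astype(np.int32) - zp) * scale
print(q, zp)  # zero_point is 0 here because min(acts) is already 0.0
```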

Per-Tensor vs Per-Channel

Should you use one scale for the whole layer, or one scale per channel?

Different channels can have very different weight magnitudes. Per-channel quantization adapts to each.

Example: 4 channels with different scales
  • Channel 0: small weights
  • Channel 1: medium weights
  • Channel 2: large weights
  • Channel 3: tiny weights

Per-channel: each row gets its own scale → much better precision

Implementation
def per_channel_quantize(weights, bits=8):
    """Per-channel quantization: one scale per output channel."""
    qmax = 2**(bits-1) - 1

    # One scale per row (output channel); keepdims makes it broadcast cleanly
    scales = np.max(np.abs(weights), axis=1, keepdims=True) / qmax

    # Quantize each channel with its own scale
    q = np.round(weights / scales).clip(-qmax, qmax).astype(np.int8)

    return q, scales

# Comparison: reconstruction error, per-tensor vs per-channel
weights = np.random.randn(4, 64) * np.array([[0.1], [1.0], [10.0], [0.01]])

q_t, s_t = quantize_symmetric(weights)   # per-tensor, defined above
q_c, s_c = per_channel_quantize(weights)

mse_per_tensor = np.mean((q_t * s_t - weights) ** 2)
mse_per_channel = np.mean((q_c * s_c - weights) ** 2)

print(f"Per-tensor MSE: {mse_per_tensor:.6f}")
print(f"Per-channel MSE: {mse_per_channel:.6f}")
# Per-channel is typically 10-100x better!

Quick Reference

float32:  Range ±3.4×10³⁸ · Precision ~7 digits · 4 bytes per value
float16:  Range ±65,504 · Precision ~4 digits · 2 bytes per value
bfloat16: Range ±3.4×10³⁸ · Precision ~3 digits · 2 bytes per value
int8:     Range −128 to 127 · 256 discrete levels · 1 byte per value
int4:     Range −8 to 7 · 16 discrete levels · 0.5 bytes per value
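These sizes translate directly into model memory: parameter count × bytes per value gives the storage footprint. A back-of-the-envelope sketch for a hypothetical 7B-parameter model:

```python
# Storage footprint of a hypothetical 7B-parameter model at each precision
params = 7_000_000_000
bytes_per_value = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "int4": 0.5}

for fmt, nbytes in bytes_per_value.items():
    print(f"{fmt:>8}: {params * nbytes / 1e9:5.1f} GB")
# float32 needs 28 GB; int8 fits in 7 GB; int4 in 3.5 GB
```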

Next Up

ℹ️ Note

Now that you understand number formats, let’s see the methods for quantizing models: PTQ, QAT, GPTQ, and AWQ.

Continue to Quantization Methods.