Why Quantization Matters
Neural networks get bigger every year: the largest Llama 3.1 model has 405 billion parameters. Storing and running models at this scale requires serious hardware, unless you quantize.
The Memory Problem
The math is simple: Every parameter takes up space. float32 uses 4 bytes per number.
A 7B model = 7 billion × 4 bytes = 28 GB.
But int4 uses only 0.5 bytes per number. Same model = 7 billion × 0.5 bytes = 3.5 GB.
That’s the difference between “needs a data center” and “runs on a laptop.”
Memory footprint: M = (N × b) / 8 bytes, where N is the parameter count and b is the number of bits per parameter.
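The formula can be sanity-checked in a few lines of Python (the 7B model and bit widths are just the examples from above):

```python
def memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory footprint M = N * b / 8 bytes, converted to GB."""
    return n_params * bits_per_param / 8 / 1e9

# A 7B-parameter model at various precisions:
for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>7}: {memory_gb(7e9, bits):4.1f} GB")
```

Running this reproduces the numbers above: 28 GB in float32 down to 3.5 GB in int4.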
The Speed Problem
Memory size isn’t just about fitting—it’s about speed.
Bandwidth Limited
For each token, the GPU loads all weights from memory.
Smaller weights = faster loading = more tokens/sec.
GPU compute cores have tiny caches (a few MB). A 7B model is 28GB in float32—the weights simply don’t fit. They must be streamed from memory for every forward pass. This is called being memory-bound: the GPU spends most of its time waiting for data, not computing.
70B model in float32 (280 GB), on a GPU with 2000 GB/s memory bandwidth:
- Load time: 280 GB ÷ 2000 GB/s = 140 ms
- Result: ~7 tokens/second
Same model in int4 (35 GB):
- Load time: 35 GB ÷ 2000 GB/s = 17.5 ms
- Result: ~57 tokens/second
8× faster from quantization alone!
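This back-of-the-envelope math can be sketched in a couple of lines, assuming the same 2000 GB/s bandwidth figure used above (real decoding also pays attention and KV-cache costs, so treat these as rough upper bounds):

```python
def tokens_per_second(model_gb: float, bandwidth_gb_s: float = 2000.0) -> float:
    # Memory-bound decoding: every token streams all weights once,
    # so throughput is roughly bandwidth / model size.
    return bandwidth_gb_s / model_gb

print(f"70B float32: {tokens_per_second(280):.0f} tok/s")  # ~7
print(f"70B int4:    {tokens_per_second(35):.0f} tok/s")   # ~57
```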
Why It Works
Here’s the surprising part: neural networks barely notice.
```python
import torch
import torch.nn as nn

# Quick experiment: how much does int8 hurt?
torch.manual_seed(42)
layer = nn.Linear(512, 512)
x = torch.randn(32, 512)

with torch.no_grad():
    y_original = layer(x)

# Simple symmetric quantization
def quantize_int8(t):
    scale = t.abs().max() / 127
    return torch.round(t / scale).clamp(-128, 127) * scale

layer.weight.data = quantize_int8(layer.weight.data)

with torch.no_grad():
    y_quantized = layer(x)

error = (y_original - y_quantized).abs().mean() / y_original.abs().mean()
print(f"Relative error: {error:.2%}")  # Typically < 1%
```

When Does It Hurt?
Quantization is lossy, and some models and layers are more sensitive than others. Always measure accuracy after quantization.
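One classic failure mode is outlier weights: symmetric per-tensor quantization picks its scale from the largest absolute value, so a single extreme weight stretches the grid and crushes the resolution available to everything else. A minimal sketch with synthetic data, using the same quantization scheme as the experiment above:

```python
import random

def quantize_int8(values):
    """Symmetric per-tensor int8 quantization (same scheme as above)."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) * scale for v in values]

def rel_error(orig, quant):
    return sum(abs(a - b) for a, b in zip(orig, quant)) / sum(abs(a) for a in orig)

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(1000)]
outliers = weights[:-1] + [100.0]  # one extreme value stretches the scale

print(f"no outlier:   {rel_error(weights, quantize_int8(weights)):.2%}")
print(f"with outlier: {rel_error(outliers, quantize_int8(outliers)):.2%}")
```

With well-behaved weights the relative error stays around 1%; the single outlier pushes it more than an order of magnitude higher. This is exactly why modern schemes quantize per-channel or per-group instead of per-tensor.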
Quick Reference
RL systems have distinct phases with different precision needs:
- Training neural networks: Use `float16` or `bfloat16`. You need gradient precision, but full `float32` is overkill. Mixed-precision training gives you a 2× speedup with minimal accuracy loss.
- Collecting rollouts: Use `int8` for policy inference. When running thousands of parallel environments, inference speed matters more than perfect precision. The policy just needs to pick good actions.
- Edge deployment: Use `int4` when deploying to robots, drones, or embedded devices. Memory and power are constrained, and real-time response matters more than optimal actions.
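As a rough sketch of what the first two choices look like in PyTorch (a hypothetical toy policy network; `torch.autocast` handles mixed-precision math, and `quantize_dynamic` handles int8 inference; int4 typically needs a dedicated library such as bitsandbytes or a GPTQ/AWQ implementation):

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(64, 4))  # hypothetical toy policy network

# Training: run the forward pass in bfloat16 while weights stay float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = policy(torch.randn(8, 64))
print(logits.dtype)  # bfloat16 activations, float32 master weights

# Rollouts: dynamic int8 quantization of linear layers for fast CPU inference.
policy_int8 = torch.ao.quantization.quantize_dynamic(
    policy, {nn.Linear}, dtype=torch.qint8
)
actions = policy_int8(torch.randn(8, 64)).argmax(dim=-1)
print(actions.shape)
```

On GPU, the same `torch.autocast` pattern applies with `device_type="cuda"`; the int8 path there usually goes through an inference engine rather than dynamic quantization.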
Next Up
Now that you understand why quantization works, let’s see how numbers are stored at the bit level.
Continue to Number Representations.