ML Concepts • Part 1 of 5

Why Quantization Matters

Memory, speed, and the precision tradeoff


Neural networks are getting bigger every year. The largest Llama 3.1 model has 405 billion parameters. Storing and running models at this scale requires serious hardware—unless you quantize.

The Memory Problem

Memory requirements for a 7B-parameter model at each precision:

Precision   Memory    vs. float32   Fits on
float32     28.0 GB   baseline      A100 40GB
float16     14.0 GB   2× smaller    RTX 4080
int8         7.0 GB   4× smaller    RTX 3060
int4         3.5 GB   8× smaller    RTX 3060

💡 The takeaway: a 7B model that needs 28.0 GB in float32 fits in just 3.5 GB with int4 quantization—that's 8× smaller.

Common GPU VRAM: RTX 3060: 12 GB · RTX 3080: 10 GB · RTX 4070: 12 GB · RTX 4080: 16 GB · RTX 4090: 24 GB

The math is simple: Every parameter takes up space. float32 uses 4 bytes per number.

A 7B model = 7 billion × 4 bytes = 28 GB.

But int4 uses only 0.5 bytes per number. Same model = 7 billion × 0.5 bytes = 3.5 GB.

That’s the difference between “needs a data center” and “runs on a laptop.”
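The 0.5 bytes per parameter works because two 4-bit integers pack into a single byte. A minimal sketch of the packing step (illustrative NumPy, not a production kernel):

```python
import numpy as np

# Four int4 values (valid range: -8..7)
vals = np.array([3, -2, 7, -8], dtype=np.int8)

# Keep the low 4 bits of each value, then pack pairs into single bytes
lo = (vals[0::2] & 0xF).astype(np.uint8)   # even-indexed values -> low nibble
hi = (vals[1::2] & 0xF).astype(np.uint8)   # odd-indexed values -> high nibble
packed = lo | (hi << 4)

print(packed.nbytes)  # 2 bytes for 4 values -> 0.5 bytes per value
```

Real int4 kernels unpack these nibbles on the fly during the matrix multiply, so the weights never expand back to full size in memory.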

Mathematical Details

Memory footprint:

$$\text{Memory} = N_{\text{params}} \times \frac{b}{8} \text{ bytes}$$

where $N_{\text{params}}$ is the parameter count and $b$ is the number of bits per parameter.
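The formula is easy to sanity-check in Python (the helper name is mine, for illustration):

```python
def memory_gb(n_params: float, bits: int) -> float:
    """Memory footprint in GB: parameters × (bits / 8) bytes."""
    return n_params * bits / 8 / 1e9

# Reproduce the table above for a 7B-parameter model
for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>7}: {memory_gb(7e9, bits):5.1f} GB")
# float32: 28.0, float16: 14.0, int8: 7.0, int4: 3.5
```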

The Speed Problem

Memory size isn’t just about fitting—it’s about speed.

The GPU Bottleneck: model weights (28 GB in float32 for our 7B example) sit in GPU memory and must cross a bandwidth-limited link to reach the compute units, which are fast by comparison.

For each token, the GPU loads all weights from memory.
Smaller weights = faster loading = more tokens/sec.

💡Why load weights every time?

GPU compute cores have tiny caches (a few MB). A 7B model is 28GB in float32—the weights simply don’t fit. They must be streamed from memory for every forward pass. This is called being memory-bound: the GPU spends most of its time waiting for data, not computing.

📌Real Numbers: A100 Token Generation

70B model in float32 (280 GB, more than a single A100 holds; the bandwidth math is what matters):

  • Load time: 280 GB ÷ 2000 GB/s = 140 ms
  • Result: ~7 tokens/second

Same model in int4 (35 GB):

  • Load time: 35 GB ÷ 2000 GB/s = 17.5 ms
  • Result: ~57 tokens/second

8× faster from quantization alone!
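Those numbers fall straight out of the bandwidth math. A sketch of the estimate (2000 GB/s is the approximate A100 memory bandwidth used above):

```python
def decode_tokens_per_sec(weights_gb: float, bandwidth_gb_s: float = 2000.0) -> float:
    """Memory-bound estimate: each token requires one full read of the weights."""
    return bandwidth_gb_s / weights_gb

print(f"float32: {decode_tokens_per_sec(280):.1f} tok/s")  # ~7.1
print(f"int4:    {decode_tokens_per_sec(35):.1f} tok/s")   # ~57.1
```

This is an upper bound: it ignores compute time, KV-cache reads, and batching, but for single-stream decoding it tracks real throughput surprisingly well.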

Why It Works

Here’s the surprising part: neural networks barely notice.

  • 🔢 Overparameterized: networks have far more parameters than they need, so small amounts of rounding noise don't matter.

  • 📉 Flat minima: well-trained networks sit in regions of the loss landscape where small weight changes barely affect outputs.

  • 🎯 Pattern matters: the overall activation pattern is preserved even at reduced precision.
Implementation
import torch
import torch.nn as nn

# Quick experiment: how much does int8 hurt?
torch.manual_seed(42)
layer = nn.Linear(512, 512)
x = torch.randn(32, 512)

with torch.no_grad():
    y_original = layer(x)

# Simple symmetric quantization: round to int8 steps, then scale back to
# float ("fake quantization"—this simulates the rounding error)
def quantize_int8(t):
    scale = t.abs().max() / 127
    return torch.round(t / scale).clamp(-128, 127) * scale

layer.weight.data = quantize_int8(layer.weight.data)

with torch.no_grad():
    y_quantized = layer(x)

error = (y_original - y_quantized).abs().mean() / y_original.abs().mean()
print(f"Relative error: {error:.2%}")  # Typically < 1%

When Does It Hurt?

Quick Reference

Format    Typical use            Size
float32   Training baseline      full precision
float16   Training + inference   2× smaller
int8      Standard inference     4× smaller
int4      Edge / LLMs            8× smaller
💡Quantization in RL Pipelines

RL systems have distinct phases with different precision needs:

  • Training neural networks: Use float16 or bfloat16. You need gradient precision, but full float32 is overkill. Mixed-precision training gives you 2× speedup with minimal accuracy loss.

  • Collecting rollouts: Use int8 for policy inference. When running thousands of parallel environments, inference speed matters more than perfect precision. The policy just needs to pick good actions.

  • Edge deployment: Use int4 when deploying to robots, drones, or embedded devices. Memory and power are constrained, and real-time response matters more than optimal actions.
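For the training phase, here is a minimal mixed-precision sketch using PyTorch's autocast (CPU plus bfloat16 shown for portability; on a GPU you would use device_type="cuda", and add a GradScaler when training in float16). The tiny "policy" network is a placeholder:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Linear(64, 8)              # stand-in for a small policy network
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)
obs = torch.randn(16, 64)              # fake observations
actions = torch.randint(0, 8, (16,))   # fake action labels

# Selected ops in the forward pass run in bfloat16; weights stay float32
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = nn.functional.cross_entropy(policy(obs), actions)

loss.backward()                        # gradients are computed in float32
opt.step()
print(f"loss: {loss.item():.3f}")
```

The key property: the master weights and optimizer state remain float32, so training stability is preserved while the expensive matrix multiplies run in the cheaper format.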

Next Up

ℹ️Note

Now that you understand why quantization works, let’s see how numbers are stored at the bit level.

Continue to Number Representations.