ML Concepts • Part 4 of 5

Quantization in Practice

Tools, techniques, and hands-on examples


Time to get hands-on. We’ll quantize a real LLM and measure the tradeoffs.

The Practical Workflow

1️⃣ Load Model → 2️⃣ Quantize → 3️⃣ Evaluate → 4️⃣ Deploy

Quick Start: bitsandbytes

The fastest path to quantization. One flag, and your model loads in lower precision.

  • float16: torch_dtype=torch.float16 (baseline)
  • 8-bit: load_in_8bit=True (2× smaller)
  • 4-bit: load_in_4bit=True (4× smaller)
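The "2× / 4× smaller" figures follow directly from bytes per weight. A quick back-of-the-envelope check for a 0.5B-parameter model (weights only, ignoring the small per-group scale overhead that real quantization formats add):

```python
# Rough weight-memory estimate for a 0.5B-parameter model.
# Ignores activations and quantization metadata (scales, zero-points).
params = 0.5e9

for name, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name}: ~{gb:.2f} GB")  # 1.00, 0.50, 0.25 GB respectively
```

After loading, you can compare these estimates against `model.get_memory_footprint()`, which transformers provides on loaded models.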
Implementation
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

MODEL = "Qwen/Qwen2.5-0.5B"

# Float16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

# 8-bit - just add one flag
model_8bit = AutoModelForCausalLM.from_pretrained(
    MODEL, load_in_8bit=True, device_map="auto"
)

# 4-bit with NF4 (best quality)
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=config, device_map="auto"
)

GPTQ for Production

For best 4-bit quality, use GPTQ with calibration data.

GPTQ Process: Calibration Data (~128 samples) + Pre-trained Model (float16) → GPTQ (layer-by-layer) → 4-bit Model (optimized)
Implementation
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Prepare calibration data (use domain-representative text)
calibration_data = [
    tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    for text in your_sample_texts[:128]
]

# Configure GPTQ
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,  # Helps with quality
)

# Quantize
model = AutoGPTQForCausalLM.from_pretrained(MODEL, quantize_config=config)
model.quantize(calibration_data)
model.save_quantized("./model-gptq-4bit")
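The group_size=128 setting above means each run of 128 weights shares one scale factor. GPTQ picks the quantized values using second-order calibration statistics, but the storage format is the same as this simplified round-to-nearest sketch (pure Python, illustrative only; the example weights are made up):

```python
def quantize_group(weights, bits=4):
    """Absmax round-to-nearest quantization for one group of weights."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]    # small signed integers
    return q, scale

def dequantize_group(q, scale):
    return [qi * scale for qi in q]

group = [0.12, -0.40, 0.33, 0.05, -0.21, 0.18, -0.07, 0.29]
q, scale = quantize_group(group)
restored = dequantize_group(q, scale)
# Each restored weight is within scale/2 of the original.
```

Smaller groups mean finer-grained scales (better quality, more overhead); group_size=128 is the common default. To use the saved model later, load it with `AutoGPTQForCausalLM.from_quantized("./model-gptq-4bit")`.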

Measuring Quality

Perplexity Test

Expected Results (Qwen-0.5B, lower is better):
  • float16: ~15.2
  • 8-bit: ~15.4 (+1%)
  • 4-bit: ~15.8 (+4%)
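A 4% perplexity increase is smaller than it sounds: perplexity is the exponential of the mean per-token cross-entropy, so converting the numbers above back to loss space shows how little the model actually changed.

```python
import math

ppl_fp16, ppl_4bit = 15.2, 15.8
loss_fp16 = math.log(ppl_fp16)   # ~2.72 nats per token
loss_4bit = math.log(ppl_4bit)   # ~2.76 nats per token
delta = loss_4bit - loss_fp16    # ~0.039 nats, about 1.4% more loss per token
```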
Implementation
import numpy as np
import torch
from datasets import load_dataset

def evaluate_perplexity(model, tokenizer, max_samples=100):
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    total_loss, total_tokens = 0, 0

    for sample in dataset.select(range(max_samples)):
        if not sample["text"].strip():
            continue  # skip wikitext's blank lines
        inputs = tokenizer(sample["text"], return_tensors="pt", truncation=True)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}

        with torch.no_grad():
            loss = model(**inputs, labels=inputs["input_ids"]).loss

        total_loss += loss.item() * inputs["input_ids"].shape[1]
        total_tokens += inputs["input_ids"].shape[1]

    return np.exp(total_loss / total_tokens)

# Compare
for name, model in [("FP16", model_fp16), ("8-bit", model_8bit), ("4-bit", model_4bit)]:
    ppl = evaluate_perplexity(model, tokenizer)
    print(f"{name}: {ppl:.2f}")

Speed Benchmarks

Typical Speedups (relative to float16 baseline):
  • float16: 1.0×
  • int8: ~1.5×
  • int4: ~2.0×
Results vary by hardware and model.
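One reason lower precision helps: autoregressive decoding is usually memory-bandwidth-bound, because every generated token reads all the weights. A rough throughput ceiling, assuming a hypothetical 300 GB/s GPU and the 0.5B model (real numbers are lower due to dequantization overhead and non-weight traffic):

```python
bandwidth_gb_s = 300  # hypothetical GPU memory bandwidth (illustrative)
params = 0.5e9

for name, bytes_per_param in [("float16", 2), ("int8", 1), ("int4", 0.5)]:
    weight_gb = params * bytes_per_param / 1e9
    ceiling = bandwidth_gb_s / weight_gb   # tokens/sec upper bound
    print(f"{name}: weights {weight_gb:.2f} GB -> at most {ceiling:.0f} tok/s")
```

The ideal ceiling doubles with each halving of precision; the more modest 1.5×–2× speedups in the table reflect real-world kernel overhead.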
Implementation
import time
import torch

def benchmark_speed(model, tokenizer, prompt, n_tokens=100, n_runs=5):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warmup
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=10)

    # Benchmark
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    torch.cuda.synchronize()

    total_time = time.time() - start
    tokens_per_sec = (n_tokens * n_runs) / total_time
    return tokens_per_sec

For RL Policies

RL has special needs:

  • Policy network: Speed matters most (many rollouts)
  • Value network: Accuracy matters most (training signal)

Solution: Quantize the policy more aggressively than the critic.

  • Policy Network: int8 quantization → fast inference for rollouts
  • Value Network: float16 precision → accurate value estimates
Implementation
import torch
import torch.nn as nn

class QuantizedRLPolicy(nn.Module):
    def __init__(self, policy_net, value_net):
        super().__init__()
        # Quantize policy for speed
        self.policy = torch.quantization.quantize_dynamic(
            policy_net, {nn.Linear}, dtype=torch.qint8
        )
        # Keep value at higher precision
        self.value = value_net.half()

    def forward(self, state):
        action_logits = self.policy(state)
        value = self.value(state.half())
        return action_logits, value
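The torch.quantization.quantize_dynamic call used above runs on CPU and quantizes nn.Linear weights to int8 on the fly. A minimal standalone check that quantized outputs stay close to the float network (the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))

# Dynamically quantize all Linear layers to int8
qnet = torch.quantization.quantize_dynamic(net, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(8, 32)
with torch.no_grad():
    out_fp = net(x)
    out_q = qnet(x)

max_err = (out_fp - out_q).abs().max().item()  # small quantization error
```

For a real policy network you would benchmark the quantized forward pass over a batch of rollout states before committing to int8.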

Complete Notebook

Run the full workflow yourself:

The notebook includes:

  • Load Qwen-0.5B in fp16, 8-bit, and 4-bit
  • Memory comparison
  • Perplexity evaluation
  • Speed benchmarks
  • GPTQ quantization

Runs on free Colab T4.

Quick Reference

💡Best Practices
  1. Start with float16 → nearly lossless baseline
  2. Try int8 first → usually good enough
  3. Use GPTQ/AWQ for int4 → better than bitsandbytes
  4. Always measure → perplexity, task accuracy, speed
  5. Test edge cases → unusual inputs reveal problems
Decision Tree
Need quick results? → bitsandbytes load_in_8bit
Need best 4-bit quality? → GPTQ with calibration
PTQ not good enough? → Try AWQ, then QAT
RL policy? → int8 policy, fp16 value

Next Up

ℹ️Note

Ready to test your understanding? The next section has a summary of key concepts plus hands-on coding exercises.

Continue to Summary & Exercises.