ML Concepts • Part 4 of 5

Quantization in Practice

Tools, techniques, and hands-on examples


Time to get hands-on. We’ll quantize a real LLM and measure the tradeoffs.

The Practical Workflow

1️⃣ Load Model → 2️⃣ Quantize → 3️⃣ Evaluate → 4️⃣ Deploy

Quick Start: bitsandbytes

The fastest path to quantization. One flag, and your model loads in lower precision.

  • float16: torch_dtype=torch.float16 (baseline)
  • 8-bit: load_in_8bit=True (2× smaller)
  • 4-bit: load_in_4bit=True (4× smaller)
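The "2× / 4× smaller" figures follow directly from bytes per weight. A quick back-of-the-envelope check for a 0.5B-parameter model (weights only, ignoring the small per-group scale overhead that real quantization formats add):

```python
# Rough weight-memory estimate for a 0.5B-parameter model.
# Ignores activations and quantization metadata (scales, zero-points).
params = 0.5e9

for name, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name}: ~{gb:.2f} GB")  # 1.00, 0.50, 0.25 GB respectively
```

After loading, you can compare these estimates against `model.get_memory_footprint()`, which transformers provides on loaded models.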
Implementation
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

MODEL = "Qwen/Qwen2.5-0.5B"

# Float16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

# 8-bit - just add one flag
model_8bit = AutoModelForCausalLM.from_pretrained(
    MODEL, load_in_8bit=True, device_map="auto"
)

# 4-bit with NF4 (best quality)
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=config, device_map="auto"
)

GPTQ for Production

For best 4-bit quality, use GPTQ with calibration data.

GPTQ Process: Calibration Data (~128 samples) + Pre-trained Model (float16) → GPTQ (layer-by-layer) → 4-bit Model (optimized)
Implementation
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Prepare calibration data (use domain-representative text)
calibration_data = [
    tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    for text in your_sample_texts[:128]
]

# Configure GPTQ
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,  # Helps with quality
)

# Quantize
model = AutoGPTQForCausalLM.from_pretrained(MODEL, quantize_config=config)
model.quantize(calibration_data)
model.save_quantized("./model-gptq-4bit")
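The group_size=128 setting above means each run of 128 weights shares one scale factor. GPTQ picks the quantized values using second-order calibration statistics, but the storage format is the same as this simplified round-to-nearest sketch (pure Python, illustrative only; the example weights are made up):

```python
def quantize_group(weights, bits=4):
    """Absmax round-to-nearest quantization for one group of weights."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]    # small signed integers
    return q, scale

def dequantize_group(q, scale):
    return [qi * scale for qi in q]

group = [0.12, -0.40, 0.33, 0.05, -0.21, 0.18, -0.07, 0.29]
q, scale = quantize_group(group)
restored = dequantize_group(q, scale)
# Each restored weight is within scale/2 of the original.
```

Smaller groups mean finer-grained scales (better quality, more overhead); group_size=128 is the common default. To use the saved model later, load it with `AutoGPTQForCausalLM.from_quantized("./model-gptq-4bit")`.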

Measuring Quality

Perplexity Test

Expected Results (Qwen-0.5B, lower is better):
  • float16: ~15.2
  • 8-bit: ~15.4 (+1%)
  • 4-bit: ~15.8 (+4%)
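A 4% perplexity increase is smaller than it sounds: perplexity is the exponential of the mean per-token cross-entropy, so converting the numbers above back to loss space shows how little the model actually changed.

```python
import math

ppl_fp16, ppl_4bit = 15.2, 15.8
loss_fp16 = math.log(ppl_fp16)   # ~2.72 nats per token
loss_4bit = math.log(ppl_4bit)   # ~2.76 nats per token
delta = loss_4bit - loss_fp16    # ~0.039 nats, about 1.4% more loss per token
```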
Implementation
import numpy as np
import torch
from datasets import load_dataset

def evaluate_perplexity(model, tokenizer, max_samples=100):
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    total_loss, total_tokens = 0, 0

    for sample in dataset.select(range(max_samples)):
        if not sample["text"].strip():
            continue  # skip wikitext's blank lines
        inputs = tokenizer(sample["text"], return_tensors="pt", truncation=True)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}

        with torch.no_grad():
            loss = model(**inputs, labels=inputs["input_ids"]).loss

        total_loss += loss.item() * inputs["input_ids"].shape[1]
        total_tokens += inputs["input_ids"].shape[1]

    return np.exp(total_loss / total_tokens)

# Compare
for name, model in [("FP16", model_fp16), ("8-bit", model_8bit), ("4-bit", model_4bit)]:
    ppl = evaluate_perplexity(model, tokenizer)
    print(f"{name}: {ppl:.2f}")

Speed Benchmarks

Typical Speedups (relative to float16 baseline):
  • float16: 1.0×
  • int8: ~1.5×
  • int4: ~2.0×
Results vary by hardware and model.
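One reason lower precision helps: autoregressive decoding is usually memory-bandwidth-bound, because every generated token reads all the weights. A rough throughput ceiling, assuming a hypothetical 300 GB/s GPU and the 0.5B model (real numbers are lower due to dequantization overhead and non-weight traffic):

```python
bandwidth_gb_s = 300  # hypothetical GPU memory bandwidth (illustrative)
params = 0.5e9

for name, bytes_per_param in [("float16", 2), ("int8", 1), ("int4", 0.5)]:
    weight_gb = params * bytes_per_param / 1e9
    ceiling = bandwidth_gb_s / weight_gb   # tokens/sec upper bound
    print(f"{name}: weights {weight_gb:.2f} GB -> at most {ceiling:.0f} tok/s")
```

The ideal ceiling doubles with each halving of precision; the more modest 1.5×–2× speedups in the table reflect real-world kernel overhead.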
Implementation
import time
import torch

def benchmark_speed(model, tokenizer, prompt, n_tokens=100, n_runs=5):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warmup
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=10)

    # Benchmark
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    torch.cuda.synchronize()

    total_time = time.time() - start
    tokens_per_sec = (n_tokens * n_runs) / total_time
    return tokens_per_sec

For RL Policies

RL has special needs:

  • Policy network: Speed matters most (many rollouts)
  • Value network: Accuracy matters most (training signal)

Solution: Quantize the policy more aggressively than the critic.

  • Policy Network: int8 quantization → fast inference for rollouts
  • Value Network: float16 precision → accurate value estimates
Implementation
import torch
import torch.nn as nn

class QuantizedRLPolicy(nn.Module):
    def __init__(self, policy_net, value_net):
        super().__init__()
        # Quantize policy for speed
        self.policy = torch.quantization.quantize_dynamic(
            policy_net, {nn.Linear}, dtype=torch.qint8
        )
        # Keep value at higher precision
        self.value = value_net.half()

    def forward(self, state):
        action_logits = self.policy(state)
        value = self.value(state.half())
        return action_logits, value
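The torch.quantization.quantize_dynamic call used above runs on CPU and quantizes nn.Linear weights to int8 on the fly. A minimal standalone check that quantized outputs stay close to the float network (the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))

# Dynamically quantize all Linear layers to int8
qnet = torch.quantization.quantize_dynamic(net, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(8, 32)
with torch.no_grad():
    out_fp = net(x)
    out_q = qnet(x)

max_err = (out_fp - out_q).abs().max().item()  # small quantization error
```

For a real policy network you would benchmark the quantized forward pass over a batch of rollout states before committing to int8.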

Complete Notebook

Run the full workflow yourself:

The notebook includes:

  • Load Qwen-0.5B in fp16, 8-bit, and 4-bit
  • Memory comparison
  • Perplexity evaluation
  • Speed benchmarks
  • GPTQ quantization

Runs on free Colab T4.

Quick Reference

💡Best Practices
  1. Start with float16 → nearly lossless baseline
  2. Try int8 first → usually good enough
  3. Use GPTQ/AWQ for int4 → better than bitsandbytes
  4. Always measure → perplexity, task accuracy, speed
  5. Test edge cases → unusual inputs reveal problems
Decision Tree
Need quick results? → bitsandbytes load_in_8bit
Need best 4-bit quality? → GPTQ with calibration
PTQ not good enough? → Try AWQ, then QAT
RL policy? → int8 policy, fp16 value

Next Up

ℹ️Note

Ready to test your understanding? The next section has a summary of key concepts plus hands-on coding exercises.

Continue to Summary & Exercises.