Part 02: Model Compression

Quantization Engineering: The 4-bit Frontier

How do you fit a model whose FP16 weights take 140 GB onto a 24 GB GPU? Through the magic, and the math, of quantization.

What is Quantization?

Quantization is the process of reducing the numerical precision of a model's weights. Instead of storing each weight as a 16-bit floating-point number (FP16), we can often use 8-bit, 4-bit, or even 2-bit representations with surprisingly little loss in quality (typically measured as perplexity).
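To make this concrete, here is a minimal sketch of symmetric (absmax) quantization in Python. Real formats like GGUF's K-quants or AWQ quantize group-wise with far more sophisticated scale selection; this only shows the basic round-trip and the error it introduces. The function names are illustrative, not from any library.

import numpy as np

def quantize_absmax(weights: np.ndarray, bits: int = 4):
    # Map FP32 weights onto signed integers in [-(2^(bits-1) - 1), 2^(bits-1) - 1]
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
    scale = np.abs(weights).max() / qmax  # one scale per tensor (per-group in real formats)
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original weights
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_absmax(w, bits=4)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())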

The Major Formats

GGUF (formerly GGML)

The king of CPU + GPU inference. Created by the llama.cpp team, it allows you to split ("offload") a model's layers between system RAM and VRAM, so a model larger than your GPU can still run.

# Offload 33 layers to the GPU; the remaining layers run from system RAM
./main -m llama-3-8b.Q4_K_M.gguf -ngl 33 -p "Why is the sky blue?" -n 128

AWQ (Activation-aware Weight Quantization)

Protects the small fraction of "salient" weights, identified from activation statistics, that contribute most to accuracy. Ideal for multi-GPU production inference via vLLM.
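As a sketch of how little code the vLLM path takes, the snippet below loads an AWQ checkpoint and generates from it. The model ID is an assumption for illustration; any AWQ-quantized repo on the Hugging Face Hub works the same way.

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",  # assumed example checkpoint; substitute your own
    quantization="awq",                # tell vLLM to use its AWQ kernels
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain quantization in one sentence."], params)
print(outputs[0].outputs[0].text)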

EXL2 (ExLlamaV2)

Optimized specifically for NVIDIA GPUs. Typically the fastest tokens-per-second at 4-bit to 6-bit, and it supports fractional bit rates (e.g., 4.65 bits per weight) by mixing precisions across layers.

Calculating VRAM Requirements

A quick rule of thumb for 4-bit quantization:

VRAM (GB) ≈ Params (in billions) × 0.7 + KV cache

The 0.7 factor comes from 4 bits being 0.5 bytes per parameter, with the extra ~0.2 bytes covering quantization scales and runtime overhead.

Example: a 70B model at 4-bit needs ≈ 49 GB for the weights, plus the KV cache.
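The same arithmetic in code, extended with a KV-cache term. The architecture numbers are assumptions for a Llama-70B-style model (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache); swap in your model's values.

def vram_gb(params_b: float, bytes_per_param: float = 0.7,
            n_layers: int = 80, n_kv_heads: int = 8,
            head_dim: int = 128, ctx_tokens: int = 4096,
            cache_bytes_per_value: int = 2) -> float:
    weights = params_b * bytes_per_param  # billions of params x bytes/param = GB
    # K and V caches: 2 tensors per layer, each n_kv_heads * head_dim wide, per token
    kv = 2 * n_layers * n_kv_heads * head_dim * cache_bytes_per_value * ctx_tokens / 1e9
    return weights + kv

print(f"70B @ 4-bit, 4k context: {vram_gb(70):.1f} GB")  # ~49 GB weights + ~1.3 GB KV cache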

When to use what?

  • Home Server / Mac: Stick with **GGUF**. It's the most flexible and widely supported.
  • Cloud Production (NVIDIA): Use **AWQ**. It integrates cleanly with vLLM and TensorRT-LLM.
  • Extreme Speed (Single GPU): Go with **EXL2**.