Model Quantization
Reduce model size and improve inference speed through precision reduction
(Interactive demo, quantization impact: a 100 MB FP32 model becomes 25 MB at INT8, a 4x size reduction.)
(Interactive demo, precision formats: shows how different precisions represent the same number.)
What is Quantization?
Quantization reduces the numerical precision of model weights and activations, trading accuracy for efficiency.
Quantization Formula:
$$x_q = \text{round}\left(\frac{x}{s}\right) + z$$
$$x_{\text{dequant}} = (x_q - z) \cdot s$$
Where:
- $x$ - original floating-point value
- $x_q$ - quantized integer value
- $s$ - scale factor
- $z$ - zero-point offset
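As a quick illustration, the NumPy sketch below (values and variable names are illustrative) quantizes a small FP32 tensor to unsigned 8-bit using the formula above and dequantizes it back, so the rounding error is visible directly.

```python
import numpy as np

# Minimal sketch of affine quantization: derive scale and zero-point from the
# observed min/max range, quantize to unsigned 8-bit, then dequantize.
x = np.array([-1.5, -0.2, 0.0, 0.7, 2.3], dtype=np.float32)

qmin, qmax = 0, 255
s = (x.max() - x.min()) / (qmax - qmin)      # scale factor
z = qmin - int(np.round(x.min() / s))        # integer zero-point

x_q = np.clip(np.round(x / s) + z, qmin, qmax).astype(np.uint8)
x_dq = (x_q.astype(np.float32) - z) * s      # dequantized approximation of x

print(x_q)                      # integer codes in [0, 255]
print(np.abs(x - x_dq).max())   # reconstruction error on the order of s/2
```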
Precision Formats
1. FP32 (Float32) - Full Precision
$$\text{FP32: } 1 \text{ sign bit} + 8 \text{ exponent bits} + 23 \text{ mantissa bits} = 32 \text{ bits}$$
- Range: $\pm 3.4 \times 10^{38}$
- Precision: ~7 decimal digits
- Size: 4 bytes per parameter
- Use case: Training, highest accuracy requirements
2. FP16 (Float16) - Half Precision
$$\text{FP16: } 1 \text{ sign bit} + 5 \text{ exponent bits} + 10 \text{ mantissa bits} = 16 \text{ bits}$$
- Range: $\pm 6.5 \times 10^{4}$
- Precision: ~3 decimal digits
- Size: 2 bytes per parameter (50% reduction)
- Use case: Mixed precision training, faster inference
3. INT8 (8-bit Integer)
$$\text{INT8: } 8 \text{ bits (signed)} \rightarrow \text{range: } [-128, 127]$$
- Range: [-128, 127] or [0, 255] (unsigned)
- Size: 1 byte per parameter (75% reduction)
- Accuracy loss: ~1-2% typical
- Use case: Edge devices, mobile deployment
4. INT4 (4-bit Integer)
$$\text{INT4: } 4 \text{ bits (signed)} \rightarrow \text{range: } [-8, 7]$$
- Range: [-8, 7] or [0, 15] (unsigned)
- Size: 0.5 bytes per parameter (87.5% reduction)
- Accuracy loss: ~3-5% typical
- Use case: Extreme compression (LLMs like LLaMA, Mistral)
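A quick back-of-the-envelope size calculation for these formats (weights only; real quantized checkpoints also store scales and zero-points, which add a small overhead):

```python
# Approximate weight storage per precision for a model with a given
# parameter count (ignores quantization metadata such as scales/zero-points).
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_size_mb(num_params: int, precision: str) -> float:
    return num_params * BYTES_PER_PARAM[precision] / 1e6

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_size_mb(100_000_000, precision):.0f} MB")
# FP32: 400 MB, FP16: 200 MB, INT8: 100 MB, INT4: 50 MB
```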
Quantization Methods
1. Post-Training Quantization (PTQ)
Quantize pre-trained model without retraining.
Steps:
1. Train model in FP32
2. Calibrate on small dataset to find optimal scale/zero-point
3. Convert weights and activations to INT8
4. Deploy quantized model
- Pros: Fast, no retraining needed, minimal code changes
- Cons: Higher accuracy loss (1-3%)
- Types:
- Dynamic quantization: Weights quantized, activations computed on-the-fly
- Static quantization: Both weights and activations pre-quantized
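As a concrete PTQ example, the sketch below uses PyTorch's dynamic quantization to convert the Linear layers of a toy model to INT8 in a single call (assumes a recent PyTorch where the API lives under torch.ao.quantization; the model itself is just a placeholder):

```python
import torch
import torch.nn as nn

# Dynamic post-training quantization: weights are stored as INT8,
# activations are quantized on the fly at inference time.
model_fp32 = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,
    {nn.Linear},        # module types to quantize
    dtype=torch.qint8,
)

x = torch.randn(1, 128)
print(model_int8(x).shape)  # same interface as the FP32 model, smaller weights
```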
2. Quantization-Aware Training (QAT)
Simulate quantization during training to minimize accuracy loss.
Fake Quantization:
$$\text{forward: } y = Q(Wx + b) \quad \text{(quantize then dequantize)}$$
$$\text{backward: } \text{gradients bypass the rounding via a straight-through estimator and stay in FP32}$$
- Pros: Minimal accuracy loss (<1%), model adapts to quantization
- Cons: Requires retraining, more complex setup
- Use when: Accuracy is critical, have training resources
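The sketch below hand-rolls the fake-quantization idea rather than using any framework's QAT API: the forward pass rounds through a symmetric quantize/dequantize step, and the backward pass uses a straight-through estimator so gradients flow to the FP32 weights unchanged.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Symmetric INT8 fake quantization with a straight-through gradient."""

    @staticmethod
    def forward(ctx, x, scale):
        # Quantize then immediately dequantize: the rest of the network sees
        # values that carry INT8 rounding error but stay in floating point.
        x_q = torch.clamp(torch.round(x / scale), -128, 127)
        return x_q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pretend the rounding was the identity.
        return grad_output, None

w = torch.randn(4, 4, requires_grad=True)
scale = w.detach().abs().max() / 127       # symmetric per-tensor scale
w_fq = FakeQuantSTE.apply(w, scale)        # use w_fq wherever w would be used
w_fq.sum().backward()
print(w.grad)                              # gradients reach the FP32 weights
```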
3. Mixed Precision
Use different precisions for different layers.
- Sensitive layers: Keep in FP16/FP32 (first/last layers, batch norm)
- Robust layers: Quantize to INT8 (middle conv/linear layers)
- Result: Better accuracy-performance tradeoff
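One way to express such a plan is a simple per-layer precision map; the helper below is purely hypothetical (not a framework API) and just encodes the rule of thumb that boundary layers stay in higher precision:

```python
# Hypothetical helper: keep the first and last layers in FP16 and mark the
# middle layers for INT8 quantization.
def layer_precision_plan(layer_names: list[str]) -> dict[str, str]:
    plan = {}
    for i, name in enumerate(layer_names):
        boundary = i == 0 or i == len(layer_names) - 1
        plan[name] = "fp16" if boundary else "int8"
    return plan

print(layer_precision_plan(["stem", "block1", "block2", "block3", "head"]))
# {'stem': 'fp16', 'block1': 'int8', 'block2': 'int8', 'block3': 'int8', 'head': 'fp16'}
```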
Symmetric vs Asymmetric Quantization
Symmetric Quantization
$$x_q = \text{round}\left(\frac{x}{s}\right), \quad z = 0$$
$$s = \frac{\max(|x_{\min}|, |x_{\max}|)}{127}$$
- Zero-point: Always 0
- Pros: Simpler, faster computation
- Cons: Wastes range if distribution is skewed
Asymmetric Quantization
$$x_q = \text{round}\left(\frac{x}{s}\right) + z$$
$$s = \frac{x_{\max} - x_{\min}}{255}, \quad z = -\text{round}\left(\frac{x_{\min}}{s}\right)$$
- Zero-point: Adjustable offset
- Pros: Better utilizes full range, higher accuracy
- Cons: More complex, slight computational overhead
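The two schemes differ only in how scale and zero-point are chosen; the NumPy comparison below (random, deliberately skewed data) shows why asymmetric quantization tends to win when the value distribution is not centered on zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=1000).astype(np.float32)  # skewed positive

# Symmetric: zero-point fixed at 0, range centered on zero.
s_sym = np.abs(x).max() / 127
q_sym = np.clip(np.round(x / s_sym), -128, 127)
err_sym = np.abs(x - q_sym * s_sym).mean()

# Asymmetric: scale spans [min, max] exactly, zero-point shifts the range.
s_asym = (x.max() - x.min()) / 255
z_asym = -np.round(x.min() / s_asym)
q_asym = np.clip(np.round(x / s_asym) + z_asym, 0, 255)
err_asym = np.abs(x - (q_asym - z_asym) * s_asym).mean()

print(err_sym, err_asym)  # asymmetric error is typically lower for skewed data
```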
Practical Impact
| Precision | Model Size | Memory | Speed | Accuracy | Use Case |
|---|---|---|---|---|---|
| FP32 | 100% | 100% | 1x | Baseline | Training, research |
| FP16 | 50% | 50% | 2-3x | ~ -0.1% | Cloud inference, training |
| INT8 | 25% | 25% | 3-4x | -1 to -2% | Edge devices, mobile |
| INT4 | 12.5% | 12.5% | 4-5x | -3 to -5% | Extreme compression (LLMs) |
Example: a 100M-parameter model
- FP32: 400 MB, 100 ms latency
- FP16: 200 MB, 50 ms latency, 99.9% accuracy
- INT8: 100 MB, 30 ms latency, 99% accuracy
- INT4: 50 MB, 25 ms latency, 96% accuracy
Hardware Acceleration
1. TensorRT (NVIDIA)
- Target: NVIDIA GPUs
- Optimizations: Layer fusion, kernel auto-tuning, INT8 calibration
- Speedup: 3-10x over native PyTorch/TensorFlow
- Use case: Production inference on NVIDIA hardware
2. ONNX Runtime
- Target: Cross-platform (CPU, GPU, edge)
- Optimizations: Graph optimizations, quantization, operator fusion
- Framework: Supports PyTorch, TensorFlow, scikit-learn models
- Use case: Framework-agnostic deployment
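For instance, ONNX Runtime's quantization toolkit exposes a one-call dynamic quantizer; a sketch (file paths are placeholders, assuming a recent onnxruntime with the onnxruntime.quantization module):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the weights of an exported ONNX model to INT8.
# "model_fp32.onnx" and "model_int8.onnx" are placeholder paths.
quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```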
3. Neural Compute Stick / TPU
- Target: Edge devices (Raspberry Pi, mobile)
- Precision: Typically INT8 or FP16
- Power: 1-5W power consumption
- Use case: On-device inference, IoT
Quantization Workflow
Typical Pipeline:
1. Train model in FP32
2. Evaluate baseline accuracy
3. Apply post-training quantization (PTQ)
4. If accuracy drop > 2%, use QAT
5. Benchmark speed/size improvements
6. Deploy to target hardware
Decision Tree:
- Accuracy drop < 1%? → Use PTQ (faster)
- Accuracy drop > 2%? → Use QAT (retraining)
- Extreme compression needed? → Try INT4 with QAT
- First/last layers sensitive? → Use mixed precision
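The same decision tree, written as a small hypothetical helper (thresholds taken from the rules above):

```python
# Hypothetical strategy picker mirroring the decision tree above.
def choose_strategy(ptq_accuracy_drop: float, need_extreme_compression: bool) -> str:
    if need_extreme_compression:
        return "INT4 with QAT"
    if ptq_accuracy_drop < 0.01:        # < 1% drop: PTQ is good enough
        return "PTQ (INT8)"
    if ptq_accuracy_drop > 0.02:        # > 2% drop: retrain with QAT
        return "QAT (INT8)"
    return "PTQ with mixed precision"   # 1-2% drop: keep sensitive layers in FP16/FP32

print(choose_strategy(0.005, False))  # PTQ (INT8)
print(choose_strategy(0.03, False))   # QAT (INT8)
```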
Key Insights
- Size reduction: FP32→INT8 = 4x smaller, FP32→INT4 = 8x smaller
- Speed improvement: INT8 typically 2-4x faster than FP32
- Accuracy tradeoff: INT8 loses 1-2%, INT4 loses 3-5%
- PTQ vs QAT: PTQ is fast but less accurate, QAT requires retraining
- Calibration dataset: Use 100-1000 representative samples
- Layer sensitivity: First/last layers most sensitive, middle layers robust
- Hardware support: Most modern hardware has INT8 acceleration
- LLMs: INT4 quantization enables running 7B models on consumer GPUs
Common Pitfalls
- Calibration data mismatch: Use data similar to production distribution
- Batch norm layers: Fuse with previous conv layers before quantization
- Outliers: Can cause poor quantization range selection
- Per-channel vs per-tensor: Per-channel quantization is more accurate when channel ranges differ (see the sketch after this list)
- Dynamic range: Clipping too aggressively loses information
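On the per-channel point, the NumPy sketch below (illustrative weight shape, with one deliberately large channel) shows why a single per-tensor scale suffers when channel ranges differ:

```python
import numpy as np

# A weight matrix of shape (out_channels, in_features) with one outlier channel.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64)).astype(np.float32)
W[0] *= 10.0                                            # outlier channel

s_tensor = np.abs(W).max() / 127                        # one scale for the tensor
s_channel = np.abs(W).max(axis=1, keepdims=True) / 127  # one scale per channel

err_tensor = np.abs(W - np.round(W / s_tensor) * s_tensor).mean()
err_channel = np.abs(W - np.round(W / s_channel) * s_channel).mean()
print(err_tensor, err_channel)  # per-channel error is much lower here
```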