Model Quantization

Reduce model size and improve inference speed through precision reduction


What is Quantization?

Quantization reduces the numerical precision of model weights and activations, trading accuracy for efficiency.

Quantization Formula: $$x_q = \text{round}\left(\frac{x}{s} + z\right)$$ $$x_{\text{dequant}} = s \cdot (x_q - z)$$

Where $x$ is the original floating-point value, $s$ is the scale factor, $z$ is the integer zero-point, and $x_q$ is the quantized integer.
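
To make the mapping concrete, here is a minimal NumPy sketch of the round trip above, using an unsigned 8-bit range of [0, 255]; the random input tensor is only a placeholder for real weights or activations.

```python
import numpy as np

# Placeholder tensor standing in for a layer's weights or activations.
x = np.random.randn(1000).astype(np.float32)

qmin, qmax = 0, 255                           # unsigned 8-bit range
s = (x.max() - x.min()) / (qmax - qmin)       # scale
z = int(round(qmin - x.min() / s))            # integer zero-point

x_q = np.clip(np.round(x / s + z), qmin, qmax).astype(np.uint8)   # quantize
x_dq = (x_q.astype(np.float32) - z) * s                           # dequantize

print("max abs error:", np.abs(x - x_dq).max())   # roughly bounded by s / 2
```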

Precision Formats

1. FP32 (Float32) - Full Precision

$$\text{FP32: } 1 \text{ sign bit} + 8 \text{ exponent bits} + 23 \text{ mantissa bits} = 32 \text{ bits}$$

2. FP16 (Float16) - Half Precision

$$\text{FP16: } 1 \text{ sign bit} + 5 \text{ exponent bits} + 10 \text{ mantissa bits} = 16 \text{ bits}$$

3. INT8 (8-bit Integer)

$$\text{INT8: } 8 \text{ bits (signed)} \rightarrow \text{range: } [-128, 127]$$

4. INT4 (4-bit Integer)

$$\text{INT4: } 4 \text{ bits (signed)} \rightarrow \text{range: } [-8, 7]$$
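
A small NumPy sketch of how the same value is stored at each precision; the example value and the assumption that values lie in [-1, 1] (for the INT8 scale) are illustrative.

```python
import numpy as np

x = 0.123456789

fp32 = np.float32(x)               # 1 sign + 8 exponent + 23 mantissa bits
fp16 = np.float16(x)               # 1 sign + 5 exponent + 10 mantissa bits

s = 1.0 / 127                      # symmetric INT8 scale, assuming values in [-1, 1]
int8 = np.int8(round(x / s))       # stored in a single signed byte
int8_dequant = float(int8) * s

print(f"FP32: {fp32}")                         # 0.12345679 (about 7 significant digits)
print(f"FP16: {fp16}")                         # 0.1235     (about 3 significant digits)
print(f"INT8: {int8} -> {int8_dequant:.4f}")   # 16 -> 0.1260
```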

Quantization Methods

1. Post-Training Quantization (PTQ)

Quantize a pre-trained model without retraining.

Steps (sketched in code below):
1. Train model in FP32
2. Calibrate on small dataset to find optimal scale/zero-point
3. Convert weights and activations to INT8
4. Deploy quantized model
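
A minimal sketch of these steps using PyTorch's eager-mode post-training static quantization; TinyNet, its layer sizes, and the random calibration data are placeholders for a real trained model and calibration set.

```python
import torch
import torch.nn as nn
import torch.quantization as tq

class TinyNet(nn.Module):
    """Toy FP32 model; QuantStub/DeQuantStub mark the INT8 region of the graph."""
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # FP32 -> INT8 at the input
        self.fc1 = nn.Linear(64, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 10)
        self.dequant = tq.DeQuantStub()  # INT8 -> FP32 at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

# Step 1: a trained FP32 model (weights would normally be loaded from a checkpoint).
model = TinyNet().eval()

# Step 2: calibrate on a small dataset to find scales/zero-points.
model.qconfig = tq.get_default_qconfig("fbgemm")   # x86 backend; use "qnnpack" on ARM
prepared = tq.prepare(model)                       # insert observers
with torch.no_grad():
    for _ in range(32):                            # placeholder calibration batches
        prepared(torch.randn(8, 64))

# Step 3: convert weights and activations to INT8.
quantized = tq.convert(prepared)

# Step 4: deploy - the quantized model is called like any other nn.Module.
print(quantized(torch.randn(1, 64)).shape)
```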

2. Quantization-Aware Training (QAT)

Simulate quantization during training to minimize accuracy loss.

Fake Quantization: $$\text{forward: } y = Q(Wx + b) \quad \text{(quantize then dequantize)}$$ $$\text{backward: gradients bypass the rounding (straight-through estimator) and are computed in FP32}$$
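
A short sketch of fake quantization with a straight-through estimator in PyTorch; the 8-bit width and the per-tensor min/max scheme are illustrative choices.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize-then-dequantize in the forward pass; gradients pass straight through."""
    qmin, qmax = 0, 2 ** num_bits - 1
    s = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)   # per-tensor scale
    z = qmin - torch.round(x.min() / s)                       # zero-point
    x_q = torch.clamp(torch.round(x / s + z), qmin, qmax)
    x_dq = (x_q - z) * s
    # Straight-through estimator: the forward value is x_dq, but the rounding is
    # treated as identity in the backward pass, so gradients stay in FP32.
    return x + (x_dq - x).detach()

w = torch.randn(4, 4, requires_grad=True)
fake_quantize(w).sum().backward()
print(w.grad)   # all ones: the rounding did not block the gradient
```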

3. Mixed Precision

Use different precisions for different layers, keeping quantization-sensitive layers (often the first and last layers) at higher precision.
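
As one illustration, eager-mode PyTorch quantization lets you exclude individual layers by setting their qconfig to None; the toy model and the choice of which layer stays in FP32 below are placeholders.

```python
import torch
import torch.nn as nn
import torch.quantization as tq

# First Linear + ReLU run in INT8; the last Linear (after the DeQuantStub) stays FP32.
model = nn.Sequential(
    tq.QuantStub(),           # FP32 -> INT8 boundary
    nn.Linear(64, 64),
    nn.ReLU(),
    tq.DeQuantStub(),         # INT8 -> FP32 boundary
    nn.Linear(64, 10),        # kept at full precision
).eval()

model.qconfig = tq.get_default_qconfig("fbgemm")
model[4].qconfig = None       # qconfig=None excludes this layer from quantization

prepared = tq.prepare(model)
prepared(torch.randn(4, 64))  # one placeholder calibration pass
mixed = tq.convert(prepared)
print(mixed)
```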

Symmetric vs Asymmetric Quantization

Symmetric Quantization

$$x_q = \text{round}\left(\frac{x}{s}\right), \quad z = 0$$ $$s = \frac{\max(|x_{\min}|, |x_{\max}|)}{127}$$

Asymmetric Quantization

$$x_q = \text{round}\left(\frac{x}{s}\right) + z$$ $$s = \frac{x_{\max} - x_{\min}}{255}, \quad z = -\text{round}\left(\frac{x_{\min}}{s}\right)$$
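
A quick NumPy comparison of the two schemes on a skewed, mostly positive tensor (an illustrative placeholder); the asymmetric scheme uses the full [0, 255] range and typically reconstructs such data with lower error.

```python
import numpy as np

x = np.random.exponential(scale=1.0, size=10_000).astype(np.float32)   # skewed, mostly positive

# Symmetric: zero-point fixed at 0, signed range [-127, 127]
s_sym = max(abs(float(x.min())), abs(float(x.max()))) / 127
x_sym = np.clip(np.round(x / s_sym), -127, 127)
err_sym = np.abs(x - x_sym * s_sym).mean()

# Asymmetric: full unsigned range [0, 255] plus a zero-point offset
s_asym = (x.max() - x.min()) / 255
z = -int(round(float(x.min()) / s_asym))
x_asym = np.clip(np.round(x / s_asym) + z, 0, 255)
err_asym = np.abs(x - (x_asym - z) * s_asym).mean()

print(f"symmetric  mean abs error: {err_sym:.5f}")
print(f"asymmetric mean abs error: {err_asym:.5f}")   # lower: no codes wasted on negative values
```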

Practical Impact

| Precision | Model Size | Memory | Speed | Accuracy Impact | Use Case |
|-----------|------------|--------|-------|-----------------|----------|
| FP32 | 100% | 100% | 1x | baseline | Training, research |
| FP16 | 50% | 50% | 2-3x | ~0.1% drop | Cloud inference, training |
| INT8 | 25% | 25% | 3-4x | 1-2% drop | Edge devices, mobile |
| INT4 | 12.5% | 12.5% | 4-5x | 3-5% drop | Extreme compression (LLMs) |

Example: ResNet-50 (~25M parameters): ~100 MB in FP32, ~50 MB in FP16, ~25 MB in INT8, ~12.5 MB in INT4.

Hardware Acceleration

1. TensorRT (NVIDIA): builds optimized FP16/INT8 inference engines for NVIDIA GPUs; INT8 mode uses a calibration dataset to set scales.

2. ONNX Runtime: cross-platform runtime with built-in tooling for dynamic and static quantization of ONNX models (see the sketch below).

3. Neural Compute Stick / TPU: edge accelerators built around reduced precision; Intel's Neural Compute Stick targets FP16 via OpenVINO, while Google's Edge TPU requires fully INT8-quantized models.
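
For example, ONNX Runtime ships a dynamic-quantization helper; a minimal sketch is below, where the model file paths are placeholders (the FP32 model would typically come from an exporter such as torch.onnx.export).

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: weights are stored as INT8, activations are quantized at runtime.
quantize_dynamic(
    "model_fp32.onnx",              # placeholder path to the exported FP32 model
    "model_int8.onnx",              # placeholder path for the quantized output
    weight_type=QuantType.QInt8,
)
```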

Quantization Workflow

Typical Pipeline:
1. Train model in FP32
2. Evaluate baseline accuracy
3. Apply post-training quantization (PTQ)
4. If accuracy drop > 2%, use QAT
5. Benchmark speed/size improvements (see the sketch below)
6. Deploy to target hardware
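
A hedged sketch of step 5 (comparing size and latency of the FP32 baseline and the quantized model) in PyTorch; the helper names and the temporary file path are placeholders. Run both helpers on the FP32 and INT8 models and compare against the table above.

```python
import os
import time
import torch

def model_size_mb(model: torch.nn.Module, path: str = "tmp_weights.pt") -> float:
    """Serialize the state_dict to disk and report its size in MB."""
    torch.save(model.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return size_mb

def mean_latency_ms(model: torch.nn.Module, example_input: torch.Tensor, runs: int = 100) -> float:
    """Average CPU forward-pass latency in milliseconds."""
    model.eval()
    with torch.no_grad():
        model(example_input)                     # warm-up
        start = time.perf_counter()
        for _ in range(runs):
            model(example_input)
    return (time.perf_counter() - start) / runs * 1e3
```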

Decision Tree:
1. FP16 meets the size/latency target? → use FP16 (negligible accuracy loss)
2. Need more compression? → apply INT8 PTQ and re-evaluate accuracy
3. Accuracy drop > 2%? → switch to QAT
4. Extreme memory constraints (e.g., large LLMs)? → consider INT4, accepting a 3-5% accuracy drop

Key Insights

- INT8 typically cuts model size and memory to ~25% of FP32 at a 1-2% accuracy cost.
- Start with PTQ; switch to QAT only when the accuracy drop exceeds ~2%.
- Real speedups depend on hardware support for low-precision kernels (TensorRT, ONNX Runtime, edge accelerators).

Common Pitfalls