Model Quantization
Reduce model size and improve inference speed through precision reduction
(Interactive demo, quantization impact: a 100 MB FP32 model becomes 25 MB at INT8, a 4x size reduction.)
(Interactive demo, precision formats: shows how different precisions represent the same number.)
What is Quantization?
Quantization reduces the numerical precision of model weights and activations, trading accuracy for efficiency.
Quantization Formula:
$$x_q = \text{round}\left(\frac{x}{s}\right) + z$$
$$x_{\text{dequant}} = (x_q - z) \cdot s$$
Where:
- $x$ - original floating-point value
- $x_q$ - quantized integer value
- $s$ - scale factor
- $z$ - zero-point offset
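As a quick illustration, the NumPy sketch below (values and variable names are illustrative) quantizes a small FP32 tensor to unsigned 8-bit using the formula above and dequantizes it back, so the rounding error is visible directly.

```python
import numpy as np

# Minimal sketch of affine quantization: derive scale and zero-point from the
# observed min/max range, quantize to unsigned 8-bit, then dequantize.
x = np.array([-1.5, -0.2, 0.0, 0.7, 2.3], dtype=np.float32)

qmin, qmax = 0, 255
s = (x.max() - x.min()) / (qmax - qmin)      # scale factor
z = qmin - int(np.round(x.min() / s))        # integer zero-point

x_q = np.clip(np.round(x / s) + z, qmin, qmax).astype(np.uint8)
x_dq = (x_q.astype(np.float32) - z) * s      # dequantized approximation of x

print(x_q)                      # integer codes in [0, 255]
print(np.abs(x - x_dq).max())   # reconstruction error on the order of s/2
```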
Precision Formats
1. FP32 (Float32) - Full Precision
$$\text{FP32: } 1 \text{ sign bit} + 8 \text{ exponent bits} + 23 \text{ mantissa bits} = 32 \text{ bits}$$
- Range: $\pm 3.4 \times 10^{38}$
- Precision: ~7 decimal digits
- Size: 4 bytes per parameter
- Use case: Training, highest accuracy requirements
2. FP16 (Float16) - Half Precision
$$\text{FP16: } 1 \text{ sign bit} + 5 \text{ exponent bits} + 10 \text{ mantissa bits} = 16 \text{ bits}$$
- Range: $\pm 6.5 \times 10^{4}$
- Precision: ~3 decimal digits
- Size: 2 bytes per parameter (50% reduction)
- Use case: Mixed precision training, faster inference
3. INT8 (8-bit Integer)
$$\text{INT8: } 8 \text{ bits (signed)} \rightarrow \text{range: } [-128, 127]$$
- Range: [-128, 127] or [0, 255] (unsigned)
- Size: 1 byte per parameter (75% reduction)
- Accuracy loss: ~1-2% typical
- Use case: Edge devices, mobile deployment
4. INT4 (4-bit Integer)
$$\text{INT4: } 4 \text{ bits (signed)} \rightarrow \text{range: } [-8, 7]$$
- Range: [-8, 7] or [0, 15] (unsigned)
- Size: 0.5 bytes per parameter (87.5% reduction)
- Accuracy loss: ~3-5% typical
- Use case: Extreme compression (LLMs like LLaMA, Mistral)
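A quick back-of-the-envelope size calculation for these formats (weights only; real quantized checkpoints also store scales and zero-points, which add a small overhead):

```python
# Approximate weight storage per precision for a model with a given
# parameter count (ignores quantization metadata such as scales/zero-points).
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_size_mb(num_params: int, precision: str) -> float:
    return num_params * BYTES_PER_PARAM[precision] / 1e6

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_size_mb(100_000_000, precision):.0f} MB")
# FP32: 400 MB, FP16: 200 MB, INT8: 100 MB, INT4: 50 MB
```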
Quantization Methods
1. Post-Training Quantization (PTQ)
Quantize pre-trained model without retraining.
Steps:
1. Train model in FP32
2. Calibrate on small dataset to find optimal scale/zero-point
3. Convert weights and activations to INT8
4. Deploy quantized model
- Pros: Fast, no retraining needed, minimal code changes
- Cons: Higher accuracy loss (1-3%)
- Types:
- Dynamic quantization: Weights quantized, activations computed on-the-fly
- Static quantization: Both weights and activations pre-quantized
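As a concrete PTQ example, the sketch below uses PyTorch's dynamic quantization to convert the Linear layers of a toy model to INT8 in a single call (assumes a recent PyTorch where the API lives under torch.ao.quantization; the model itself is just a placeholder):

```python
import torch
import torch.nn as nn

# Dynamic post-training quantization: weights are stored as INT8,
# activations are quantized on the fly at inference time.
model_fp32 = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,
    {nn.Linear},        # module types to quantize
    dtype=torch.qint8,
)

x = torch.randn(1, 128)
print(model_int8(x).shape)  # same interface as the FP32 model, smaller weights
```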
2. Quantization-Aware Training (QAT)
Simulate quantization during training to minimize accuracy loss.
Fake Quantization:
$$\text{forward: } y = Q(Wx + b) \quad \text{(quantize then dequantize)}$$
$$\text{backward: } \text{gradients bypass the rounding via a straight-through estimator and stay in FP32}$$
- Pros: Minimal accuracy loss (<1%), model adapts to quantization
- Cons: Requires retraining, more complex setup
- Use when: Accuracy is critical, have training resources
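The sketch below hand-rolls the fake-quantization idea rather than using any framework's QAT API: the forward pass rounds through a symmetric quantize/dequantize step, and the backward pass uses a straight-through estimator so gradients flow to the FP32 weights unchanged.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Symmetric INT8 fake quantization with a straight-through gradient."""

    @staticmethod
    def forward(ctx, x, scale):
        # Quantize then immediately dequantize: the rest of the network sees
        # values that carry INT8 rounding error but stay in floating point.
        x_q = torch.clamp(torch.round(x / scale), -128, 127)
        return x_q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pretend the rounding was the identity.
        return grad_output, None

w = torch.randn(4, 4, requires_grad=True)
scale = w.detach().abs().max() / 127       # symmetric per-tensor scale
w_fq = FakeQuantSTE.apply(w, scale)        # use w_fq wherever w would be used
w_fq.sum().backward()
print(w.grad)                              # gradients reach the FP32 weights
```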
3. Mixed Precision
Use different precisions for different layers.
- Sensitive layers: Keep in FP16/FP32 (first/last layers, batch norm)
- Robust layers: Quantize to INT8 (middle conv/linear layers)
- Result: Better accuracy-performance tradeoff
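One way to express such a plan is a simple per-layer precision map; the helper below is purely hypothetical (not a framework API) and just encodes the rule of thumb that boundary layers stay in higher precision:

```python
# Hypothetical helper: keep the first and last layers in FP16 and mark the
# middle layers for INT8 quantization.
def layer_precision_plan(layer_names: list[str]) -> dict[str, str]:
    plan = {}
    for i, name in enumerate(layer_names):
        boundary = i == 0 or i == len(layer_names) - 1
        plan[name] = "fp16" if boundary else "int8"
    return plan

print(layer_precision_plan(["stem", "block1", "block2", "block3", "head"]))
# {'stem': 'fp16', 'block1': 'int8', 'block2': 'int8', 'block3': 'int8', 'head': 'fp16'}
```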
Symmetric vs Asymmetric Quantization
Symmetric Quantization
$$x_q = \text{round}\left(\frac{x}{s}\right), \quad z = 0$$
$$s = \frac{\max(|x_{\min}|, |x_{\max}|)}{127}$$
- Zero-point: Always 0
- Pros: Simpler, faster computation
- Cons: Wastes range if distribution is skewed
Asymmetric Quantization
$$x_q = \text{round}\left(\frac{x}{s}\right) + z$$
$$s = \frac{x_{\max} - x_{\min}}{255}, \quad z = -\text{round}\left(\frac{x_{\min}}{s}\right)$$
- Zero-point: Adjustable offset
- Pros: Better utilizes full range, higher accuracy
- Cons: More complex, slight computational overhead
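The two schemes differ only in how scale and zero-point are chosen; the NumPy comparison below (random, deliberately skewed data) shows why asymmetric quantization tends to win when the value distribution is not centered on zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=1000).astype(np.float32)  # skewed positive

# Symmetric: zero-point fixed at 0, range centered on zero.
s_sym = np.abs(x).max() / 127
q_sym = np.clip(np.round(x / s_sym), -128, 127)
err_sym = np.abs(x - q_sym * s_sym).mean()

# Asymmetric: scale spans [min, max] exactly, zero-point shifts the range.
s_asym = (x.max() - x.min()) / 255
z_asym = -np.round(x.min() / s_asym)
q_asym = np.clip(np.round(x / s_asym) + z_asym, 0, 255)
err_asym = np.abs(x - (q_asym - z_asym) * s_asym).mean()

print(err_sym, err_asym)  # asymmetric error is typically lower for skewed data
```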
Practical Impact
| Precision | Model Size | Memory | Speed | Accuracy | Use Case |
|---|---|---|---|---|---|
| FP32 | 100% | 100% | 1x | Baseline | Training, research |
| FP16 | 50% | 50% | 2-3x | ~ -0.1% | Cloud inference, training |
| INT8 | 25% | 25% | 3-4x | -1 to -2% | Edge devices, mobile |
| INT4 | 12.5% | 12.5% | 4-5x | -3 to -5% | Extreme compression (LLMs) |
Example: a 100M-parameter model
- FP32: 400 MB, 100 ms latency
- FP16: 200 MB, 50 ms latency, 99.9% accuracy
- INT8: 100 MB, 30 ms latency, 99% accuracy
- INT4: 50 MB, 25 ms latency, 96% accuracy
Hardware Acceleration
1. TensorRT (NVIDIA)
- Target: NVIDIA GPUs
- Optimizations: Layer fusion, kernel auto-tuning, INT8 calibration
- Speedup: 3-10x over native PyTorch/TensorFlow
- Use case: Production inference on NVIDIA hardware
2. ONNX Runtime
- Target: Cross-platform (CPU, GPU, edge)
- Optimizations: Graph optimizations, quantization, operator fusion
- Framework: Supports PyTorch, TensorFlow, scikit-learn models
- Use case: Framework-agnostic deployment
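For instance, ONNX Runtime's quantization toolkit exposes a one-call dynamic quantizer; a sketch (file paths are placeholders, assuming a recent onnxruntime with the onnxruntime.quantization module):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the weights of an exported ONNX model to INT8.
# "model_fp32.onnx" and "model_int8.onnx" are placeholder paths.
quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```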
3. Neural Compute Stick / TPU
- Target: Edge devices (Raspberry Pi, mobile)
- Precision: Typically INT8 or FP16
- Power: 1-5W power consumption
- Use case: On-device inference, IoT
Quantization Workflow
Typical Pipeline:
1. Train model in FP32
2. Evaluate baseline accuracy
3. Apply post-training quantization (PTQ)
4. If accuracy drop > 2%, use QAT
5. Benchmark speed/size improvements
6. Deploy to target hardware
Decision Tree:
- Accuracy drop < 1%? → Use PTQ (faster)
- Accuracy drop > 2%? → Use QAT (retraining)
- Extreme compression needed? → Try INT4 with QAT
- First/last layers sensitive? → Use mixed precision
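The same decision tree, written as a small hypothetical helper (thresholds taken from the rules above):

```python
# Hypothetical strategy picker mirroring the decision tree above.
def choose_strategy(ptq_accuracy_drop: float, need_extreme_compression: bool) -> str:
    if need_extreme_compression:
        return "INT4 with QAT"
    if ptq_accuracy_drop < 0.01:        # < 1% drop: PTQ is good enough
        return "PTQ (INT8)"
    if ptq_accuracy_drop > 0.02:        # > 2% drop: retrain with QAT
        return "QAT (INT8)"
    return "PTQ with mixed precision"   # 1-2% drop: keep sensitive layers in FP16/FP32

print(choose_strategy(0.005, False))  # PTQ (INT8)
print(choose_strategy(0.03, False))   # QAT (INT8)
```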
Key Insights
- Size reduction: FP32→INT8 = 4x smaller, FP32→INT4 = 8x smaller
- Speed improvement: INT8 typically 2-4x faster than FP32
- Accuracy tradeoff: INT8 loses 1-2%, INT4 loses 3-5%
- PTQ vs QAT: PTQ is fast but less accurate, QAT requires retraining
- Calibration dataset: Use 100-1000 representative samples
- Layer sensitivity: First/last layers most sensitive, middle layers robust
- Hardware support: Most modern hardware has INT8 acceleration
- LLMs: INT4 quantization enables running 7B models on consumer GPUs
Common Pitfalls
- Calibration data mismatch: Use data similar to production distribution
- Batch norm layers: Fuse with previous conv layers before quantization
- Outliers: Can cause poor quantization range selection
- Per-channel vs per-tensor: Per-channel quantization is more accurate when channel ranges differ (see the sketch after this list)
- Dynamic range: Clipping too aggressively loses information
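On the per-channel point, the NumPy sketch below (illustrative weight shape, with one deliberately large channel) shows why a single per-tensor scale suffers when channel ranges differ:

```python
import numpy as np

# A weight matrix of shape (out_channels, in_features) with one outlier channel.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64)).astype(np.float32)
W[0] *= 10.0                                            # outlier channel

s_tensor = np.abs(W).max() / 127                        # one scale for the tensor
s_channel = np.abs(W).max(axis=1, keepdims=True) / 127  # one scale per channel

err_tensor = np.abs(W - np.round(W / s_tensor) * s_tensor).mean()
err_channel = np.abs(W - np.round(W / s_channel) * s_channel).mean()
print(err_tensor, err_channel)  # per-channel error is much lower here
```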