# Model Quantization Guide

Complete guide to quantizing neural models in Nasty for deployment optimization.

## Overview

Model quantization reduces model size and inference time by converting Float32 weights to lower-precision representations (INT8, INT4). This enables:

- **4x smaller models** (400MB → 100MB)
- **2-3x faster inference** on CPU
- **40-60% lower memory usage**
- **Minimal accuracy loss** (<1% with proper calibration)
- **Mobile and edge deployment** with reduced resource requirements

## Quantization Methods

Nasty supports three quantization approaches:

### 1. INT8 Post-Training Quantization (Recommended)

Convert trained Float32 models to INT8 after training.

**Advantages:**
- No retraining required
- Fast conversion (minutes)
- <1% accuracy degradation
- Works with any trained model

**Use when:**
- You have a trained model ready for deployment
- You need quick optimization
- Accuracy requirements are not extremely strict (>97%)

```elixir
alias Nasty.Statistics.Neural.Quantization.INT8

# Load trained model
{:ok, model} = NeuralTagger.load("models/pos_tagger.axon")

# Prepare calibration data (100-1000 representative samples)
calibration_data = load_calibration_samples("data/calibration.conllu", limit: 500)

# Quantize
{:ok, quantized} = INT8.quantize(model,
  calibration_data: calibration_data,
  calibration_method: :percentile,  # More robust than :minmax
  target_accuracy_loss: 0.01        # Max 1% loss
)

# Save
INT8.save(quantized, "models/pos_tagger_int8.axon")
```

### 2. Dynamic Quantization

Quantize weights at load time, keep activations in Float32.

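
To make "quantize weights, keep activations in Float32" concrete, here is a minimal sketch in plain Elixir. It is not Nasty's implementation: the module `DynamicQuantSketch` and its functions are hypothetical names introduced only for illustration, and the sketch assumes a simple symmetric INT8 scheme with one scale for the whole weight list.

```elixir
defmodule DynamicQuantSketch do
  @moduledoc "Hypothetical illustration of weight-only quantization; not part of Nasty."

  # Quantize a list of float weights to symmetric INT8 values.
  # Returns the integers together with the scale needed to recover floats.
  def quantize_weights(weights) do
    max_abs = weights |> Enum.map(&abs/1) |> Enum.max()
    # Guard against an all-zero weight list.
    scale = if max_abs == 0, do: 1.0, else: max_abs / 127
    quantized = Enum.map(weights, fn w -> round(w / scale) end)
    {quantized, scale}
  end

  # Dot product where weights are stored as INT8 but activations stay float:
  # each weight is dequantized (q * scale) right before it is used.
  def dot(activations, {quantized, scale}) do
    activations
    |> Enum.zip(quantized)
    |> Enum.reduce(0.0, fn {a, q}, acc -> acc + a * (q * scale) end)
  end
end

weights = [0.5, -1.27, 0.02, 1.0]
activations = [1.0, 2.0, 3.0, 4.0]

{q, scale} = DynamicQuantSketch.quantize_weights(weights)
approx = DynamicQuantSketch.dot(activations, {q, scale})

exact =
  activations
  |> Enum.zip(weights)
  |> Enum.reduce(0.0, fn {a, w}, acc -> acc + a * w end)

IO.inspect(q, label: "int8 weights")
IO.inspect({exact, approx}, label: "exact vs quantized dot product")
```

Because only the weights are rounded, no activation statistics are needed, which is why this approach works without calibration data.
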

**Advantages:**
- No calibration data needed
- Faster to apply than static quantization
- Easy to apply

**Disadvantages:**
- Slower inference than INT8 (activations still Float32)
- Only about 50% smaller (versus about 75% for INT8)

**Use when:**
- You don't have calibration data
- You need quick wins without accuracy concerns
- Memory is more constrained than compute

```elixir
alias Nasty.Statistics.Neural.Quantization.Dynamic

{:ok, model} = NeuralTagger.load("models/pos_tagger.axon")

# Quantize dynamically
{:ok, quantized} = Dynamic.quantize(model)

# Use immediately - no saving needed
{:ok, predictions} = Dynamic.predict(quantized, tokens)
```

### 3. Quantization-Aware Training (QAT)

Train model with quantization simulation from the start.

**Advantages:**
- Best accuracy (no degradation)
- Handles quantization errors during training
- Optimal for production

**Disadvantages:**
- Requires retraining
- Longer training time (1.5-2x)
- More complex setup

**Use when:**
- Accuracy is critical (medical, legal, finance)
- You're training from scratch anyway
- You have time for proper training

```elixir
alias Nasty.Statistics.Neural.Quantization.QAT
alias Nasty.Statistics.Neural.Transformers.FineTuner

# Fine-tune with QAT enabled
{:ok, model} = FineTuner.fine_tune(
  base_model,
  training_data,
  :pos_tagging,
  epochs: 5,
  quantization_aware: true,  # Enable QAT
  qat_opts: [
    bits: 8,
    fake_quantize: true
  ]
)

# Model is already quantization-ready
QAT.save(model, "models/pos_tagger_qat_int8.axon")
```

## Calibration Data

Calibration determines optimal quantization ranges for activations.

### Requirements

- **Size**: 100-1000 samples (more is better, diminishing returns after 1000)
- **Representativeness**: Must cover typical input distributions
- **Format**: Same as training data (tokens, sentences, etc.)

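
As a rough illustration of what "determining a quantization range" involves, the sketch below derives an INT8 scale from a list of observed activation values in two ways: from the absolute maximum, and from a percentile that clips outliers. This is a simplified model in plain Elixir, not Nasty's calibration code; `CalibrationSketch` and its functions are hypothetical names, and the floor-based percentile index is just one simple convention.

```elixir
defmodule CalibrationSketch do
  @moduledoc "Hypothetical illustration of range calibration; not part of Nasty."

  # MinMax: the range is set by the largest absolute activation observed.
  def minmax_scale(activations) do
    max_abs = activations |> Enum.map(&abs/1) |> Enum.max()
    max_abs / 127
  end

  # Percentile: clip the range at the given percentile of absolute values,
  # so a handful of outliers cannot stretch the range.
  def percentile_scale(activations, percentile) do
    sorted = activations |> Enum.map(&abs/1) |> Enum.sort()
    # 0-based index into the sorted list, rounded down.
    index = trunc(percentile / 100 * (length(sorted) - 1))
    Enum.at(sorted, index) / 127
  end

  # Map a float to INT8 using the scale, saturating at the range limits.
  def quantize(value, scale) do
    (value / scale) |> round() |> max(-127) |> min(127)
  end
end

# 999 typical activations in [-1.0, 1.0] plus one large outlier.
activations = Enum.map(1..999, fn i -> :math.sin(i) end) ++ [50.0]

minmax = CalibrationSketch.minmax_scale(activations)
clipped = CalibrationSketch.percentile_scale(activations, 99.9)

IO.inspect(minmax, label: "minmax scale")
IO.inspect(clipped, label: "99.9th percentile scale")
IO.inspect(CalibrationSketch.quantize(0.5, minmax), label: "0.5 quantized with minmax")
IO.inspect(CalibrationSketch.quantize(0.5, clipped), label: "0.5 quantized with percentile")
```

With the outlier present, the min/max scale is so coarse that typical values collapse onto a few integer levels, while the percentile scale keeps most of the INT8 range for typical values and saturates the outlier. This is why representative calibration samples matter: the scale is only as good as the activations it was derived from.
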
### Preparing Calibration Data

```elixir
# From CoNLL-U file
defmodule CalibrationLoader do
  def load_samples(path, opts \\ []) do
    limit = Keyword.get(opts, :limit, 500)

    path
    |> DataLoader.load_conllu_file()
    |> elem(1)
    |> Enum.take(limit)
    |> Enum.map(fn sentence ->
      # Convert to format expected by model
      %{
        input_ids: sentence.input_ids,
        attention_mask: sentence.attention_mask
      }
    end)
  end
end

calibration_data = CalibrationLoader.load_samples("data/dev.conllu", limit: 500)
```

### Calibration Methods

**MinMax** (`:minmax`):
- Uses absolute min/max of activations
- Fast but sensitive to outliers
- Default method

```elixir
INT8.quantize(model, calibration_data: data, calibration_method: :minmax)
```

**Percentile** (`:percentile`):
- Uses 99.99th percentile instead of absolute max
- More robust to outliers
- Recommended for production

```elixir
INT8.quantize(model,
  calibration_data: data,
  calibration_method: :percentile,
  percentile: 99.99
)
```

**Entropy** (`:entropy`):
- Minimizes KL divergence between FP32 and INT8
- Best accuracy but slowest
- Use for critical applications

```elixir
INT8.quantize(model,
  calibration_data: data,
  calibration_method: :entropy
)
```

## Model Comparison

### Before Quantization

```bash
# Original Float32 model
ls -lh models/pos_tagger.axon
# => 412M

# Inference time (CPU)
mix nasty.benchmark --model models/pos_tagger.axon
# => 45ms per sentence
```

### After INT8 Quantization

```bash
# Quantized INT8 model
ls -lh models/pos_tagger_int8.axon
# => 108M (3.8x smaller)

# Inference time (CPU)
mix nasty.benchmark --model models/pos_tagger_int8.axon
# => 18ms per sentence (2.5x faster)
```

### Accuracy Comparison

```bash
# Evaluate both models
mix nasty.eval --model models/pos_tagger.axon --test data/test.conllu
# => Accuracy: 97.8%

mix nasty.eval --model models/pos_tagger_int8.axon --test data/test.conllu
# => Accuracy: 97.4% (0.4 percentage points lower)
```

## Mix Tasks

### Quantize Existing Model

```bash
mix nasty.quantize \
  --model models/pos_tagger.axon \
  --calibration data/calibration.conllu \
  --method percentile \
  --output models/pos_tagger_int8.axon
```

### Evaluate Quantized Model

```bash
mix nasty.quantize.eval \
  --original models/pos_tagger.axon \
  --quantized models/pos_tagger_int8.axon \
  --test data/test.conllu
```

Output:

```
Comparing models on 2000 test examples:

Original (Float32):
  Accuracy: 97.84%
  Memory: 412MB
  Avg inference: 45.3ms

Quantized (INT8):
  Accuracy: 97.41%
  Memory: 108MB
  Avg inference: 18.2ms

Summary:
  Size reduction: 3.8x
  Speed improvement: 2.5x
  Accuracy loss: 0.43%
```

### Estimate Size Reduction

```bash
mix nasty.quantize.estimate --model models/pos_tagger.axon
```

Output:

```
Model: models/pos_tagger.axon
Parameters: 125,000,000

Estimated sizes:
  Float32 (current): 412 MB
  INT8: 108 MB (3.8x smaller)
  INT4: 58 MB (7.1x smaller)

Memory usage:
  Float32: ~1.2 GB (with activations)
  INT8: ~350 MB (70% reduction)
```

## Advanced Options

### Per-Channel Quantization

Quantize each output channel separately for better accuracy:

```elixir
INT8.quantize(model,
  calibration_data: data,
  per_channel: true  # Default
)
```

### Symmetric vs Asymmetric

**Symmetric** (default, faster):

```elixir
INT8.quantize(model, symmetric: true)
# Range: [-127, 127], zero_point = 0
```

**Asymmetric** (better accuracy):

```elixir
INT8.quantize(model, symmetric: false)
# Range: [-128, 127], zero_point = computed
```

### Selective Quantization

Quantize only certain layers:

```elixir
INT8.quantize(model,
  calibration_data: data,
  skip_layers: ["embedding", "output"]  # Keep these in Float32
)
```

## Deployment Strategies

### CPU Deployment

INT8 quantization provides maximum speedup on CPU:

```elixir
# Production inference: load the model once, then reuse it for every call
{:ok, model} = INT8.load("models/pos_tagger_int8.axon")

# Anonymous function so that it can close over the loaded model
tag_text = fn text ->
  {:ok, tokens} = Tokenizer.tokenize(text)
  {:ok, tagged} = INT8.predict(model, tokens)
  tagged
end
```

### GPU Deployment

Limited benefits on GPU (GPUs are optimized for Float32):

```elixir
# Use Float32 on GPU, INT8 on CPU
model =
  if gpu_available?() do
    {:ok, m} = NeuralTagger.load("models/pos_tagger.axon")
    m
  else
    {:ok, m} = INT8.load("models/pos_tagger_int8.axon")
    m
  end
```

### Mobile/Edge Deployment

Essential for resource-constrained devices:

```elixir
# Aggressive quantization for mobile
{:ok, model} = INT8.quantize(full_model,
  calibration_data: data,
  calibration_method: :percentile,
  per_channel: true,
  compress: true  # Additional gzip compression
)

# Further optimize
{:ok, pruned} = Pruner.prune(model, sparsity: 0.3)
{:ok, distilled} = Distiller.distill(pruned, student_size: 0.5)
```

## Troubleshooting

### High Accuracy Loss

**Problem**: Accuracy drops >2% after quantization

**Solutions**:
1. Use more calibration data (increase from 100 to 1000 samples)
2. Switch to percentile method with higher percentile (99.99)
3. Use asymmetric quantization
4. Skip quantizing sensitive layers (embedding, output)
5. Try QAT for best accuracy

```elixir
# Better calibration
INT8.quantize(model,
  calibration_data: more_samples,  # 1000 instead of 100
  calibration_method: :percentile,
  percentile: 99.99,
  symmetric: false
)
```

### Slow Quantization

**Problem**: Calibration takes too long

**Solutions**:
1. Reduce calibration sample size
2. Use minmax instead of entropy method
3. Disable per-channel quantization

```elixir
# Faster quantization
INT8.quantize(model,
  calibration_data: fewer_samples,  # 100 instead of 1000
  calibration_method: :minmax,
  per_channel: false
)
```

### Large Model Size

**Problem**: INT8 model still too large

**Solutions**:
1. Apply model pruning first
2. Use knowledge distillation
3. Consider INT4 quantization (more aggressive)

```elixir
# Aggressive optimization pipeline
{:ok, pruned} = Pruner.prune(model, sparsity: 0.4)
{:ok, quantized} = INT8.quantize(pruned, calibration_data: data)
{:ok, compressed} = Compressor.compress(quantized, method: :gzip)
```

## Best Practices

### 1. Always Validate Accuracy

```elixir
# Validate before deploying
{:ok, quantized} = INT8.quantize(model,
  calibration_data: data,
  target_accuracy_loss: 0.01  # Fail if >1% loss
)
```

### 2. Use Representative Calibration Data

```elixir
# BAD: Only formal text
calibration_data = load_samples("formal_documents.txt")

# GOOD: Mixed domains matching production
calibration_data =
  load_samples("news.txt", 100) ++
    load_samples("social_media.txt", 100) ++
    load_samples("technical.txt", 100)
```

### 3. Benchmark in Production Environment

```bash
# Test on actual deployment hardware
mix nasty.benchmark \
  --model models/pos_tagger_int8.axon \
  --environment production \
  --samples 1000
```

### 4. Version Your Quantized Models

```
models/
  pos_tagger_v1_fp32.axon             # Original
  pos_tagger_v1_int8_minmax.axon      # Quick quantization
  pos_tagger_v1_int8_percentile.axon  # Production quantization
  pos_tagger_v1_qat.axon              # Quantization-aware trained
```

## Performance Metrics

### POS Tagging (UD English)

| Model | Size | Inference (CPU) | Accuracy | Use Case |
|-------|------|----------------|----------|----------|
| Float32 | 412MB | 45ms | 97.8% | GPU servers |
| INT8 (minmax) | 108MB | 19ms | 97.2% | Fast deployment |
| INT8 (percentile) | 108MB | 18ms | 97.4% | Production |
| INT8 QAT | 108MB | 18ms | 97.8% | Critical apps |

### NER (CoNLL-2003)

| Model | Size | Inference (CPU) | F1 Score | Use Case |
|-------|------|----------------|----------|----------|
| Float32 | 380MB | 52ms | 94.2% | Research |
| INT8 | 98MB | 21ms | 93.5% | Production |

## See Also

- [NEURAL_MODELS.md](NEURAL_MODELS.md) - Neural architecture details
- [FINE_TUNING.md](FINE_TUNING.md) - Training custom models
- [PRETRAINED_MODELS.md](PRETRAINED_MODELS.md) - Using transformers
- [Model Compression Papers](https://arxiv.org/abs/2010.03954)