# Training Neural Models Guide

This guide provides detailed instructions for training neural models in Nasty, from data preparation to deployment.

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Data Preparation](#data-preparation)
3. [Training POS Tagging Models](#training-pos-tagging-models)
4. [Advanced Training Options](#advanced-training-options)
5. [Model Evaluation](#model-evaluation)
6. [Troubleshooting](#troubleshooting)

## Prerequisites

### System Requirements

- **Memory**: Minimum 4GB RAM for training, 8GB+ recommended
- **CPU**: Multi-core CPU (4+ cores recommended)
- **GPU**: Optional but highly recommended (10-100x speedup with EXLA)
- **Storage**: 500MB-2GB for models and training data

### Dependencies

All neural dependencies are included in `mix.exs`:

```elixir
{:axon, "~> 0.7"},
{:nx, "~> 0.9"},
{:exla, "~> 0.9"},
{:bumblebee, "~> 0.6"}
```

Install with:

```bash
mix deps.get
```

### Enable GPU Acceleration (Optional)

Set an environment variable for EXLA to use the GPU:

```bash
export XLA_TARGET=cuda120  # or cuda118, rocm, etc.
mix deps.compile
```

## Data Preparation

### CoNLL-U Format

Neural models train on CoNLL-U formatted data. Sentences are separated by blank lines, with one token per line:

```
1	The	the	DET	DT	_	2	det	_	_
2	cat	cat	NOUN	NN	_	3	subj	_	_
3	sat	sit	VERB	VBD	_	0	root	_	_

1	Dogs	dog	NOUN	NNS	_	2	subj	_	_
2	run	run	VERB	VBP	_	0	root	_	_
```

Columns (tab-separated):

1. Index
2. Word form
3. Lemma
4. **UPOS tag** (used for training)
5. XPOS tag
6. Features
7. Head
8. Dependency relation
9-10.
Additional annotations

### Where to Get Training Data

**Universal Dependencies** corpora:

- English: [UD_English-EWT](https://github.com/UniversalDependencies/UD_English-EWT)
- Spanish: [UD_Spanish-GSD](https://github.com/UniversalDependencies/UD_Spanish-GSD)
- Catalan: [UD_Catalan-AnCora](https://github.com/UniversalDependencies/UD_Catalan-AnCora)

Download and extract:

```bash
cd data
git clone https://github.com/UniversalDependencies/UD_English-EWT
```

### Data Split Recommendations

- **Training**: 80% (or use provided train split)
- **Validation**: 10% (or use provided dev split)
- **Test**: 10% (or use provided test split)

The training pipeline handles splitting automatically if you provide a single file.

## Training POS Tagging Models

### Quick Start - CLI Training

The easiest way to train is using the Mix task:

```bash
mix nasty.train.neural_pos \
  --corpus data/UD_English-EWT/en_ewt-ud-train.conllu \
  --output models/pos_neural_v1.axon \
  --epochs 10 \
  --batch-size 32
```

### CLI Options Reference

```bash
mix nasty.train.neural_pos [options]

Required:
  --corpus PATH             Path to CoNLL-U training corpus

Optional:
  --output PATH             Model save path (default: pos_neural.axon)
  --validation PATH         Path to validation corpus (auto-split if not provided)
  --epochs N                Number of training epochs (default: 10)
  --batch-size N            Batch size (default: 32)
  --learning-rate F         Learning rate (default: 0.001)
  --hidden-size N           LSTM hidden size (default: 256)
  --embedding-dim N         Word embedding dimension (default: 300)
  --num-layers N            Number of LSTM layers (default: 2)
  --dropout F               Dropout rate (default: 0.3)
  --use-char-cnn            Enable character CNN (default: enabled)
  --char-embedding-dim N    Character embedding dim (default: 50)
  --optimizer NAME          Optimizer: adam, sgd, adamw (default: adam)
  --early-stopping N        Early stopping patience (default: 3)
  --checkpoint-dir PATH     Save checkpoints during training
  --min-freq N              Min word frequency for vocab (default: 1)
  --validation-split F      Validation split fraction (default: 0.1)
```

### Programmatic Training

For more control, train programmatically:

```elixir
alias Nasty.Statistics.POSTagging.NeuralTagger
alias Nasty.Statistics.Neural.DataLoader

# Load training data
{:ok, sentences} = DataLoader.load_conllu_file("data/train.conllu")

# Split into train/validation
{train_data, valid_data} = DataLoader.split_data(sentences, validation_split: 0.1)

# Create and configure tagger
tagger = NeuralTagger.new(training_data: train_data)

# Train with custom options
{:ok, trained_tagger} =
  NeuralTagger.train(tagger, train_data,
    epochs: 20,
    batch_size: 32,
    learning_rate: 0.001,
    hidden_size: 512,
    embedding_dim: 300,
    num_lstm_layers: 3,
    dropout: 0.5,
    use_char_cnn: true,
    validation_data: valid_data,
    early_stopping_patience: 5
  )

# Save trained model
:ok = NeuralTagger.save(trained_tagger, "models/pos_advanced.axon")
```

## Advanced Training Options

### Hyperparameter Tuning

**Hidden Size** (`--hidden-size`):
- Small (128-256): Faster training, less memory, slightly lower accuracy
- Medium (256-512): Balanced performance (default: 256)
- Large (512-1024): Best accuracy, requires more memory/time

**Embedding Dimension** (`--embedding-dim`):
- Small (50-100): Fast, low memory
- Medium (300): Good balance (default, matches GloVe)
- Large (300-1024): For very large corpora

**Number of LSTM Layers** (`--num-layers`):
- 1 layer: Fast, simple patterns
- 2 layers: Balanced (default, recommended)
- 3+ layers: Complex patterns, risk of overfitting

**Dropout** (`--dropout`):
- 0.0: No regularization (risk of overfitting)
- 0.3: Good default
- 0.5: Strong regularization for small datasets

**Batch Size** (`--batch-size`):
- Small (8-16): Better generalization, slower
- Medium (32): Good balance (default)
- Large (64-128): Faster training, needs more memory

### Character CNN Configuration

The character-level CNN helps with out-of-vocabulary words:

```bash
mix nasty.train.neural_pos \
  --corpus data/train.conllu \
  --use-char-cnn \
  --char-embedding-dim 50 \
  --char-vocab-size 150
```
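Conceptually, the character CNN consumes each word as a fixed-length sequence of character IDs. A minimal, self-contained sketch of that preprocessing step (the `CharIds` module and its vocabulary layout are illustrative assumptions, not part of Nasty's API):

```elixir
defmodule CharIds do
  # Map a word to a fixed-length list of character IDs.
  # ID 0 is reserved for padding; unknown characters map to 1.
  def encode(word, char_vocab, max_len \\ 12) do
    ids =
      word
      |> String.graphemes()
      |> Enum.map(&Map.get(char_vocab, &1, 1))
      |> Enum.take(max_len)

    # Pad on the right so every word has exactly max_len IDs
    ids ++ List.duplicate(0, max_len - length(ids))
  end
end

vocab = %{"c" => 2, "a" => 3, "t" => 4}
CharIds.encode("cat", vocab, 6)
# => [2, 3, 4, 0, 0, 0]
```

Padding and truncating to a fixed length is what lets the CNN batch words of different sizes into one tensor.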
Disable it if training is too slow:

```bash
mix nasty.train.neural_pos \
  --corpus data/train.conllu \
  --no-char-cnn
```

### Using Pre-trained Embeddings

Load GloVe embeddings for better initialization:

```elixir
alias Nasty.Statistics.Neural.Embeddings

# Load GloVe vectors
glove_embeddings = Embeddings.load_glove("data/glove.6B.300d.txt", word_vocab)

# Train with pre-trained embeddings
{:ok, tagger} =
  NeuralTagger.train(base_tagger, train_data,
    pretrained_embeddings: glove_embeddings,
    # Allow fine-tuning
    freeze_embeddings: false
  )
```

Note: GloVe loading is currently a placeholder. Full implementation coming soon.

### Optimizer Selection

**Adam** (default):
- Adaptive learning rates
- Works well out-of-the-box
- Good for most use cases

**SGD**:
- Simple, stable
- May need learning rate scheduling
- Good baseline

**AdamW**:
- Adam with weight decay
- Better generalization
- Recommended for large models

```bash
mix nasty.train.neural_pos \
  --corpus data/train.conllu \
  --optimizer adamw \
  --learning-rate 0.0001
```

### Early Stopping

Automatically stop training when validation performance plateaus:

```bash
mix nasty.train.neural_pos \
  --corpus data/train.conllu \
  --validation data/dev.conllu \
  --early-stopping 5  # Stop after 5 epochs without improvement
```

### Checkpointing

Save model checkpoints during training:

```bash
mix nasty.train.neural_pos \
  --corpus data/train.conllu \
  --checkpoint-dir checkpoints/ \
  --checkpoint-frequency 2  # Save every 2 epochs
```

Checkpoints are named `checkpoint_epoch_001.axon`, `checkpoint_epoch_002.axon`, etc.

## Model Evaluation

### During Training

The training task prints per-tag metrics:

```
Epoch 1/10
  Loss: 0.456
  Accuracy: 0.923

Per-tag accuracy:
  NOUN: 0.957
  VERB: 0.942
  DET: 0.989
  ...
```

### Post-Training Evaluation

Evaluate on a test set:

```bash
mix nasty.eval.neural_pos \
  --model models/pos_neural_v1.axon \
  --test data/en_ewt-ud-test.conllu
```

Or programmatically:

```elixir
alias Nasty.Statistics.POSTagging.NeuralTagger
alias Nasty.Statistics.Neural.DataLoader

{:ok, model} = NeuralTagger.load("models/pos_neural_v1.axon")
{:ok, test_sentences} = DataLoader.load_conllu_file("data/test.conllu")

# Evaluate: accumulate counts with Enum.reduce (Elixir variables are
# immutable, so rebinding inside a `for` comprehension would not
# update the counters)
{correct, total} =
  Enum.reduce(test_sentences, {0, 0}, fn {words, gold_tags}, {correct, total} ->
    {:ok, pred_tags} = NeuralTagger.predict(model, words, [])

    matches =
      Enum.zip(pred_tags, gold_tags)
      |> Enum.count(fn {p, g} -> p == g end)

    {correct + matches, total + length(gold_tags)}
  end)

accuracy = correct / total
IO.puts("Accuracy: #{Float.round(accuracy * 100, 2)}%")
```

### Metrics to Track

- **Overall Accuracy**: Percentage of correctly tagged tokens
- **Per-Tag Accuracy**: Accuracy for each POS tag
- **Per-Tag Precision/Recall**: For detailed error analysis
- **OOV Accuracy**: Performance on out-of-vocabulary words
- **Training Time**: Total time and time per epoch
- **Convergence**: Number of epochs to best validation score

## Troubleshooting

### Out of Memory

**Symptoms**: Process crashes with a memory error

**Solutions**:
1. Reduce batch size: `--batch-size 16` or `--batch-size 8`
2. Reduce hidden size: `--hidden-size 128`
3. Reduce embedding dimension: `--embedding-dim 100`
4. Disable character CNN: `--no-char-cnn`
5. Use a smaller training corpus subset

### Training Too Slow

**Symptoms**: Hours per epoch

**Solutions**:
1. Enable EXLA GPU support (see Prerequisites)
2. Increase batch size: `--batch-size 64`
3. Disable character CNN if not needed
4. Use fewer LSTM layers: `--num-layers 1`
5. Reduce hidden size: `--hidden-size 128`

### Overfitting

**Symptoms**: High training accuracy, low validation accuracy

**Solutions**:
1. Increase dropout: `--dropout 0.5`
2. Use more training data
3. Enable early stopping: `--early-stopping 3`
4. Reduce model complexity (fewer layers, smaller hidden size)
5.
Add L2 regularization

### Underfitting

**Symptoms**: Low training and validation accuracy

**Solutions**:
1. Increase model capacity: `--hidden-size 512 --num-layers 3`
2. Train longer: `--epochs 20`
3. Lower dropout: `--dropout 0.2`
4. Increase learning rate: `--learning-rate 0.01`
5. Check data quality (wrong labels, formatting issues)

### Validation Loss Not Decreasing

**Symptoms**: Validation loss stays flat or increases

**Solutions**:
1. Lower learning rate: `--learning-rate 0.0001`
2. Add early stopping
3. Check for data issues (train/validation overlap, different distributions)
4. Try a different optimizer: `--optimizer adamw`

### CoNLL-U Loading Errors

**Symptoms**: Parser errors, wrong tag counts

**Solutions**:
1. Verify the file format (tab-separated, 10 columns)
2. Check for empty lines between sentences
3. Ensure UTF-8 encoding
4. Remove or fix malformed lines
5. Validate with the UD validator: https://universaldependencies.org/tools.html

### Model Not Learning

**Symptoms**: Loss stays constant, accuracy at baseline

**Solutions**:
1. Check data quality (are labels correct?)
2. Verify the vocabulary is being built correctly
3. Increase learning rate: `--learning-rate 0.01`
4. Remove or reduce dropout initially
5.
Check for bugs in data preprocessing

## Best Practices

### For Small Datasets (<5K sentences)

```bash
mix nasty.train.neural_pos \
  --corpus data/small_corpus.conllu \
  --epochs 20 \
  --batch-size 16 \
  --hidden-size 128 \
  --embedding-dim 100 \
  --dropout 0.5 \
  --early-stopping 5 \
  --no-char-cnn
```

### For Medium Datasets (5K-50K sentences)

```bash
mix nasty.train.neural_pos \
  --corpus data/medium_corpus.conllu \
  --epochs 15 \
  --batch-size 32 \
  --hidden-size 256 \
  --embedding-dim 300 \
  --dropout 0.3 \
  --use-char-cnn \
  --early-stopping 3
```

### For Large Datasets (50K+ sentences)

```bash
mix nasty.train.neural_pos \
  --corpus data/large_corpus.conllu \
  --epochs 10 \
  --batch-size 64 \
  --hidden-size 512 \
  --embedding-dim 300 \
  --num-layers 3 \
  --dropout 0.3 \
  --use-char-cnn \
  --optimizer adamw \
  --learning-rate 0.0001
```

## Production Deployment

After training, deploy your model:

1. **Save the trained model**:

   ```bash
   # Model is already saved by the training task
   ls -lh models/pos_neural_v1.axon
   ```

2. **Load it in production**:

   ```elixir
   {:ok, model} = NeuralTagger.load("models/pos_neural_v1.axon")
   ```

3. **Integrate with POSTagger**:

   ```elixir
   # Use neural mode
   {:ok, ast} = Nasty.parse(text, language: :en, model: :neural, neural_model: model)

   # Or use ensemble mode
   {:ok, ast} = Nasty.parse(text, language: :en, model: :neural_ensemble, neural_model: model)
   ```

4. **Monitor performance**:
   - Track accuracy on a representative sample
   - Monitor latency (should be <100ms per sentence on CPU)
   - Watch memory usage

## Next Steps

- Read [NEURAL_MODELS.md](NEURAL_MODELS.md) for architecture details
- See [PRETRAINED_MODELS.md](PRETRAINED_MODELS.md) for using Bumblebee transformers
- Check [examples/](../examples/) for complete training scripts
- Explore UD treebanks for more training data
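The Production Deployment checklist above suggests keeping latency under 100ms per sentence. One rough way to spot-check that is Erlang's `:timer.tc`; the sketch below is self-contained plain Elixir (the `avg_ms` helper and the stand-in tagger are hypothetical, not part of Nasty's API; swap in `&NeuralTagger.predict(model, &1, [])` for a real measurement):

```elixir
# Average per-input latency, in milliseconds, of a 1-arity function
# applied over a sample. :timer.tc returns {microseconds, result}.
avg_ms = fn fun, inputs ->
  {micros, _results} = :timer.tc(fn -> Enum.map(inputs, fun) end)
  micros / length(inputs) / 1_000
end

# Stand-in tagger for illustration: tags every word NOUN.
stub_tagger = fn words -> Enum.map(words, fn _ -> "NOUN" end) end

latency = avg_ms.(stub_tagger, [~w(the cat sat), ~w(dogs run)])
IO.puts("Avg latency: #{Float.round(latency, 3)} ms/sentence")
```

Run the same measurement on a sample representative of production text; a stopwatch over a single sentence is too noisy to compare against the 100ms budget.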