# Statistical Models Guide This document provides a comprehensive guide to the advanced statistical models implemented in Nasty: **PCFG (Probabilistic Context-Free Grammar)** for parsing and **CRF (Conditional Random Fields)** for sequence labeling/NER. ## Overview Nasty implements two major classes of statistical models: 1. **PCFG Parser** - For probabilistic phrase structure parsing with ambiguity resolution 2. **CRF-based NER** - For context-aware named entity recognition using sequence labeling Both models follow the `Nasty.Statistics.Model` behaviour, providing consistent interfaces for training, prediction, and persistence. ## PCFG (Probabilistic Context-Free Grammar) ### What is PCFG? PCFG extends traditional context-free grammars with probabilities on production rules. This allows the parser to: - Resolve syntactic ambiguities probabilistically - Score different parse trees and select the most likely one - Handle rare constructions gracefully through smoothing ### Architecture **Core Modules:** - `Nasty.Statistics.Parsing.Grammar` - Rule representation and CNF conversion - `Nasty.Statistics.Parsing.CYKParser` - CYK parsing algorithm - `Nasty.Statistics.Parsing.PCFG` - Main model implementing `Model` behaviour **Data Flow:** ```mermaid flowchart TD A["Training Data (Treebank)"] B["Extract Grammar Rules + Probabilities"] C["CNF Conversion"] D["Trained PCFG Model"] E["CYK Parser (Viterbi)"] F["Parse Tree with Probability"] A --> B B --> C C --> D D --> E E --> F ``` ### Grammar Rules PCFG uses production rules with probabilities: ``` NP → Det Noun [0.35] NP → PropN [0.25] VP → Verb NP [0.45] ``` The sum of probabilities for all rules with the same left-hand side equals 1.0. ### CYK Algorithm The Cocke-Younger-Kasami algorithm: 1. Requires grammar in Chomsky Normal Form (CNF) 2. Uses dynamic programming (O(n³) complexity) 3. Fills a chart bottom-up 4. Extracts highest probability parse tree **Complexity:** - Time: O(n³ × |G|) where n = sentence length, |G| = grammar size - Space: O(n² × |G|) ### Training Train PCFG from Universal Dependencies treebanks or raw grammar rules: ```elixir # From raw rules training_data = [ {:np, [:det, :noun], 350}, # Count: 350 occurrences {:np, [:propn], 250}, {:vp, [:verb, :np], 450} ] model = PCFG.new() {:ok, trained} = PCFG.train(model, training_data, smoothing: 0.001) :ok = PCFG.save(trained, "priv/models/en/pcfg.model") ``` ### Prediction Parse sentences to get probabilistic parse trees: ```elixir {:ok, model} = PCFG.load("priv/models/en/pcfg.model") tokens = [%Token{text: "the", pos_tag: :det}, %Token{text: "cat", pos_tag: :noun}] {:ok, parse_tree} = PCFG.predict(model, tokens) # Parse tree contains: # - label: :np # - probability: 0.0245 # - children: [...] # - span: {0, 1} ``` ### N-Best Parsing Get multiple parse hypotheses: ```elixir {:ok, trees} = PCFG.predict(model, tokens, n_best: 5) # Returns top 5 parse trees sorted by probability ``` ### Evaluation Compute bracketing precision/recall/F1: ```elixir test_data = [{tokens, gold_tree}, ...] metrics = PCFG.evaluate(model, test_data) # %{precision: 0.87, recall: 0.85, f1: 0.86, exact_match: 0.42} ``` ### Mix Tasks ```bash # Train PCFG from UD treebank mix nasty.train.pcfg \ --corpus data/en_ewt-ud-train.conllu \ --output priv/models/en/pcfg.model \ --smoothing 0.001 \ --test data/en_ewt-ud-test.conllu # Evaluate PCFG mix nasty.eval \ --model priv/models/en/pcfg.model \ --test data/en_ewt-ud-test.conllu \ --type pcfg ``` ## CRF (Conditional Random Fields) ### What is CRF? CRFs are discriminative models for sequence labeling that consider: - Rich feature sets (lexical, orthographic, contextual) - Label dependencies (transition probabilities) - Global normalization (partition function) Unlike HMMs, CRFs can handle overlapping features and don't make independence assumptions. ### Architecture **Core Modules:** - `Nasty.Statistics.SequenceLabeling.Features` - Feature extraction - `Nasty.Statistics.SequenceLabeling.Viterbi` - Decoding algorithm - `Nasty.Statistics.SequenceLabeling.Optimizer` - Gradient descent training - `Nasty.Statistics.SequenceLabeling.CRF` - Main model implementing `Model` behaviour **Data Flow:** ```mermaid flowchart TD A["Tokens + Labels (Training)"] B["Feature Extraction"] C["Forward-Backward (Gradient Computation)"] D["Gradient Descent Optimization"] E["Trained CRF Model"] F["Viterbi Decoding"] G["Label Sequence"] A --> B B --> C C --> D D --> E E --> F F --> G ``` ### Feature Extraction CRFs use rich feature sets extracted from tokens: **Lexical Features:** - word, word_lower, lemma **Orthographic Features:** - capitalized, all_caps, word_shape (Xxxx, XXX, ddd) - has_digit, has_hyphen, has_punctuation **POS Features:** - pos_tag **Context Features:** - prev_word, next_word - prev_pos, next_pos - is_first, is_last **Affix Features:** - prefix-1, prefix-2, ..., prefix-4 - suffix-1, suffix-2, ..., suffix-4 **Gazetteer Features:** - in_gazetteer=person/place/org **Pattern Features:** - pattern=all_digits, pattern=year, pattern=acronym - short_word, long_word ### Model Linear-chain CRF: ``` P(y|x) = exp(score(x, y)) / Z(x) score(x, y) = Σ feature_weights + Σ transition_weights ``` Where: - `feature_weights`: Map of (feature, label) → weight - `transition_weights`: Map of (prev_label, curr_label) → weight - `Z(x)`: Partition function (normalization) ### Training Train CRF on BIO-tagged sequences: ```elixir # BIO tagging: B-PER, I-PER, B-GPE, I-GPE, B-ORG, I-ORG, O training_data = [ { [%Token{text: "John"}, %Token{text: "Smith"}], [:b_per, :i_per] }, ... ] model = CRF.new(labels: [:b_per, :i_per, :b_gpe, :i_gpe, :b_org, :i_org, :o]) {:ok, trained} = CRF.train(model, training_data, iterations: 100, learning_rate: 0.1, regularization: 1.0 ) :ok = CRF.save(trained, "priv/models/en/crf_ner.model") ``` **Training Options:** - `:iterations` - Maximum iterations (default: 100) - `:learning_rate` - Initial learning rate (default: 0.1) - `:regularization` - L2 regularization (default: 1.0) - `:method` - `:sgd`, `:momentum`, `:adagrad` (default: `:momentum`) - `:convergence_threshold` - Gradient norm threshold (default: 0.01) ### Prediction Label sequences using Viterbi decoding: ```elixir {:ok, model} = CRF.load("priv/models/en/crf_ner.model") tokens = [%Token{text: "John"}, %Token{text: "lives"}, %Token{text: "in"}, %Token{text: "NYC"}] {:ok, labels} = CRF.predict(model, tokens) # [:b_per, :o, :o, :b_gpe] ``` ### Viterbi Algorithm Find most likely label sequence: 1. Initialize scores for first position 2. For each subsequent position: - Compute emission score (from features) - Compute transition score (from previous label) - Track best previous label (backpointer) 3. Backtrack from best final label **Complexity:** - Time: O(n × L²) where n = sequence length, L = number of labels - Space: O(n × L) ### Forward-Backward Algorithm Used during training to compute gradients: - **Forward**: P(label at position t | observations up to t) - **Backward**: P(observations after t | label at t) - **Partition Function**: Z(x) = sum over all label sequences ### Optimization Gradient descent with momentum: ``` Gradient = Observed Features - Expected Features Weight Update: w := w - learning_rate * (gradient + regularization * w) Momentum: v := momentum * v + gradient ``` ### Mix Tasks ```bash # Train CRF NER mix nasty.train.crf \ --corpus data/ner_train.conllu \ --output priv/models/en/crf_ner.model \ --task ner \ --iterations 100 \ --learning-rate 0.1 \ --regularization 1.0 \ --test data/ner_test.conllu # Evaluate NER mix nasty.eval \ --model priv/models/en/crf_ner.model \ --test data/ner_test.conllu \ --type crf \ --task ner ``` ## Integration with English Pipeline Both models integrate seamlessly with the existing English module: ### PCFG Integration The PCFG parser is integrated into `Nasty.Language.English.SentenceParser`: ```elixir # Use PCFG parsing {:ok, tokens} = English.tokenize(text) {:ok, tagged} = English.tag_pos(tokens) {:ok, document} = English.parse(tagged, model: :pcfg) # Or directly with SentenceParser alias Nasty.Language.English.SentenceParser {:ok, sentences} = SentenceParser.parse_sentences(tokens, model: :pcfg) # With specific model (bypasses registry lookup) {:ok, pcfg_model} = PCFG.load("path/to/model.pcfg") {:ok, sentences} = SentenceParser.parse_sentences(tokens, model: :pcfg, pcfg_model: pcfg_model ) # Falls back to rule-based if :model option not specified or model unavailable {:ok, document} = English.parse(tagged) # Uses rule-based parsing ``` ### CRF Integration The CRF model is integrated into `Nasty.Language.English.EntityRecognizer`: ```elixir # Use CRF for NER alias Nasty.Language.English.EntityRecognizer {:ok, tokens} = English.tokenize(text) {:ok, tagged} = English.tag_pos(tokens) # CRF-based entity recognition entities = EntityRecognizer.recognize(tagged, model: :crf) # With specific model (bypasses registry lookup) {:ok, crf_model} = CRF.load("path/to/model.crf") entities = EntityRecognizer.recognize(tagged, model: :crf, crf_model: crf_model ) # Falls back to rule-based if :model option not specified or model unavailable entities = EntityRecognizer.recognize(tagged) # Uses rule-based NER ``` ### Model Registry Both integrations use `Nasty.Statistics.ModelLoader` to automatically load models: ```elixir # Models are loaded from the registry by task and language # PCFG: ModelLoader.load_latest(:en, :pcfg) # CRF: ModelLoader.load_latest(:en, :ner_crf) # Models should be saved with proper naming: # priv/models/en/pcfg_v1.model # priv/models/en/ner_crf_v1.model ``` ## Performance Expectations ### PCFG Parser **Accuracy:** - Bracketing F1: 85-90% on UD test sets - Higher than rule-based parsing for ambiguous structures **Speed:** - ~50-100ms per sentence (CPU) - Depends on sentence length and grammar size **Memory:** - ~50-100MB model file - O(n²) space during parsing ### CRF-based NER **Accuracy:** - Entity-level F1: 92-95% (vs 70-80% rule-based) - Proper boundary detection - Better handling of unseen entities **Speed:** - ~20-30ms per sentence (CPU) - Linear in sequence length **Memory:** - ~20-50MB model file (depends on feature set) - O(n) space during decoding ## Comparison with Other Approaches ### PCFG vs Rule-based Parsing | Aspect | PCFG | Rule-based | |--------|------|------------| | Ambiguity | Probabilistic resolution | Greedy heuristics | | Unknown structures | Graceful degradation | May fail | | Training | Requires treebank | None needed | | Speed | Slower (O(n³)) | Faster (O(n)) | | Accuracy | Higher on complex sentences | Good for simple sentences | ### CRF vs Rule-based NER | Aspect | CRF | Rule-based | |--------|-----|------------| | Context | Global sequence context | Local patterns | | Features | Rich feature sets | Limited to POS + patterns | | Boundaries | Learned from data | Heuristic rules | | Training | Requires annotated data | None needed | | Unseen entities | Better generalization | Pattern matching only | | Accuracy | 92-95% F1 | 70-80% F1 | ## Best Practices ### For PCFG 1. **Training Data**: Use high-quality treebanks (UD, Penn Treebank) 2. **Smoothing**: Use add-k smoothing (k=0.001) for unseen rules 3. **CNF Conversion**: Always convert to CNF before parsing 4. **Beam Search**: Use beam width 10-20 for efficiency 5. **Evaluation**: Report bracketing F1, not just accuracy ### For CRF 1. **Features**: Start with full feature set, prune if needed 2. **Regularization**: Use L2 (λ=1.0) to prevent overfitting 3. **Learning Rate**: Start with 0.1, decay if not converging 4. **BIO Tagging**: Always use BIO scheme for proper boundaries 5. **Gazetteers**: Include domain-specific entity lists 6. **Iterations**: 100-200 iterations usually sufficient ## Troubleshooting ### PCFG Issues **Problem**: Parse fails (returns :error) - **Solution**: Check if all words have lexical rules; add unknown word handling **Problem**: Low parsing F1 - **Solution**: Increase training data; adjust smoothing; check CNF conversion **Problem**: Slow parsing - **Solution**: Reduce beam width; prune low-probability rules ### CRF Issues **Problem**: Training doesn't converge - **Solution**: Reduce learning rate; increase regularization; check gradient computation **Problem**: Low NER F1 - **Solution**: Add more features; increase training data; check BIO tagging consistency **Problem**: Slow training - **Solution**: Reduce feature set; use AdaGrad; parallelize if possible ## Future Enhancements ### PCFG - Lexicalized PCFG (head-driven) - Latent variable PCFG - Neural PCFG with embeddings - Dependency conversion from CFG parse ### CRF - Higher-order CRF (beyond linear-chain) - Semi-Markov CRF for multi-token entities - Structured perceptron as alternative - Neural CRF with BiLSTM features ## References ### PCFG - Charniak, E. (1997). Statistical Parsing with a Context-Free Grammar and Word Statistics - Klein & Manning (2003). Accurate Unlexicalized Parsing - Petrov et al. (2006). Learning Accurate, Compact, and Interpretable Tree Annotation ### CRF - Lafferty et al. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data - Sutton & McCallum (2012). An Introduction to Conditional Random Fields - Tjong Kim Sang & De Meulder (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition ## Related Documentation - [ROADMAP.md](ROADMAP.md) - Development roadmap and priorities - [NEURAL_MODELS.md](NEURAL_MODELS.md) - Neural network architectures (BiLSTM-CRF) - [TRAINING_NEURAL.md](TRAINING_NEURAL.md) - Neural model training guide - [PARSING_GUIDE.md](PARSING_GUIDE.md) - Comprehensive parsing documentation - [WARP.md](../WARP.md) - Command reference for training and evaluation