# Parsing Guide

This document is a technical guide to the parsing algorithms implemented in Nasty: tokenization, POS tagging, morphological analysis, phrase parsing, sentence parsing, and dependency extraction.

## Table of Contents

1. [Pipeline Overview](#pipeline-overview)
2. [Tokenization](#tokenization)
3. [POS Tagging](#pos-tagging)
4. [Morphological Analysis](#morphological-analysis)
5. [Phrase Parsing](#phrase-parsing)
6. [Sentence Parsing](#sentence-parsing)
7. [Dependency Extraction](#dependency-extraction)
8. [Integration Example](#integration-example)

## Pipeline Overview

The Nasty NLP pipeline processes text through the following stages:

```mermaid
flowchart TD
    A[Input Text]
    B["[1] Tokenization (NimbleParsec)"]
    C["[2] POS Tagging (Rule-based / HMM / Neural)"]
    D["[3] Morphological Analysis (Lemmatization + Features)"]
    E["[4] Phrase Parsing (Bottom-up CFG)"]
    F["[5] Sentence Parsing (Clause Detection)"]
    G["[6] Dependency Extraction (UD Relations)"]
    H[Complete AST]
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
```

Each stage:

- Takes structured input from the previous stage
- Adds linguistic annotations
- Preserves position tracking (span information)
- Maintains language metadata

## Tokenization

### Algorithm: NimbleParsec Combinator Parsing

**Module**: `Nasty.Language.English.Tokenizer`

**Approach**: Bottom-up combinator-based parsing using NimbleParsec, processing text left-to-right with greedy longest-match.

### Token Types

1. **Hyphenated words**: `well-known`, `twenty-one`
2. **Contractions**: `don't`, `I'm`, `we've`, `it's`
3. **Numbers**: integers (`123`), decimals (`3.14`)
4. **Words**: alphabetic sequences
5. **Punctuation**: sentence-ending (`.`, `!`, `?`), commas, quotes, brackets, etc.

### Parser Combinators

```elixir
# Order matters - more specific patterns first
token = choice([
  hyphenated,    # "well-known"
  contraction,   # "don't"
  number,        # "123", "3.14"
  word,          # "cat"
  punctuation    # ".", ",", etc.
])
```

### Position Tracking

Every token includes precise position information:

```elixir
%Token{
  text: "cat",
  span: %{
    start_pos: {1, 5},    # {line, column}
    start_offset: 4,      # byte offset
    end_pos: {1, 8},
    end_offset: 7
  }
}
```

Position tracking handles:

- Multi-line text with newline counting
- Whitespace between tokens (ignored but tracked)
- UTF-8 byte offsets vs. character positions

### Edge Cases

- **Empty text**: Returns `{:ok, []}`
- **Whitespace-only**: Returns `{:ok, []}`
- **Unparseable text**: Returns `{:error, {:parse_incomplete, ...}}`
- **Contractions**: Parsed as single tokens, not split

### Example

```elixir
{:ok, tokens} = Tokenizer.tokenize("I don't know.")
# => [
#   %Token{text: "I", pos_tag: :x, span: ...},
#   %Token{text: "don't", pos_tag: :x, span: ...},
#   %Token{text: "know", pos_tag: :x, span: ...},
#   %Token{text: ".", pos_tag: :punct, span: ...}
# ]
```

## POS Tagging

### Three Tagging Models

**Module**: `Nasty.Language.English.POSTagger`

Nasty supports three POS tagging approaches with different accuracy/speed tradeoffs:

| Model | Accuracy | Speed | Method |
|-------|----------|-------|--------|
| Rule-based | ~85% | Very Fast | Lexical lookup + morphology + context |
| HMM (Trigram) | ~95% | Fast | Viterbi decoding with add-k smoothing |
| Neural (BiLSTM-CRF) | 97-98% | Moderate | Deep learning with contextual embeddings |

### 1. Rule-Based Tagging

**Algorithm**: Sequential pattern matching with three-tier lookup

#### Tagging Strategy

1. **Lexical Lookup**: Closed-class words (determiners, pronouns, prepositions, etc.)
   - 450+ words in lookup tables
   - Example: `"the"` → `:det`, `"in"` → `:adp`, `"and"` → `:cconj`

2. **Morphological Analysis**: Suffix-based tagging for open-class words
   ```
   Nouns: -tion, -sion, -ment, -ness, -ity, -ism
   Verbs: -ing, -ed, -s/-es (3rd person singular)
   Adjectives: -ful, -less, -ous, -ive, -able, -ible
   Adverbs: -ly
   ```

3. **Contextual Disambiguation**: Local context rules
   - Word after determiner → likely noun
   - Word after preposition → likely noun
   - Word before noun → likely adjective
   - Capitalized words → proper nouns

#### Third-Person Singular Verb Detection

Conservative approach to avoid mistagging plural nouns as verbs:

```elixir
# "walks" → :verb (stem "walk" in common verb list)
# "books" → :noun (not a verb stem)
# "stations" → :noun (ends with -tions, noun suffix)
```

Checks:

- Exclude capitalized words (proper nouns)
- Exclude words with clear noun suffixes (-tions, -ments, etc.)
- Verify stem is in common verb list (140+ verbs)

### 2. HMM-Based Tagging

**Algorithm**: Viterbi decoding with trigram Hidden Markov Model

#### Model Components

1. **Emission Probabilities**: P(word|tag)
   - Learned from tagged training data
   - Smoothing for unknown words: add-k smoothing (k=0.001)

2. **Transition Probabilities**: P(tag₃|tag₁, tag₂)
   - Trigram model for better context
   - Special START markers for sentence boundaries
   - Add-k smoothing for unseen trigrams

3. **Initial Probabilities**: P(tag) at sentence start
   - Distribution of first tags in training sentences

#### Training Process

```elixir
training_data = [
  {["The", "cat", "sat"], [:det, :noun, :verb]},
  ...
]

model = HMMTagger.new()
{:ok, trained} = HMMTagger.train(model, training_data, [])
```

Counts:

- Emission counts: `{word, tag}` pairs
- Transition counts: `{tag1, tag2} → tag3` trigrams
- Initial counts: first tag in each sequence

Normalization (`*` denotes a sum over every value in that position):

```
P(word|tag) = (count(word, tag) + k) / (sum(*, tag) + k * vocab_size)
P(tag3|tag1,tag2) = (count(tag1,tag2,tag3) + k) / (sum(tag1,tag2,*) + k * num_tags)
```

#### Viterbi Decoding

Dynamic programming algorithm to find the most likely tag sequence:

```
score[t][tag] = max over prev_tags of:
  score[t-1][prev_tag] +
  log P(tag|prev_prev_tag, prev_tag) +
  log P(word_t|tag)
```

Steps:

1. **Initialization**: Score each tag for first word
2. **Forward Pass**: Compute best score for each (position, tag) pair
3. **Backpointers**: Track best previous tag for reconstruction
4. **Backtracking**: Reconstruct best path from end to start

### 3. Neural Tagging (BiLSTM-CRF)

**Algorithm**: Bidirectional LSTM with Conditional Random Field layer

**Module**: `Nasty.Statistics.POSTagging.NeuralTagger`

#### Architecture

```mermaid
flowchart TD
    A["Input: Word IDs [batch_size, seq_len]"]
    B["Word Embeddings [batch_size, seq_len, embedding_dim]"]
    C["BiLSTM Layers (×2) [batch_size, seq_len, hidden_size * 2]"]
    D["Linear Projection [batch_size, seq_len, num_tags]"]
    E["CRF Layer (optional) [batch_size, seq_len, num_tags]"]
    F["Output: Tag IDs [batch_size, seq_len]"]
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
```

#### Key Components

1. **Word Embeddings**: 300-dimensional learned representations
   - Vocabulary built from training data (min frequency = 2)
   - Unknown words mapped to special UNK token

2. **Bidirectional LSTM**: 2 layers, 256 hidden units each
   - Forward LSTM: left-to-right context
   - Backward LSTM: right-to-left context
   - Concatenated outputs: 512 dimensions

3. **CRF Layer**: Learns tag transition constraints
   - Enforces valid tag sequences (e.g., DET → NOUN more likely than DET → VERB)
   - Joint decoding over entire sequence

4. **Dropout**: 0.3 rate for regularization

#### Training

```elixir
tagger = NeuralTagger.new(vocab_size: 10000, num_tags: 17)
training_data = [{["The", "cat"], [:det, :noun]}, ...]

{:ok, trained} = NeuralTagger.train(tagger, training_data,
  epochs: 10,
  batch_size: 32,
  learning_rate: 0.001,
  validation_split: 0.1
)
```

Training features:

- Adam optimizer (adaptive learning rate)
- Cross-entropy loss (or CRF loss if using CRF layer)
- Early stopping with patience=3
- Validation set monitoring (10% split)

#### Inference

```elixir
{:ok, tags} = NeuralTagger.predict(trained, ["The", "cat", "sat"], [])
# => {:ok, [:det, :noun, :verb]}
```

Steps:

1. Convert words to IDs using vocabulary
2. Pad sequences to a common length within the batch
3. Run through BiLSTM-CRF model
4. Argmax over tag dimension (or Viterbi if using CRF)
5. Convert tag IDs back to atoms

### Model Selection

Use the `:model` option in `POSTagger.tag_pos/2`:

```elixir
# Rule-based (fast, ~85% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :rule_based)

# HMM (fast, ~95% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :hmm)

# Neural (moderate, 97-98% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :neural)

# Ensemble: HMM + rule-based fallback for punctuation/numbers
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :ensemble)

# Neural ensemble: Neural + rule-based fallback
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :neural_ensemble)
```

## Morphological Analysis

### Algorithm: Dictionary + Rule-Based Lemmatization

**Module**: `Nasty.Language.English.Morphology`

**Approach**: Two-tier lemmatization with irregular form lookup followed by rule-based suffix removal.

### Lemmatization Process

#### 1. Irregular Form Lookup

Check dictionaries for common irregular forms:

**Verbs** (80+ irregular verbs):
```
"went" → "go", "was" → "be", "ate" → "eat", "ran" → "run"
```

**Nouns** (12 irregular nouns):
```
"children" → "child", "men" → "man", "mice" → "mouse"
```

**Adjectives** (12 irregular comparatives/superlatives):
```
"better" → "good", "best" → "good", "worse" → "bad"
```

#### 2. Rule-Based Suffix Removal

If no irregular form is found, apply POS-specific rules:

**Verbs**:
```
-ing → stem (handling doubled consonants)
  "running" → "run" (remove doubled 'n')
  "making" → "make"
-ed → stem (handling doubled consonants, silent e)
  "stopped" → "stop" (remove doubled 'p')
  "liked" → "like" (restore silent 'e')
-s → base form (3rd person singular)
  "walks" → "walk"
```

**Nouns**:
```
-ies → -y (flies → fly)
-es → base (if stem ends in s/x/z/ch/sh)
  "boxes" → "box", "dishes" → "dish"
-s → base (cats → cat)
```

**Adjectives**:
```
-est → base (superlative)
  "fastest" → "fast" (handle doubled consonants)
-er → base (comparative)
  "faster" → "fast"
```

### Morphological Feature Extraction

#### Verb Features

```elixir
%{
  tense: :present | :past,
  aspect: :progressive,    # for -ing forms
  person: 3,               # for 3rd person singular
  number: :singular
}
```

Examples:

- `"running"` → `%{tense: :present, aspect: :progressive}`
- `"walked"` → `%{tense: :past}`
- `"walks"` → `%{tense: :present, person: 3, number: :singular}`

#### Noun Features

```elixir
%{number: :singular | :plural}
```

Examples:

- `"cat"` → `%{number: :singular}`
- `"cats"` → `%{number: :plural}`

#### Adjective Features

```elixir
%{degree: :positive | :comparative | :superlative}
```

Examples:

- `"fast"` → `%{degree: :positive}`
- `"faster"` → `%{degree: :comparative}`
- `"fastest"` → `%{degree: :superlative}`

### Example

```elixir
{:ok, tokens} = Tokenizer.tokenize("running cats")
{:ok, tagged} = POSTagger.tag_pos(tokens)
{:ok, analyzed} = Morphology.analyze(tagged)
# => [
#   %Token{text: "running", pos_tag: :verb, lemma: "run",
#          morphology: %{tense: :present, aspect: :progressive}},
#   %Token{text: "cats", pos_tag: :noun, lemma: "cat",
#          morphology: %{number: :plural}}
# ]
```

## Phrase Parsing

### Algorithm: Bottom-Up Pattern Matching with Context-Free Grammar

**Module**: `Nasty.Language.English.PhraseParser`

**Approach**: Greedy longest-match, left-to-right phrase construction using simplified CFG rules.
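
The greedy, left-to-right strategy can be illustrated with a reduced noun-phrase rule (`Det? Adj* Noun`). This is a minimal sketch under stated assumptions, not the `PhraseParser` implementation: the module name `GreedyNPSketch`, the `{text, pos_tag}` tuples standing in for `%Token{}`, and the plain map standing in for `%NounPhrase{}` are all hypothetical.

```elixir
defmodule GreedyNPSketch do
  # Hypothetical, simplified token: a {text, pos_tag} tuple, not %Token{}.

  # Tries to match Det? Adj* (Noun | PropN | Pron) starting at index `pos`.
  # Returns {:ok, phrase, next_pos} on success, or :error.
  def parse_noun_phrase(tokens, pos) do
    {det, pos1} = take_optional(tokens, pos, :det)
    {adjs, pos2} = take_while(tokens, pos1, :adj)

    case Enum.at(tokens, pos2) do
      {text, tag} when tag in [:noun, :propn, :pron] ->
        {:ok, %{determiner: det, modifiers: adjs, head: text}, pos2 + 1}

      _ ->
        :error
    end
  end

  # Consume a single token with the given tag if one is present.
  defp take_optional(tokens, pos, tag) do
    case Enum.at(tokens, pos) do
      {text, ^tag} -> {text, pos + 1}
      _ -> {nil, pos}
    end
  end

  # Greedy step: consume every consecutive token carrying the given tag.
  defp take_while(tokens, pos, tag, acc \\ []) do
    case Enum.at(tokens, pos) do
      {text, ^tag} -> take_while(tokens, pos + 1, tag, [text | acc])
      _ -> {Enum.reverse(acc), pos}
    end
  end
end

tokens = [{"the", :det}, {"big", :adj}, {"old", :adj}, {"cat", :noun}, {"sat", :verb}]

GreedyNPSketch.parse_noun_phrase(tokens, 0)
# => {:ok, %{determiner: "the", head: "cat", modifiers: ["big", "old"]}, 4}

GreedyNPSketch.parse_noun_phrase(tokens, 4)
# => :error (a verb cannot head a noun phrase)
```

Returning the next position alongside the phrase lets a caller continue parsing where the phrase ended, which is how one phrase parser can hand off to the next without backtracking.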

### Grammar Rules

```
NP   → Det? Adj* (Noun | PropN | Pron) (PP | RelClause)*
VP   → Aux* Verb (NP)? (PP | AdvP)*
PP   → Prep NP
AdjP → Adv? Adj
AdvP → Adv
RC   → RelPron/RelAdv Clause
```

### Phrase Types

#### 1. Noun Phrase (NP)

**Components**:

- **Determiner** (optional): `the`, `a`, `my`, `some`
- **Modifiers** (0+): adjectives, adjectival phrases
- **Head** (required): noun, proper noun, or pronoun
- **Post-modifiers** (0+): prepositional phrases, relative clauses

**Examples**:
```
"the cat" → [det: "the", head: "cat"]
"the big cat" → [det: "the", modifiers: ["big"], head: "cat"]
"the cat on the mat" → [det: "the", head: "cat",
                         post_modifiers: [PP("on", NP("the mat"))]]
```

**Special Cases**:

- **Pronouns as NPs**: `"I"`, `"he"`, `"they"` can stand alone
- **Multi-word proper nouns**: `"New York"` → consecutive PROPNs merged as modifiers

#### 2. Verb Phrase (VP)

**Components**:

- **Auxiliaries** (0+): `is`, `have`, `will`, `can`
- **Head** (required): main verb
- **Complements** (0+): object NP, PPs, adverbs

**Examples**:
```
"sat" → [head: "sat"]
"is running" → [auxiliaries: ["is"], head: "running"]
"saw the cat" → [head: "saw", complements: [NP("the cat")]]
"sat on the mat" → [head: "sat", complements: [PP("on", NP("the mat"))]]
```

**Special Case - Copula Construction**:

If only auxiliaries are found (no main verb), treat the last auxiliary as the main verb:
```
"is happy" → [head: "is", complements: [AdjP("happy")]]
"are engineers" → [head: "are", complements: [NP("engineers")]]
```

#### 3. Prepositional Phrase (PP)

**Structure**: `Prep + NP`

**Examples**:
```
"on the mat" → [head: "on", object: NP("the mat")]
"in the house" → [head: "in", object: NP("the house")]
```

#### 4. Adjectival Phrase (AdjP)

**Structure**: `Adv? + Adj`

**Examples**:
```
"very big" → [intensifier: "very", head: "big"]
"quite small" → [intensifier: "quite", head: "small"]
```

#### 5. Adverbial Phrase (AdvP)

**Structure**: `Adv` (currently simple single-word adverbs)

**Examples**:
```
"quickly" → [head: "quickly"]
"often" → [head: "often"]
```

#### 6. Relative Clause (RC)

**Structure**: `RelPron/RelAdv + Clause`

**Relativizers**:

- Pronouns: `who`, `whom`, `whose`, `which`, `that`
- Adverbs: `where`, `when`, `why`

**Examples**:
```
"that sits" → [relativizer: "that", clause: VP("sits")]
"who I know" → [relativizer: "who", clause: [subject: NP("I"), predicate: VP("know")]]
```

**Two Patterns**:

1. **Relativizer as subject**: `"that sits"` → clause has only VP
2. **Relativizer as object**: `"that I see"` → clause has NP subject + VP

### Parsing Process

Each `parse_*_phrase` function:

1. Checks current position in token list
2. Attempts to consume tokens matching the pattern
3. Recursively parses sub-phrases (e.g., NP within PP)
4. Calculates span from first to last consumed token
5. Returns `{:ok, phrase, next_position}` or `:error`

**Greedy Matching**: Consumes as many tokens as possible for each phrase (e.g., all consecutive adjectives as modifiers).

**Position Tracking**: Every phrase includes span covering all constituent tokens.

### Example

```elixir
tokens = [
  %Token{text: "the", pos_tag: :det},
  %Token{text: "big", pos_tag: :adj},
  %Token{text: "cat", pos_tag: :noun},
  %Token{text: "on", pos_tag: :adp},
  %Token{text: "the", pos_tag: :det},
  %Token{text: "mat", pos_tag: :noun}
]

{:ok, np, _pos} = PhraseParser.parse_noun_phrase(tokens, 0)
# => %NounPhrase{
#   determiner: "the",
#   modifiers: ["big"],
#   head: "cat",
#   post_modifiers: [
#     %PrepositionalPhrase{
#       head: "on",
#       object: %NounPhrase{determiner: "the", head: "mat"}
#     }
#   ]
# }
```

## Sentence Parsing

### Algorithm: Clause Detection with Coordination and Subordination

**Module**: `Nasty.Language.English.SentenceParser`

**Approach**: Split on sentence boundaries, then parse each sentence into clauses with support for simple, compound, and complex structures.

### Sentence Structures

1. **Simple**: Single independent clause
   - `"The cat sat."`

2. **Compound**: Multiple coordinated independent clauses
   - `"The cat sat and the dog ran."`

3. **Complex**: Independent clause with subordinate clause(s)
   - `"The cat sat because it was tired."`

4. **Fragment**: Incomplete sentence (e.g., subordinate clause alone)

### Sentence Functions

Inferred from punctuation:

- `.` → `:declarative` (statement)
- `?` → `:interrogative` (question)
- `!` → `:exclamative` (exclamation)

### Parsing Process

#### 1. Sentence Boundary Detection

Split on sentence-ending punctuation (`.`, `!`, `?`):

```elixir
split_sentences(tokens)
# Groups tokens into sentence units
```

#### 2. Clause Parsing

For each sentence group, parse into clause structure:

**Grammar**:
```
Sentence → Clause+
Clause → SubordConj? NP? VP
```

**Three Clause Types**:

- **Independent**: Can stand alone as complete sentence
- **Subordinate**: Begins with subordinating conjunction (`because`, `if`, `when`, etc.)
- **Relative**: Part of relative clause structure (handled in phrase parsing)

#### 3. Coordination Detection

Look for coordinating conjunctions (`:cconj`):

- `and`, `or`, `but`, `nor`, `yet`, `so`, `for`

If found, split and parse both sides:

```elixir
"The cat sat and the dog ran"
# Split at "and"
# Parse: Clause1 ("The cat sat") + Clause2 ("the dog ran")
# Result: [Clause1, Clause2]
```

#### 4. Subordination Detection

Check for subordinating conjunction (`:sconj`) at start:

- `after`, `although`, `because`, `before`, `if`, `since`, `when`, `while`, etc.

If found, mark clause as subordinate:

```elixir
"because it was tired"
# Parse: Clause with subordinator: "because"
# Type: :subordinate
```

### Simple Clause Parsing

**Algorithm**: Find verb, split at verb to identify subject and predicate.

**Steps**:

1. Find first verb/auxiliary in token sequence
2. **If verb at position 0**: Imperative sentence (no subject)
   - Parse VP starting at position 0
   - Subject = nil
3. **If verb at position > 0**: Declarative sentence
   - Try to parse NP before verb (subject)
   - Parse VP starting at end of subject (predicate)
4. **If no subject found**: Try VP alone (imperative or fragment)

**Fallback**: If parsing fails, create minimal clause with first verb found.

### Clause Structure

```elixir
%Clause{
  type: :independent | :subordinate | :relative,
  subordinator: Token.t() | nil,    # "because", "if", etc.
  subject: NounPhrase.t() | nil,
  predicate: VerbPhrase.t(),
  language: :en,
  span: span
}
```

### Sentence Structure

```elixir
%Sentence{
  function: :declarative | :interrogative | :exclamative,
  structure: :simple | :compound | :complex | :fragment,
  main_clause: Clause.t(),
  additional_clauses: [Clause.t()],    # for compound sentences
  language: :en,
  span: span
}
```

### Example

```elixir
tokens = tokenize_and_tag("The cat sat and the dog ran.")
{:ok, [sentence]} = SentenceParser.parse_sentences(tokens)
# => %Sentence{
#   function: :declarative,
#   structure: :compound,
#   main_clause: %Clause{
#     type: :independent,
#     subject: NP("The cat"),
#     predicate: VP("sat")
#   },
#   additional_clauses: [
#     %Clause{
#       type: :independent,
#       subject: NP("the dog"),
#       predicate: VP("ran")
#     }
#   ]
# }
```

## Dependency Extraction

### Algorithm: Phrase Structure to Universal Dependencies Conversion

**Module**: `Nasty.Language.English.DependencyExtractor`

**Approach**: Traverse phrase structure AST and extract grammatical relations as Universal Dependencies (UD) relations.
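
As a rough illustration of this traversal, the sketch below walks a simplified clause and emits `{relation, head, dependent}` tuples. It is not the `DependencyExtractor` implementation: the module name `DependencySketch` and the plain-map AST with string heads are hypothetical stand-ins for the real structs, and only the `nsubj`, `det`, `amod`, `aux`, and `obj` relations are covered.

```elixir
defmodule DependencySketch do
  # Hypothetical, simplified AST: plain maps with string heads instead of
  # the %Clause{}, %NounPhrase{} and %VerbPhrase{} structs.

  # Clause level: nsubj links the predicate head to the subject head.
  def extract(%{subject: subject, predicate: predicate}) do
    subject_deps =
      case subject do
        nil -> []
        np -> [{:nsubj, predicate.head, np.head} | from_np(np)]
      end

    subject_deps ++ from_vp(predicate)
  end

  # NP level: the determiner and adjectives attach to the noun head.
  defp from_np(np) do
    det = if np[:determiner], do: [{:det, np.head, np.determiner}], else: []
    amods = for adj <- Map.get(np, :modifiers, []), do: {:amod, np.head, adj}
    det ++ amods
  end

  # VP level: auxiliaries attach to the main verb; in this sketch every
  # complement is assumed to be an object NP.
  defp from_vp(vp) do
    auxes = for aux <- Map.get(vp, :auxiliaries, []), do: {:aux, vp.head, aux}

    objs =
      Enum.flat_map(Map.get(vp, :complements, []), fn np ->
        [{:obj, vp.head, np.head} | from_np(np)]
      end)

    auxes ++ objs
  end
end

clause = %{
  subject: %{determiner: "the", modifiers: ["big"], head: "cat"},
  predicate: %{
    auxiliaries: ["has"],
    head: "seen",
    complements: [%{determiner: "a", head: "mouse"}]
  }
}

DependencySketch.extract(clause)
# => [
#   {:nsubj, "seen", "cat"},
#   {:det, "cat", "the"},
#   {:amod, "cat", "big"},
#   {:aux, "seen", "has"},
#   {:obj, "seen", "mouse"},
#   {:det, "mouse", "a"}
# ]
```

Each phrase type contributes its own relations and recurses into nested phrases, so the output is a flat list even though the input is a tree.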

### Universal Dependencies Relations

Nasty uses the UD relation taxonomy:

**Core Arguments**:

- `nsubj` - nominal subject
- `obj` - direct object
- `iobj` - indirect object

**Non-Core Dependents**:

- `obl` - oblique nominal (prepositional complement to verb)
- `advmod` - adverbial modifier
- `aux` - auxiliary verb

**Nominal Dependents**:

- `det` - determiner
- `amod` - adjectival modifier
- `nmod` - nominal modifier (prepositional complement to noun)
- `case` - case marking (preposition)

**Clausal Dependents**:

- `acl` - adnominal clause (relative clause)
- `mark` - subordinating marker

**Coordination**:

- `conj` - conjunct
- `cc` - coordinating conjunction

### Extraction Process

#### 1. Sentence-Level Extraction

```elixir
extract(sentence)
# Extracts from main_clause + additional_clauses
```

#### 2. Clause-Level Extraction

For each clause:

1. **Subject Dependency**: `nsubj(predicate_head, subject_head)`
   - Extract head token from subject NP
   - Extract head token from predicate VP
   - Create dependency relation

2. **Predicate Dependencies**: Extract from VP (see below)

3. **Subordinator Dependency** (if present): `mark(predicate_head, subordinator)`

#### 3. Noun Phrase Dependencies

From NP structure:

1. **Determiner**: `det(head, determiner)`
   - `"the cat"` → `det(cat, the)`

2. **Adjectival Modifiers**: `amod(head, modifier)`
   - `"big cat"` → `amod(cat, big)`

3. **Post-modifiers**:
   - **PP**: `case(pp_object_head, preposition)` + `nmod(np_head, pp_object_head)`
     - `"cat on mat"` → `case(mat, on)` + `nmod(cat, mat)`
   - **Relative Clause**: `mark(clause_head, relativizer)` + `acl(np_head, clause_head)`
     - `"cat that sits"` → `mark(sits, that)` + `acl(cat, sits)`

#### 4. Verb Phrase Dependencies

From VP structure:

1. **Auxiliaries**: `aux(main_verb, auxiliary)`
   - `"is running"` → `aux(running, is)`

2. **Complements**:
   - **Direct Object NP**: `obj(verb, np_head)`
     - `"saw cat"` → `obj(saw, cat)`
   - **PP Complement**: `case(pp_object, preposition)` + `obl(verb, pp_object)`
     - `"sat on mat"` → `case(mat, on)` + `obl(sat, mat)`
   - **Adverb**: `advmod(verb, adverb)`
     - `"ran quickly"` → `advmod(ran, quickly)`

#### 5. Prepositional Phrase Dependencies

From PP structure:

1. **Case Marking**: `case(pp_object_head, preposition)`
2. **Oblique/Nominal Modifier**:
   - If governor is verb: `obl(governor, pp_object_head)`
   - If governor is noun: `nmod(governor, pp_object_head)`

### Dependency Structure

```elixir
%Dependency{
  relation: :nsubj | :obj | :det | ...,
  head: Token.t(),         # Governor token
  dependent: Token.t(),    # Dependent token
  span: span
}
```

### Example

```elixir
# Input: "The cat sat on the mat."
sentence = parse("The cat sat on the mat.")
dependencies = DependencyExtractor.extract(sentence)
# => [
#   %Dependency{relation: :det, head: "cat", dependent: "the"},
#   %Dependency{relation: :nsubj, head: "sat", dependent: "cat"},
#   %Dependency{relation: :case, head: "mat", dependent: "on"},
#   %Dependency{relation: :det, head: "mat", dependent: "the"},
#   %Dependency{relation: :obl, head: "sat", dependent: "mat"}
# ]
```

### Visualization

Dependencies can be visualized as a directed graph:

```mermaid
graph TD
    Root["sat (ROOT)"]
    Cat[cat]
    Mat[mat]
    The1[the]
    On[on]
    The2[the]
    Root -->|nsubj| Cat
    Root -->|obl| Mat
    Cat -->|det| The1
    Mat -->|case| On
    Mat -->|det| The2
```

## Integration Example

Complete pipeline from text to dependencies:

```elixir
alias Nasty.Language.English.{
  Tokenizer,
  POSTagger,
  Morphology,
  PhraseParser,
  SentenceParser,
  DependencyExtractor
}

# Input text
text = "The big cat sat on the mat."

# Step 1: Tokenization
{:ok, tokens} = Tokenizer.tokenize(text)
# => [Token("The"), Token("big"), Token("cat"), ...]

# Step 2: POS Tagging (choose model)
{:ok, tagged} = POSTagger.tag_pos(tokens, model: :neural)
# => [Token("The", :det), Token("big", :adj), Token("cat", :noun), ...]

# Step 3: Morphological Analysis
{:ok, analyzed} = Morphology.analyze(tagged)
# => [Token("The", :det, lemma: "the"), ...]

# Step 4: Sentence Parsing (includes phrase parsing internally)
{:ok, sentences} = SentenceParser.parse_sentences(analyzed)
# => [Sentence(...)]

# Step 5: Dependency Extraction
sentence = hd(sentences)
dependencies = DependencyExtractor.extract(sentence)
# => [Dependency(:det, "cat", "The"), ...]

# Result: Complete AST with dependencies
sentence
# => %Sentence{
#   main_clause: %Clause{
#     subject: %NounPhrase{
#       determiner: Token("The"),
#       modifiers: [Token("big")],
#       head: Token("cat")
#     },
#     predicate: %VerbPhrase{
#       head: Token("sat"),
#       complements: [
#         %PrepositionalPhrase{
#           head: Token("on"),
#           object: %NounPhrase{...}
#         }
#       ]
#     }
#   }
# }
```

## Performance Considerations

### Model Selection

**For Production**:

- Use neural models for highest accuracy
- Cache loaded models in memory
- Batch sentences for GPU acceleration (if available)

**For Development/Testing**:

- Use rule-based for fastest iteration
- HMM for good balance of speed and accuracy

### Optimization Tips

1. **Batch Processing**: Process multiple sentences together
2. **Model Caching**: Load models once, reuse across requests
3. **Lazy Loading**: Only load neural models when needed
4. **Parallel Processing**: Use `Task.async_stream` for multiple sentences

### Accuracy Benchmarks

Tested on Universal Dependencies English-EWT test set:

| Component | Accuracy |
|-----------|----------|
| Tokenization | 99.9% |
| Rule-based POS | 85% |
| HMM POS | 95% |
| Neural POS | 97-98% |
| Phrase Parsing | 87% (F1) |
| Dependency Extraction | 82% (UAS) |

## Further Reading

- [Universal Dependencies](https://universaldependencies.org/) - UD relations and guidelines
- [Penn Treebank POS Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
- [NimbleParsec Documentation](https://hexdocs.pm/nimble_parsec/)
- [Axon Neural Networks](https://hexdocs.pm/axon/)
- See `docs/ARCHITECTURE.md` for overall system design
- See `docs/NEURAL_MODELS.md` for neural network details