# Nasty User Guide

A comprehensive guide to using the Nasty NLP library for natural language processing in Elixir.

## Table of Contents

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Quick Start](#quick-start)
4. [Core Concepts](#core-concepts)
5. [Basic Text Processing](#basic-text-processing)
6. [Phrase and Sentence Parsing](#phrase-and-sentence-parsing)
7. [Semantic Analysis](#semantic-analysis)
8. [Advanced NLP Operations](#advanced-nlp-operations)
9. [Code Interoperability](#code-interoperability)
10. [Translation](#translation)
11. [AST Manipulation](#ast-manipulation)
12. [Visualization and Debugging](#visualization-and-debugging)
13. [Statistical & Neural Models](#statistical--neural-models)
14. [Performance Tips](#performance-tips)
15. [Troubleshooting](#troubleshooting)

## Introduction

Nasty (Natural Abstract Syntax Treey) is a comprehensive NLP library that treats natural language with the same rigor as programming languages. It provides a complete grammatical Abstract Syntax Tree (AST) for English, enabling sophisticated text analysis and manipulation.

### Key Features

- **Complete NLP Pipeline**: From tokenization to summarization
- **Grammar-First Design**: Linguistically rigorous AST structure
- **Statistical Models**: HMM POS tagger with ~95% accuracy
- **Bidirectional Code Conversion**: Natural language ↔ Elixir code
- **AST Utilities**: Traversal, querying, validation, and transformation
- **Visualization**: Export to DOT/Graphviz and JSON formats

## Installation

Add `nasty` to your dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:nasty, "~> 0.1.0"}
  ]
end
```

Then run:

```bash
mix deps.get
```

## Quick Start

Here's a simple example to get started:

```elixir
alias Nasty.Language.English

# Parse a sentence
text = "The quick brown fox jumps over the lazy dog."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)

# Extract information
alias Nasty.Utils.Query

# Count tokens
token_count = Query.count(document, :token)
# => 9

# Find all nouns
nouns = Query.find_by_pos(document, :noun)
# => [%Token{text: "fox", ...}, %Token{text: "dog", ...}]

# Render back to text
alias Nasty.Rendering.Text
{:ok, text} = Text.render(document)
# => "The quick brown fox jumps over the lazy dog."
```

## Core Concepts

### AST Structure

Nasty represents text as a hierarchical tree structure:

```
Document
└── Paragraph
    └── Sentence
        └── Clause
            ├── Subject (NounPhrase)
            │   ├── Determiner (Token)
            │   ├── Modifiers (Tokens)
            │   └── Head (Token)
            └── Predicate (VerbPhrase)
                ├── Auxiliaries (Tokens)
                ├── Head (Token)
                └── Complements (NounPhrases, etc.)
```

### Universal Dependencies

All POS tags and dependency relations follow the Universal Dependencies standard:

**POS Tags**: `noun`, `verb`, `adj`, `adv`, `det`, `adp`, `aux`, `cconj`, `sconj`, `pron`, `propn`, `num`, `punct`

**Dependencies**: `nsubj`, `obj`, `iobj`, `amod`, `advmod`, `det`, `case`, `acl`, `advcl`, `conj`, `cc`

### Language Markers

Every AST node carries a language identifier (`:en` for English), enabling future multilingual support.

## Basic Text Processing

### Tokenization

Split text into tokens (words and punctuation):

```elixir
alias Nasty.Language.English

text = "Hello, world! How are you?"
{:ok, tokens} = English.tokenize(text)

# Tokens include position information
Enum.each(tokens, fn token ->
  IO.puts("#{token.text} at #{inspect(token.span)}")
end)
```

### POS Tagging

Assign grammatical categories to tokens:

```elixir
# Rule-based tagging (fast, ~85% accuracy)
{:ok, tagged} = English.tag_pos(tokens)

# Statistical tagging (higher accuracy, ~95%)
{:ok, tagged} = English.tag_pos(tokens, model: :hmm)

# Neural tagging (best accuracy, 97-98%)
{:ok, tagged} = English.tag_pos(tokens, model: :neural)

# Ensemble (combines all models)
{:ok, tagged} = English.tag_pos(tokens, model: :ensemble)

# Inspect tags
Enum.each(tagged, fn token ->
  IO.puts("#{token.text}: #{token.pos_tag}")
end)
```

### Morphological Analysis

Extract lemmas and morphological features:

```elixir
alias Nasty.Language.English.Morphology

tagged
|> Enum.map(fn token ->
  lemma = Morphology.lemmatize(token.text, token.pos_tag)
  features = Morphology.extract_features(token.text, token.pos_tag)
  {token.text, lemma, features}
end)
|> Enum.each(fn {text, lemma, features} ->
  IO.puts("#{text} -> #{lemma} (#{inspect(features)})")
end)
```

## Phrase and Sentence Parsing

### Building the AST

Parse tokens into a complete AST:

```elixir
text = "The cat sat on the mat."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)

# Access structure
paragraph = List.first(document.paragraphs)
sentence = List.first(paragraph.sentences)

IO.puts("Sentence type: #{sentence.function}, #{sentence.structure}")
```

### Phrase Structure

Extract and analyze phrases:

```elixir
alias Nasty.Utils.Query

# Find all noun phrases
noun_phrases = Query.find_all(document, :noun_phrase)

Enum.each(noun_phrases, fn np ->
  det = if np.determiner, do: np.determiner.text, else: ""
  mods = Enum.map(np.modifiers, & &1.text) |> Enum.join(" ")
  head = np.head.text
  IO.puts("NP: #{det} #{mods} #{head}")
end)

# Find verb phrases
verb_phrases = Query.find_all(document, :verb_phrase)

Enum.each(verb_phrases, fn vp ->
  aux = Enum.map(vp.auxiliaries, & &1.text) |> Enum.join(" ")
  verb = vp.head.text
  IO.puts("VP: #{aux} #{verb}")
end)
```

### Sentence Structure Analysis

Analyze sentence complexity:

```elixir
document.paragraphs
|> Enum.flat_map(& &1.sentences)
|> Enum.each(fn sentence ->
  IO.puts("Function: #{sentence.function}")
  IO.puts("Structure: #{sentence.structure}")
  IO.puts("Clauses: #{1 + length(sentence.additional_clauses)}")
  IO.puts("")
end)
```

### Dependency Relations

Extract grammatical dependencies:

```elixir
alias Nasty.Language.English.DependencyExtractor

sentences = document.paragraphs |> Enum.flat_map(& &1.sentences)

Enum.each(sentences, fn sentence ->
  deps = DependencyExtractor.extract(sentence)

  Enum.each(deps, fn dep ->
    IO.puts("#{dep.head.text} --#{dep.relation}--> #{dep.dependent.text}")
  end)
end)
```

## Semantic Analysis

### Named Entity Recognition

Extract and classify named entities:

```elixir
alias Nasty.Language.English.EntityRecognizer

text = "John Smith works at Google in New York."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)

entities = EntityRecognizer.recognize(tagged)

Enum.each(entities, fn entity ->
  IO.puts("#{entity.text}: #{entity.type} (confidence: #{entity.confidence})")
end)
# => John Smith: PERSON (confidence: 0.8)
#    Google: ORG (confidence: 0.8)
#    New York: GPE (confidence: 0.7)
```

### Semantic Role Labeling

Identify who did what to whom:

```elixir
{:ok, document} = English.parse(tagged, semantic_roles: true)

document.semantic_frames
|> Enum.each(fn frame ->
  IO.puts("Predicate: #{frame.predicate}")

  Enum.each(frame.roles, fn role ->
    IO.puts("  #{role.type}: #{role.text}")
  end)
end)
# => Predicate: works
#      agent: John Smith
#      location: at Google
```

### Coreference Resolution

Link mentions across sentences:

```elixir
text = """
John Smith is a software engineer.
He works at Google.
The company is based in Mountain View.
"""

{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged, coreference: true)

document.coref_chains
|> Enum.each(fn chain ->
  IO.puts("Representative: #{chain.representative.text}")
  IO.puts("Mentions: #{Enum.map(chain.mentions, & &1.text) |> Enum.join(", ")}")
end)
# => Representative: John Smith
#    Mentions: John Smith, He
```

## Advanced NLP Operations

### Text Summarization

Extract key sentences from documents:

```elixir
alias Nasty.Language.English
alias Nasty.Rendering.Text

long_text = """
[Your long document here...]
"""

{:ok, tokens} = English.tokenize(long_text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)

# Extractive summarization - returns list of Sentence structs
summary_sentences = English.summarize(document, ratio: 0.3)
IO.puts("30% summary (#{length(summary_sentences)} sentences):")

# Render summary sentences to text
Enum.each(summary_sentences, fn sentence ->
  {:ok, text} = Text.render(sentence)
  IO.puts(text)
end)

# Fixed sentence count
summary_sentences = English.summarize(document, max_sentences: 3)

# MMR for reduced redundancy
summary_sentences = English.summarize(document,
  max_sentences: 3,
  method: :mmr,
  mmr_lambda: 0.5
)
```

### Question Answering

Answer questions from documents:

```elixir
text = """
John Smith is a software engineer at Google.
He graduated from Stanford University in 2010.
Google is headquartered in Mountain View, California.
"""

{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)

# Ask questions
questions = [
  "Who works at Google?",
  "Where is Google located?",
  "When did John Smith graduate?",
  "What is John Smith's profession?"
]

Enum.each(questions, fn question ->
  {:ok, answers} = English.answer_question(document, question)

  IO.puts("Q: #{question}")

  Enum.each(answers, fn answer ->
    IO.puts("A: #{answer.text} (confidence: #{answer.confidence})")
  end)

  IO.puts("")
end)
```

### Text Classification

Train and apply classifiers:

```elixir
alias Nasty.Language.English

# Prepare training data
positive_reviews = [
  "This product is amazing! Highly recommended.",
  "Excellent quality and fast shipping.",
  "Love it! Best purchase ever."
]

negative_reviews = [
  "Terrible product. Waste of money.",
  "Poor quality and slow delivery.",
  "Very disappointed with this purchase."
]

# Parse documents
training_data =
  Enum.map(positive_reviews, fn text ->
    {:ok, tokens} = English.tokenize(text)
    {:ok, tagged} = English.tag_pos(tokens)
    {:ok, doc} = English.parse(tagged)
    {doc, :positive}
  end) ++
    Enum.map(negative_reviews, fn text ->
      {:ok, tokens} = English.tokenize(text)
      {:ok, tagged} = English.tag_pos(tokens)
      {:ok, doc} = English.parse(tagged)
      {doc, :negative}
    end)

# Train classifier
model = English.train_classifier(training_data,
  features: [:bow, :lexical]
)

# Classify new text
test_text = "Great product, very satisfied!"
{:ok, tokens} = English.tokenize(test_text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)

{:ok, predictions} = English.classify(doc, model)
IO.inspect(predictions)
```

### Information Extraction

Extract structured information:

```elixir
text = """
Apple Inc. acquired Beats Electronics for $3 billion in 2014.
The company is headquartered in Cupertino, California.
Tim Cook serves as CEO of Apple.
"""

{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)

# Extract relations
{:ok, relations} = English.extract_relations(document)

Enum.each(relations, fn rel ->
  IO.puts("#{rel.subject.text} --#{rel.type}--> #{rel.object.text}")
end)

# Extract events
{:ok, events} = English.extract_events(document)

Enum.each(events, fn event ->
  IO.puts("Event: #{event.type}")
  IO.puts("Trigger: #{event.trigger}")
  IO.puts("Participants: #{inspect(event.participants)}")
end)

# Template-based extraction
alias Nasty.Language.English.TemplateExtractor

templates = [
  TemplateExtractor.employment_template(),
  TemplateExtractor.acquisition_template()
]

{:ok, results} = English.extract_templates(document, templates)

Enum.each(results, fn result ->
  IO.puts("Template: #{result.template}")
  IO.puts("Slots: #{inspect(result.slots)}")
end)
```

## Code Interoperability

### Natural Language to Code

Convert natural language commands to Elixir code:

```elixir
alias Nasty.Language.English

# Simple operations
{:ok, code} = English.to_code("Sort the list")
IO.puts(code)
# => "Enum.sort(list)"

{:ok, code} = English.to_code("Filter users where age is greater than 18")
IO.puts(code)
# => "Enum.filter(users, fn item -> item > 18 end)"

{:ok, code} = English.to_code("Map the numbers to double each one")
IO.puts(code)
# => "Enum.map(numbers, fn item -> item * 2 end)"

# Get the AST
{:ok, ast} = English.to_code_ast("Sort the numbers")
IO.inspect(ast)

# Recognize intent without generating code
{:ok, intent} = English.recognize_intent("Filter the list")
IO.inspect(intent)
```

### Code to Natural Language

Explain code in natural language:

```elixir
alias Nasty.Language.English

# Explain code strings
{:ok, explanation} = English.explain_code("Enum.sort(numbers)")
IO.puts(explanation)
# => "sort numbers"

{:ok, explanation} = English.explain_code("""
list
|> Enum.map(&(&1 * 2))
|> Enum.filter(&(&1 > 10))
|> Enum.sum()
""")
IO.puts(explanation)
# => "map list to each element times 2, then filter list where item is greater than 10, then sum list"

# Explain from AST
code_ast = quote do: x = a + b
{:ok, doc} = English.explain_code_to_document(code_ast)
{:ok, text} = Nasty.Rendering.Text.render(doc)
IO.puts(text)
```

## Translation

### AST-Based Translation

Translate documents between languages while preserving grammatical structure:

```elixir
alias Nasty.Language.{English, Spanish}
alias Nasty.Translation.Translator

# English to Spanish
text_en = "The quick cat runs in the garden."
{:ok, tokens_en} = English.tokenize(text_en)
{:ok, tagged_en} = English.tag_pos(tokens_en)
{:ok, doc_en} = English.parse(tagged_en)

# Translate document
{:ok, doc_es} = Translator.translate_document(doc_en, :es)

# Render Spanish text
alias Nasty.Rendering.Text
{:ok, text_es} = Text.render(doc_es)
IO.puts(text_es)
# => "El gato rápido corre en el jardín."

# Or translate text directly
{:ok, text_es} = Translator.translate("The quick cat runs.", :en, :es)
IO.puts(text_es)
# => "El gato rápido corre."

# Spanish to English
text_es = "La casa grande está en la ciudad."
{:ok, tokens_es} = Spanish.tokenize(text_es)
{:ok, tagged_es} = Spanish.tag_pos(tokens_es)
{:ok, doc_es} = Spanish.parse(tagged_es)

{:ok, doc_en} = Translator.translate_document(doc_es, :en)
{:ok, text_en} = Text.render(doc_en)
IO.puts(text_en)
# => "The big house is in the city."
```

### How Translation Works

The translation system operates on AST structures, not raw text:

1. **Parse source text** to AST
2. **Transform AST nodes** to target language structure
3. **Translate tokens** using lemma-to-lemma mapping with POS tags
4. **Apply morphological agreement** (gender, number, person)
5. **Apply word order rules** (language-specific)
6. **Render** target AST to text

### Morphological Agreement

The system automatically handles agreement:

```elixir
alias Nasty.Translation.Translator
alias Nasty.Rendering.Text

# English: "the cats"
# Spanish: "los gatos" (masculine plural determiner + noun)
{:ok, doc_en} = Nasty.parse("The cats.", language: :en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Text.render(doc_es)
# => "Los gatos."

# English: "the big houses"
# Spanish: "las casas grandes" (feminine plural, adjective after noun)
{:ok, doc_en} = Nasty.parse("The big houses.", language: :en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Text.render(doc_es)
# => "Las casas grandes."
```

### Word Order Transformations

Language-specific word order is automatically applied:

```elixir
alias Nasty.Translation.Translator
alias Nasty.Rendering.Text

# English: Adjective before noun
# Spanish: Most adjectives after noun
{:ok, doc_en} = Nasty.parse("The red car.", language: :en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Text.render(doc_es)
# => "El carro rojo." (car red)

# Some adjectives stay before noun
{:ok, doc_en} = Nasty.parse("The good book.", language: :en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Text.render(doc_es)
# => "El buen libro." (good stays before)
```

### Roundtrip Translation

Translations preserve grammatical structure for roundtrips:

```elixir
alias Nasty.Translation.Translator
alias Nasty.Rendering.Text

original = "The cat runs quickly."

# English -> Spanish -> English
{:ok, doc_en} = Nasty.parse(original, language: :en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, doc_en2} = Translator.translate_document(doc_es, :en)
{:ok, result} = Text.render(doc_en2)

IO.puts(original)
IO.puts(result)
# Original: "The cat runs quickly."
# Result: "The cat runs quickly." (or close equivalent)
```

### Supported Language Pairs

Currently supported:

- English ↔ Spanish
- English ↔ Catalan
- Spanish ↔ Catalan (via English)

### Custom Lexicons

Extend lexicons with domain-specific vocabulary:

```elixir
# Lexicons are in priv/translation/lexicons/
# Format: en_es.exs, es_en.exs, etc.

# Add entries in priv/translation/lexicons/en_es.exs:
%{
  noun: %{
    "widget" => "dispositivo",
    "gadget" => "aparato"
  },
  verb: %{
    "deploy" => "desplegar",
    "compile" => "compilar"
  }
}
```

### Translation Limitations

**Current limitations:**

- Idiomatic expressions may not translate well
- Complex verb tenses may need manual review
- Cultural context not preserved
- Ambiguous words use first lexicon entry

**Best practices:**

- Translate sentence by sentence for best results
- Review translations for idiomatic expressions
- Extend lexicons for domain-specific terms
- Use for technical/formal text rather than creative writing

## AST Manipulation

### Traversal

Walk the AST tree:

```elixir
alias Nasty.Utils.Traversal

# Count all tokens
token_count = Traversal.reduce(document, 0, fn
  %Nasty.AST.Token{}, acc -> acc + 1
  _, acc -> acc
end)

# Collect all verbs
verbs = Traversal.collect(document, fn
  %Nasty.AST.Token{pos_tag: :verb} -> true
  _ -> false
end)

# Find first question
question = Traversal.find(document, fn
  %Nasty.AST.Sentence{function: :interrogative} -> true
  _ -> false
end)

# Transform tree (lowercase all text)
lowercased = Traversal.map(document, fn
  %Nasty.AST.Token{} = token ->
    %{token | text: String.downcase(token.text)}
  node ->
    node
end)

# Breadth-first traversal
nodes = Traversal.walk_breadth(document, [], fn node, acc ->
  {:cont, [node | acc]}
end)
```

### Queries

High-level querying API:

```elixir
alias Nasty.Utils.Query

# Find by type
noun_phrases = Query.find_all(document, :noun_phrase)
sentences = Query.find_all(document, :sentence)

# Find by POS tag
nouns = Query.find_by_pos(document, :noun)
verbs = Query.find_by_pos(document, :verb)

# Find by text pattern
cats = Query.find_by_text(document, "cat")
words_starting_with_s = Query.find_by_text(document, ~r/^s/i)

# Find by lemma
runs = Query.find_by_lemma(document, "run") # Matches "run", "runs", "running"

# Extract entities
all_entities = Query.extract_entities(document)
people = Query.extract_entities(document, type: :PERSON)
organizations = Query.extract_entities(document, type: :ORG)

# Structural queries
subject = Query.find_subject(sentence)
verb = Query.find_main_verb(sentence)
objects = Query.find_objects(sentence)

# Count nodes
token_count = Query.count(document, :token)
sentence_count = Query.count(document, :sentence)

# Content vs function words
content_words = Query.content_words(document)
function_words = Query.function_words(document)

# Custom predicates
long_words = Query.filter(document, fn
  %Nasty.AST.Token{text: text} -> String.length(text) > 7
  _ -> false
end)
```

### Transformations

Modify AST structures:

```elixir
alias Nasty.Utils.Transform

# Case normalization
lowercased = Transform.normalize_case(document, :lower)
uppercased = Transform.normalize_case(document, :upper)
titled = Transform.normalize_case(document, :title)

# Remove punctuation
no_punct = Transform.remove_punctuation(document)

# Remove stop words
no_stops = Transform.remove_stop_words(document)

# Custom stop words
custom_stops = ["the", "a", "an"]
filtered = Transform.remove_stop_words(document, custom_stops)

# Lemmatize all tokens
lemmatized = Transform.lemmatize(document)

# Replace tokens
masked = Transform.replace_tokens(
  document,
  fn token -> token.pos_tag == :propn end,
  fn token -> %{token | text: "[MASK]"} end
)

# Transformation pipelines
processed = Transform.pipeline(document, [
  &Transform.normalize_case(&1, :lower),
  &Transform.remove_punctuation/1,
  &Transform.remove_stop_words/1,
  &Transform.lemmatize/1
])

# Round-trip testing
{:ok, transformed} = Transform.round_trip_test(document, fn doc ->
  Transform.normalize_case(doc, :lower)
end)
```

### Validation

Ensure AST integrity:

```elixir
alias Nasty.Utils.Validator

# Validate structure
case Validator.validate(document) do
  {:ok, doc} -> IO.puts("Valid!")
  {:error, reason} -> IO.puts("Invalid: #{reason}")
end

# Check validity (boolean)
if Validator.valid?(document) do
  IO.puts("Document is valid")
end

# Validate spans
case Validator.validate_spans(document) do
  :ok -> IO.puts("Spans are consistent")
  {:error, reason} -> IO.puts("Span error: #{reason}")
end

# Validate language consistency
case Validator.validate_language(document) do
  :ok -> IO.puts("Language is consistent")
  {:error, reason} -> IO.puts("Language error: #{reason}")
end

# Validate and raise
Validator.validate!(document) # Raises on error
```

## Visualization and Debugging

### Pretty Printing

Debug AST structures:

```elixir
alias Nasty.Rendering.PrettyPrint

# Indented output
IO.puts(PrettyPrint.print(document))

# With colors
IO.puts(PrettyPrint.print(document, color: true))

# Limit depth
IO.puts(PrettyPrint.print(document, max_depth: 3))

# Show spans
IO.puts(PrettyPrint.print(document, show_spans: true))

# Tree-style output
IO.puts(PrettyPrint.tree(document))

# Statistics
IO.puts(PrettyPrint.stats(document))
```

### Graphviz Visualization

Export to DOT format for visual rendering:

```elixir
alias Nasty.Rendering.Visualization

# Parse tree
dot = Visualization.to_dot(document, type: :parse_tree)
File.write("parse_tree.dot", dot)
# Then: dot -Tpng parse_tree.dot -o parse_tree.png

# Dependency graph
deps_dot = Visualization.to_dot(sentence,
  type: :dependencies,
  rankdir: "LR"
)
File.write("dependencies.dot", deps_dot)

# Entity graph
entity_dot = Visualization.to_dot(document, type: :entities)
File.write("entities.dot", entity_dot)

# Custom options
dot = Visualization.to_dot(document,
  type: :parse_tree,
  rankdir: "TB",
  show_pos_tags: true,
  show_spans: false
)
```

### JSON Export

Export for web visualization:

```elixir
alias Nasty.Rendering.Visualization

# Export to JSON (for d3.js, etc.)
json = Visualization.to_json(document)
File.write("document.json", json)

# Can be loaded in JavaScript:
# fetch('document.json')
#   .then(r => r.json())
#   .then(data => visualize(data))
```

### Text Rendering

Convert AST back to text:

```elixir
alias Nasty.Rendering.Text

# Basic rendering
{:ok, text} = Text.render(document)

# Or use language-specific rendering
alias Nasty.Language.English
{:ok, text} = English.render(document)

# For specific languages
alias Nasty.Language.{Spanish, Catalan}
{:ok, text_es} = Spanish.render(document)
{:ok, text_ca} = Catalan.render(document)
```

## Statistical & Neural Models

### Using Pretrained Models

Load and use statistical and neural models:

```elixir
alias Nasty.Language.English

# Automatic loading (looks in priv/models/)
{:ok, tokens} = English.tokenize(text)

# HMM statistical model (~95% accuracy)
{:ok, tagged} = English.tag_pos(tokens, model: :hmm)

# Neural model (97-98% accuracy)
{:ok, tagged} = English.tag_pos(tokens, model: :neural)

# Ensemble mode (combines neural + HMM + rule-based)
{:ok, tagged} = English.tag_pos(tokens, model: :ensemble)

# PCFG statistical parsing
{:ok, document} = English.parse(tagged, model: :pcfg)

# CRF-based named entity recognition
alias Nasty.Language.English.EntityRecognizer
entities = EntityRecognizer.recognize(tagged, model: :crf)
```

### Training Custom Models

Train on your own data:

```bash
# Download Universal Dependencies data
wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4611/ud-treebanks-v2.10.tgz

# Extract
tar -xzf ud-treebanks-v2.10.tgz

# Train HMM POS tagger (fast, 95% accuracy)
mix nasty.train.pos \
  --corpus ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-train.conllu \
  --test ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-test.conllu \
  --output priv/models/en/my_hmm_model.model

# Train neural POS tagger (slower, 97-98% accuracy)
mix nasty.train.neural_pos \
  --corpus ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-train.conllu \
  --output priv/models/en/my_neural_model.axon \
  --epochs 10 \
  --batch-size 32

# Train PCFG parser
mix nasty.train.pcfg \
  --corpus ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-train.conllu \
  --test ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-test.conllu \
  --output priv/models/en/my_pcfg.model \
  --smoothing 0.001

# Train CRF for named entity recognition
mix nasty.train.crf \
  --corpus data/ner_train.conllu \
  --test data/ner_test.conllu \
  --output priv/models/en/my_crf_ner.model \
  --task ner \
  --iterations 100

# Evaluate models
mix nasty.eval.pos \
  --model priv/models/en/my_hmm_model.model \
  --test ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-test.conllu

mix nasty.eval \
  --model priv/models/en/my_pcfg.model \
  --test ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-test.conllu \
  --type pcfg

mix nasty.eval \
  --model priv/models/en/my_crf_ner.model \
  --test data/ner_test.conllu \
  --type crf \
  --task ner
```

For detailed training instructions:

- Neural models: [TRAINING_NEURAL.md](TRAINING_NEURAL.md)
- PCFG and CRF models: [STATISTICAL_MODELS.md](STATISTICAL_MODELS.md)

### Model Management

```bash
# List models
mix nasty.models list

# Inspect model
mix nasty.models inspect priv/models/en/pos_hmm_v1.model

# Compare models
mix nasty.models compare model1.model model2.model
```

## Performance Tips

### Batch Processing

Process multiple texts efficiently:

```elixir
alias Nasty.Language.English

texts = [
  "First sentence.",
  "Second sentence.",
  "Third sentence."
]

# Process in parallel
results =
  texts
  |> Task.async_stream(fn text ->
    with {:ok, tokens} <- English.tokenize(text),
         {:ok, tagged} <- English.tag_pos(tokens),
         {:ok, doc} <- English.parse(tagged) do
      {:ok, doc}
    end
  end, max_concurrency: System.schedulers_online())
  |> Enum.map(fn {:ok, result} -> result end)
```

### Selective Parsing

Skip expensive operations when not needed:

```elixir
# Basic parsing (no semantic analysis)
{:ok, doc} = English.parse(tagged)

# With semantic roles
{:ok, doc} = English.parse(tagged, semantic_roles: true)

# With coreference
{:ok, doc} = English.parse(tagged, coreference: true)

# Full pipeline
{:ok, doc} = English.parse(tagged,
  semantic_roles: true,
  coreference: true
)
```

### Caching

Cache parsed documents:

```elixir
defmodule MyApp.DocumentCache do
  use Agent

  alias Nasty.Language.English

  def start_link(_) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  def get_or_parse(text) do
    Agent.get_and_update(__MODULE__, fn cache ->
      case Map.fetch(cache, text) do
        {:ok, doc} ->
          {doc, cache}

        :error ->
          {:ok, tokens} = English.tokenize(text)
          {:ok, tagged} = English.tag_pos(tokens)
          {:ok, doc} = English.parse(tagged)
          {doc, Map.put(cache, text, doc)}
      end
    end)
  end
end
```

## Troubleshooting

### Common Issues

**Issue**: Parsing fails with long sentences

**Solution**: Break into smaller sentences or increase timeout

```elixir
# Split long text into sentences, then run the full pipeline on each one
sentences = String.split(text, ~r/(?<=[.!?])\s+/, trim: true)
Enum.map(sentences, fn s ->
  with {:ok, tokens} <- English.tokenize(s), {:ok, tagged} <- English.tag_pos(tokens),
       do: English.parse(tagged)
end)
```

**Issue**: Entity recognition misses entities

**Solution**: Train custom NER or add to dictionary

```elixir
# Add custom entity patterns
alias Nasty.Language.English.EntityRecognizer

# This is conceptual - check actual API
EntityRecognizer.add_pattern(:ORG, ~r/\b[A-Z][a-z]+ Inc\.\b/)
```

**Issue**: POS tagging accuracy is low

**Solution**: Use statistical model or ensemble

```elixir
# Use HMM model
{:ok, tagged} = English.tag_pos(tokens, model: :hmm)

# Or ensemble
{:ok, tagged} = English.tag_pos(tokens, model: :ensemble)
```

### Debugging Tips

1. **Visualize the AST**: Use pretty printing to understand structure
2. **Check spans**: Ensure position tracking is correct
3. **Validate**: Run validation to catch structural issues
4. **Incremental parsing**: Test each pipeline stage separately

```elixir
alias Nasty.Language.English
alias Nasty.Rendering.PrettyPrint

# Debug pipeline stage by stage
{:ok, tokens} = English.tokenize(text)
IO.inspect(tokens, label: "Tokens")

{:ok, tagged} = English.tag_pos(tokens)
IO.inspect(tagged, label: "Tagged")

{:ok, doc} = English.parse(tagged)
IO.puts(PrettyPrint.tree(doc))
```

### Getting Help

- Check the [API documentation](https://hexdocs.pm/nasty/)
- Review [PLAN.md](../PLAN.md) for architecture details
- See [examples/](../examples/) for working code
- Report issues on [GitHub](https://github.com/am-kantox/nasty/issues)

## Next Steps

- Explore the [examples/](../examples/) directory for more demos
- Read [STATISTICAL_MODELS.md](STATISTICAL_MODELS.md) for ML details
- Check [TRAINING_GUIDE.md](TRAINING_GUIDE.md) to train custom models
- See [INTEROP_GUIDE.md](INTEROP_GUIDE.md) for code conversion details

Happy parsing!
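
One more sketch for the "test each pipeline stage separately" tip under Debugging Tips. The `{:ok, value} = ...` matches used throughout this guide raise a `MatchError` without saying which stage failed, so a small runner that names the failing stage can help. `PipelineDebug` is a hypothetical helper, not part of Nasty, and the stages below are stand-ins so the snippet runs without the library installed.

```elixir
defmodule PipelineDebug do
  @moduledoc "Hypothetical helper (not part of Nasty): runs named stages in order."

  # `stages` is a keyword list of {name, fun}. Each fun receives the previous
  # result and must return {:ok, value} or {:error, reason}.
  # Returns {:ok, final_value} or {:error, stage_name, reason}.
  def run(input, stages) do
    Enum.reduce_while(stages, {:ok, input}, fn {name, fun}, {:ok, acc} ->
      case fun.(acc) do
        {:ok, value} -> {:cont, {:ok, value}}
        {:error, reason} -> {:halt, {:error, name, reason}}
      end
    end)
  end
end

# Stand-in stages (assumptions for illustration only).
stages = [
  tokenize: fn text -> {:ok, String.split(text)} end,
  tag: fn tokens -> {:ok, Enum.map(tokens, &{&1, :unknown})} end,
  parse: fn _tagged -> {:error, :not_implemented} end
]

IO.inspect(PipelineDebug.run("The cat sat", stages))
# => {:error, :parse, :not_implemented}
```

With Nasty itself, the stage list would presumably be `tokenize: &English.tokenize/1, tag: &English.tag_pos/1, parse: &English.parse/1`, assuming those functions are callable with one argument as in the examples above.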