# Grammar Customization Guide

This document explains how to customize and extend Nasty's grammar rules by creating external grammar resource files.

## Overview

Starting with version 0.2.0, Nasty externalizes grammar rules from hardcoded Elixir modules into configurable `.exs` resource files. This allows you to:

- Customize existing grammar rules without modifying source code
- Create domain-specific grammar variants (e.g., legal, medical, technical)
- Add support for new languages
- A/B test different parsing strategies
- Share grammar rule sets across projects

## Architecture

Grammar rules are stored as Elixir term files (`.exs`) in:

```
priv/languages/{language_code}/grammars/{rule_type}.exs
```

For variants (e.g., formal, informal, technical):

```
priv/languages/{language_code}/variants/{variant_name}/{rule_type}.exs
```

### Language Codes

- English: `en` or `english`
- Spanish: `es` or `spanish`
- Catalan: `ca` or `catalan` (future)

### Rule Types

Each language can have the following grammar rule files:

1. `phrase_rules.exs` - Phrase structure patterns (NP, VP, PP, AdjP, AdvP)
2. `dependency_rules.exs` - Universal Dependencies relations and extraction rules
3. `coordination_rules.exs` - Coordinating conjunctions and coordination patterns
4. `subordination_rules.exs` - Subordinating conjunctions and subordinate clause patterns

## Grammar Loader API

### Loading Grammar Rules

```elixir
alias Nasty.Language.GrammarLoader

# Load default grammar rules
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules)

# Load with variant
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules, variant: "formal")

# Force reload (bypass cache)
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules, force_reload: true)
```

### Cache Management

```elixir
# Clear all cached grammar
GrammarLoader.clear_cache()

# Clear specific cached rules
GrammarLoader.clear_cache(:en, :phrase_rules, :default)
```

### Direct File Loading

```elixir
# Load from custom path
{:ok, rules} = GrammarLoader.load_file("/path/to/custom_rules.exs")
```

## Creating Grammar Files

### File Structure

Grammar files are Elixir term files that evaluate to a map:

```elixir
%{
  # Top-level keys define rule categories
  rule_category_1: [...],
  rule_category_2: %{...},

  # Metadata
  notes: %{
    key: "description"
  }
}
```

### Example: Simple Phrase Rules

Create `priv/languages/en/grammars/custom_phrase_rules.exs`:

```elixir
%{
  # Noun phrase patterns
  noun_phrases: [
    # Simple NP: Det + Noun
    {:np, [:det, :noun]},

    # NP with adjective: Det + Adj + Noun
    {:np, [:det, :adj, :noun]},

    # NP with PP: Det + Noun + PP
    {:np, [:det, :noun, :pp]}
  ],

  # Verb phrase patterns
  verb_phrases: [
    # Simple VP: just Verb
    {:vp, [:verb]},

    # VP with object: Verb + NP
    {:vp, [:verb, :np]},

    # VP with auxiliary: Aux + Verb
    {:vp, [:aux, :verb]}
  ],

  notes: %{
    version: "1.0.0",
    author: "Your Name",
    description: "Custom phrase rules for domain-specific parsing"
  }
}
```

## English Grammar Reference

### Phrase Rules (`phrase_rules.exs`)

See `priv/languages/en/grammars/phrase_rules.exs` for the complete reference.

Key sections:

```elixir
%{
  noun_phrases: [
    # List of NP patterns
    {:np, [:det, :noun]},
    {:np, [:det, :adj, :noun]},
    # ...
  ],

  verb_phrases: [
    # List of VP patterns
    {:vp, [:verb]},
    {:vp, [:aux, :verb, :np]},
    # ...
  ],

  prepositional_phrases: [
    # PP patterns
    {:pp, [:prep, :np]},
    # ...
  ],

  adjectival_phrases: [
    # AdjP patterns
    {:adjp, [:adv, :adj]},
    # ...
  ],

  adverbial_phrases: [
    # AdvP patterns
    {:advp, [:adv]},
    # ...
  ],

  relative_clauses: [
    # Relative clause patterns
    {:relative_clause, [:relative_marker, :clause]},
    # ...
  ],

  special_rules: [
    # Special handling rules
    {:comparative_than, :pseudo_prep},
    # ...
  ]
}
```

### Dependency Rules (`dependency_rules.exs`)

See `priv/languages/en/grammars/dependency_rules.exs` for the complete reference.

Key sections:

```elixir
%{
  core_arguments: [
    # Subject, object, complements
    %{
      relation: :nsubj,
      description: "Nominal subject",
      head_pos: [:verb],
      dependent_pos: [:noun, :propn, :pron],
      example: "The cat sleeps → nsubj(sleeps, cat)"
    },
    # ...
  ],

  nominal_dependents: [
    # Determiners, modifiers
    %{relation: :det, ...},
    %{relation: :amod, ...},
    # ...
  ],

  function_words: [
    # Auxiliaries, copulas, markers
    %{relation: :aux, ...},
    # ...
  ],

  extraction_priorities: [
    # Order of dependency extraction
    :nsubj,
    :obj,
    :det,
    :amod,
    # ...
  ]
}
```

### Coordination Rules (`coordination_rules.exs`)

Key sections:

```elixir
%{
  coordinating_conjunctions: [
    %{
      conjunction: "and",
      type: :copulative,
      example: "cats and dogs"
    },
    # ...
  ],

  coordination_patterns: [
    %{
      pattern: :np_coordination,
      structure: "NP CCONJ NP",
      example: "cats and dogs"
    },
    # ...
  ],

  special_cases: [
    # Correlative conjunctions, etc.
    %{
      type: :correlative,
      patterns: [
        %{pair: ["both", "and"], example: "both cats and dogs"},
        # ...
      ]
    }
  ]
}
```

### Subordination Rules (`subordination_rules.exs`)

Key sections:

```elixir
%{
  subordinating_conjunctions: [
    %{
      conjunction: "because",
      type: :causal,
      example: "I stayed because it rained"
    },
    # ...
  ],

  relative_markers: [
    %{
      marker: "who",
      type: :relative_pronoun,
      example: "the person who came"
    },
    # ...
  ],

  subordinate_clause_types: [
    %{
      type: :adverbial,
      dependency_relation: :advcl,
      subtypes: [:temporal, :causal, :conditional, ...]
    },
    # ...
  ]
}
```

## Spanish Grammar Reference

Spanish grammar files follow the same structure but include Spanish-specific features:

- Post-nominal adjectives: `la casa roja` (the red house)
- Pro-drop: null subjects allowed
- Flexible word order: SVO, VSO, VOS
- Clitic pronouns: `dámelo` (give-me-it)
- Personal 'a': `Veo a Juan` (I see Juan)
- Two copulas: `ser` vs. `estar`
- Phonetic variants: `y`→`e`, `o`→`u` before vowels

See files in `priv/languages/es/grammars/` for complete Spanish grammar.

## Creating Domain-Specific Variants

### Example: Technical English

Create `priv/languages/en/variants/technical/phrase_rules.exs`:

```elixir
%{
  # Inherit base rules and add technical-specific patterns
  noun_phrases: [
    # Standard patterns
    {:np, [:det, :noun]},

    # Technical compound nouns (e.g., "TCP/IP protocol")
    {:np, [:propn, :noun]},
    {:np, [:propn, :sym, :propn, :noun]},

    # Noun phrases with technical modifiers
    {:np, [:num, {:unit, [:noun]}, :noun]},  # "5 GB memory"

    # Multi-word technical terms
    {:np, [{:many, :noun}]}  # "machine learning model"
  ],

  verb_phrases: [
    # Standard patterns
    {:vp, [:verb, :np]},

    # Technical action verbs (instantiate, serialize, etc.)
    {:vp, [:tech_verb, :np, :pp]},

    # Passive constructions common in technical writing
    {:vp, [:aux, :verb, :pp]}
  ],

  notes: %{
    domain: "technical",
    use_case: "Software documentation, API specs, technical papers"
  }
}
```

### Example: Legal English

```elixir
%{
  noun_phrases: [
    # Legal entities
    {:np, [:det, :legal_entity]},  # "the plaintiff", "the defendant"

    # Complex legal terms
    {:np, [:det, :adj, :legal_term, :pp]},  # "the aforementioned contractual obligation"

    # References (Section X, Article Y)
    {:np, [:legal_ref_type, :num]}  # "Section 5"
  ],

  subordination_patterns: [
    # Legal conditionals (provided that, in the event that)
    {:conditional, :multiword_legal_conj}
  ],

  notes: %{
    domain: "legal",
    use_case: "Contracts, legislation, court documents"
  }
}
```

## Using Custom Grammar in Code

### Option 1: Load and Use Directly

```elixir
alias Nasty.Language.GrammarLoader

# Load custom grammar
{:ok, custom_phrase_rules} = GrammarLoader.load(:en, :custom_phrase_rules)

# Use in your parser
custom_np_patterns = custom_phrase_rules.noun_phrases
# Process with custom patterns...
```

### Option 2: Extend Parser Module

```elixir
defmodule MyApp.CustomParser do
  alias Nasty.Language.GrammarLoader

  def parse_technical_text(text) do
    # Load technical variant
    {:ok, rules} = GrammarLoader.load(:en, :phrase_rules, variant: "technical")

    # Parse using custom rules
    # ... your parsing logic using rules ...
  end
end
```

### Option 3: Runtime Configuration

```elixir
# In config/config.exs
config :nasty, default_grammar_variant: "technical"

# In your code
alias Nasty.Language.GrammarLoader

variant = Application.get_env(:nasty, :default_grammar_variant, :default)
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules, variant: variant)
```

## Grammar Validation

The grammar loader validates that every file evaluates to a map:

```elixir
# Valid
%{
  rules: [...],
  notes: %{}
}

# Invalid - will raise error
[1, 2, 3]  # Not a map
```

For more complex validation, extend `GrammarLoader.validate_rules/1`.

## Best Practices

### 1. Start with Base Grammar

Copy an existing grammar file and modify it rather than starting from scratch:

```bash
# Create the variant directory first; cp fails if it does not exist
mkdir -p priv/languages/en/variants/custom
cp priv/languages/en/grammars/phrase_rules.exs \
   priv/languages/en/variants/custom/phrase_rules.exs
```

### 2. Document Your Rules

Include comprehensive notes in your grammar files:

```elixir
%{
  rules: [...],

  notes: %{
    version: "1.0.0",
    author: "Team Name",
    created: "2026-01-08",
    description: "Custom grammar for medical text parsing",
    changes: [
      "Added medical entity patterns",
      "Extended VP patterns for medical procedures"
    ],
    examples: [
      "The patient underwent cardiac catheterization",
      "Diagnosis: Type 2 diabetes mellitus"
    ]
  }
}
```

### 3. Test Your Grammar

Create tests for your custom grammar:

```elixir
defmodule MyApp.CustomGrammarTest do
  use ExUnit.Case
  alias Nasty.Language.GrammarLoader

  test "custom grammar loads successfully" do
    assert {:ok, rules} = GrammarLoader.load(:en, :custom_rules)
    assert is_map(rules)
    assert Map.has_key?(rules, :noun_phrases)
  end

  test "custom grammar includes domain patterns" do
    {:ok, rules} = GrammarLoader.load(:en, :custom_rules, variant: "medical")

    assert Enum.any?(rules.noun_phrases, fn pattern ->
      # Check for medical-specific patterns. :medical_entity is a
      # placeholder tag; replace it with one your grammar defines.
      match?({:np, [:det, :medical_entity]}, pattern)
    end)
  end
end
```

### 4. Version Your Grammar

Track grammar versions for reproducibility:

```elixir
%{
  metadata: %{
    version: "2.1.0",
    compatible_with: "nasty >= 0.2.0"
  },
  # ... rules ...
}
```

### 5. Keep Grammar Files Focused

Separate concerns across different rule types:

- Phrase structure → `phrase_rules.exs`
- Dependencies → `dependency_rules.exs`
- Coordination → `coordination_rules.exs`
- Subordination → `subordination_rules.exs`

Don't mix all rules into one file.
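
Several of these practices (testing, versioning, keeping patterns well formed) can be checked mechanically before a grammar is shared. The following is a minimal sketch rather than part of Nasty: `GrammarCheck` and its `check/3` function are hypothetical names, and the sketch assumes that phrase patterns have the `{:label, [parts]}` shape and that `metadata.compatible_with` uses the `"nasty >= 0.2.0"` form shown in this guide. It relies only on Elixir's standard `Version` module.

```elixir
defmodule GrammarCheck do
  # Hypothetical helper for illustration; not part of Nasty's API.

  # Returns :ok, or {:error, problems} where problems is a list of strings.
  def check(rules, pattern_keys, nasty_version) when is_map(rules) do
    problems =
      pattern_problems(rules, pattern_keys) ++ version_problems(rules, nasty_version)

    if problems == [], do: :ok, else: {:error, problems}
  end

  def check(_rules, _pattern_keys, _nasty_version) do
    {:error, ["grammar must be a map"]}
  end

  # Every entry under the given keys must look like {:label, [part, ...]}.
  defp pattern_problems(rules, keys) do
    for key <- keys,
        pattern <- Map.get(rules, key, []),
        not well_formed?(pattern) do
      "#{key}: malformed pattern #{inspect(pattern)}"
    end
  end

  defp well_formed?({label, parts}) when is_atom(label) and is_list(parts), do: true
  defp well_formed?(_other), do: false

  # Reads metadata.compatible_with (assumed form: "nasty >= 0.2.0").
  # A grammar without this key is treated as compatible.
  defp version_problems(rules, nasty_version) do
    case get_in(rules, [:metadata, :compatible_with]) do
      nil ->
        []

      "nasty " <> requirement ->
        case Version.parse_requirement(requirement) do
          {:ok, parsed} ->
            if Version.match?(nasty_version, parsed),
              do: [],
              else: ["requires nasty #{requirement}, running #{nasty_version}"]

          :error ->
            ["unparseable requirement: #{inspect(requirement)}"]
        end

      other ->
        ["unexpected compatible_with value: #{inspect(other)}"]
    end
  end
end

# Usage: the second noun phrase pattern is deliberately malformed.
rules = %{
  noun_phrases: [{:np, [:det, :noun]}, {:np, :noun}],
  metadata: %{compatible_with: "nasty >= 0.2.0"}
}

IO.inspect(GrammarCheck.check(rules, [:noun_phrases], "0.2.0"))
```

The caller passes the list of keys to inspect because not every category uses the same tuple shape; `special_rules`, for example, holds pairs of atoms. In a test suite, the result of `check/3` could be asserted on directly after loading the rules.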

## Performance Considerations

### Caching

Grammar files are cached in ETS after first load:

```elixir
# First load: reads from disk
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules)  # ~5ms

# Subsequent loads: from cache
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules)  # ~0.1ms
```

Clear the cache when updating grammar files during development:

```elixir
GrammarLoader.clear_cache()
```

### File Size

Keep grammar files under 1MB for fast loading. If needed, split into multiple files:

```
phrase_rules_np.exs    # Noun phrase patterns
phrase_rules_vp.exs    # Verb phrase patterns
phrase_rules_pp.exs    # Prepositional phrase patterns
```

## Troubleshooting

### Grammar File Not Found

```
Grammar file not found: .../en/grammars/missing_rules.exs, using empty rules
```

**Solution**: Check that the file exists and the path is correct. Grammar files must be in `priv/languages/{lang}/grammars/`.

### Invalid Grammar Format

```
** (ArgumentError) Grammar rules must be a map, got: [...]
```

**Solution**: Ensure the file evaluates to a map:

```elixir
# Correct
%{rules: [...]}

# Wrong
[...]
```

### Compilation Errors

```
** (SyntaxError) invalid syntax
```

**Solution**: Grammar files must be valid Elixir. Test with:

```bash
elixir priv/languages/en/grammars/your_rules.exs
```

### Cache Issues

If changes to grammar files aren't reflected:

```elixir
# Clear cache
Nasty.Language.GrammarLoader.clear_cache()

# Or force reload
{:ok, rules} = Nasty.Language.GrammarLoader.load(:en, :phrase_rules, force_reload: true)
```

## Examples Repository

See working examples in the main repository:

- English grammar: `priv/languages/en/grammars/`
- Spanish grammar: `priv/languages/es/grammars/`
- Test fixtures: `test/fixtures/grammars/`

## Contributing Custom Grammars

To contribute grammar variants to the Nasty project:

1. Create grammar files following the structure above
2. Add tests demonstrating the grammar works
3. Document the use case and domain
4. Submit a pull request to the main repository

## Further Reading

- [PARSING_GUIDE.md](PARSING_GUIDE.md) - Understanding the parsing pipeline
- [ENGLISH_GRAMMAR.md](languages/ENGLISH_GRAMMAR.md) - English grammar specification
- [ARCHITECTURE.md](ARCHITECTURE.md) - System architecture overview
- Universal Dependencies: https://universaldependencies.org/