# Catalan Language Support Comprehensive Catalan language support for the Nasty NLP library. ## Status **Implemented (Phases 1-7):** - Tokenization with Catalan-specific features - POS tagging with Universal Dependencies tagset - Morphological analysis and lemmatization - Grammar resource files (phrase and dependency rules) - Phrase and sentence parsing (NP, VP, PP, clause detection) - Dependency extraction (Universal Dependencies relations) - Named entity recognition (PERSON, LOCATION, ORGANIZATION, DATE, MONEY, PERCENT) **Pending (Phase 8):** - Text summarization (stub implementation) - Coreference resolution - Semantic role labeling ## Features ### Tokenization The Catalan tokenizer handles all language-specific features: - **Interpunct (l·l)**: Kept as single token - Example: `"Col·laborar"` → `["Col·laborar"]` - Common in compound words: col·laborar, intel·ligent, il·lusió - **Apostrophe Contractions**: Separated as distinct tokens - Determiners: `l'` (el/la) - Prepositions: `d'` (de), `s'` (es/se) - Pronouns: `n'` (en), `m'` (me), `t'` (te) - Example: `"L'home d'or"` → `["L'", "home", "d'", "or"]` - **Article Contractions**: Recognized as single tokens - `del` = de + el - `al` = a + el - `pel` = per + el - Example: `"Vaig al mercat"` → `["Vaig", "al", "mercat"]` - **Diacritics**: Complete support for all 10 Catalan diacritics - Vowels: à, è, é, í, ï, ò, ó, ú, ü - Consonant: ç (ce trencada) - Unicode NFC normalization ### POS Tagging Rule-based POS tagger using Universal Dependencies tagset: - **Comprehensive Lexicon**: 300+ word forms - Articles, pronouns, prepositions - Common verbs, nouns, adjectives, adverbs - Function words and particles - **Verb Conjugations**: All tenses supported - Present, preterite, imperfect, future, conditional - Subjunctive mood patterns - Gerunds and past participles - **Context-Based Disambiguation** - Post-nominal adjective detection - Determiner-noun sequences - Preposition-noun patterns ### Morphology Morphological analyzer with lemmatization: - **Verb Classes**: 3 conjugation classes - `-ar` verbs: parlar → parlar, parlant → parlar - `-re` verbs: viure → viure, vivint → viure - `-ir` verbs: dormir → dormir, dormint → dormir - **Irregular Verbs**: Dictionary of 100+ forms - ser, estar, haver (auxiliaries) - anar, fer, dir, poder, voler (common verbs) - tenir, venir, veure (irregulars) - **Morphological Features** - Gender: masculine/feminine - Number: singular/plural - Tense: present, past, future, conditional, imperfect - Mood: indicative, conditional, subjunctive - Aspect: progressive, perfective ### Grammar Rules Externalized grammar files in `priv/languages/ca/grammars/`: **Phrase Rules** (`phrase_rules.exs`): - Noun phrases with post-nominal adjectives - Verb phrases with flexible word order - Prepositional, adjectival, adverbial phrases - Relative clause patterns - Special rules for Catalan-specific features **Dependency Rules** (`dependency_rules.exs`): - Universal Dependencies v2 relations - Core arguments (subject, object, indirect object) - Non-core dependents (oblique, adverbials) - Function word relations - Catalan-specific patterns (clitics, pro-drop) ## Usage ```elixir alias Nasty.Language.Catalan # Complete pipeline text = "El gat dorm al sofà." {:ok, tokens} = Catalan.tokenize(text) {:ok, tagged} = Catalan.tag_pos(tokens) {:ok, document} = Catalan.parse(tagged) # Extract entities alias Nasty.Language.Catalan.EntityRecognizer {:ok, entities} = EntityRecognizer.recognize(tagged) # => [%Entity{type: :person, text: "Joan Garcia", ...}] # Extract dependencies alias Nasty.Language.Catalan.DependencyExtractor sentences = document.paragraphs |> Enum.flat_map(& &1.sentences) deps = Enum.flat_map(sentences, &DependencyExtractor.extract/1) # => [%Dependency{relation: :nsubj, head: "dorm", dependent: "gat", ...}] # Individual components {:ok, tokens} = Catalan.Tokenizer.tokenize("El gat dorm al sofà.") {:ok, tagged} = Catalan.POSTagger.tag_pos(tokens) {:ok, analyzed} = Catalan.Morphology.analyze(tagged) # Access lemmas and features Enum.each(analyzed, fn token -> IO.puts("#{token.text} [#{token.pos_tag}] → #{token.lemma}") end) ``` ## Linguistic Features ### Word Order Catalan allows flexible word order while maintaining SVO as default: - **SVO** (Subject-Verb-Object): `"El gat menja peix"` (The cat eats fish) - **VSO** (Verb-Subject-Object): `"Menja el gat peix"` (Eats the cat fish) - emphatic - **VOS** (Verb-Object-Subject): `"Menja peix el gat"` (Eats fish the cat) - rare ### Pro-Drop Subject pronouns often omitted when context is clear: - `"Parla català"` (I/he/she/it speaks Catalan) - subject implicit - `"Hem anat al mercat"` (We have gone to the market) - subject implicit ### Post-Nominal Adjectives Descriptive adjectives typically follow nouns: - `"casa gran"` (big house) - `"llibre interessant"` (interesting book) - Exception: `"bon dia"` (good day) - some adjectives precede for emphasis ### Clitic Pronouns Pronouns can attach to verbs as clitics: - `"Dona'm el llibre"` (Give me the book) - m' = me - `"Digue-li la veritat"` (Tell him/her the truth) - li = him/her ## Test Coverage **74 tests, 0 failures** - Tokenization: 54 tests - Interpunct words - Apostrophe and article contractions - Diacritics - Position tracking - Edge cases - POS Tagging: 20 tests - Basic word classes - Verb conjugations - Catalan-specific features - Context-based tagging ## Implementation Details ### Phrase Parser (`lib/language/catalan/phrase_parser.ex` - 334 lines) - `parse_noun_phrase/2`: Handles quantifiers, determiners, adjectives, and post-modifiers - `parse_verb_phrase/2`: Processes auxiliaries, main verbs, objects, and complements - `parse_prep_phrase/2`: Parses preposition + noun phrase structures - Catalan-specific: Post-nominal adjectives, quantifying adjectives (molt, poc, algun, tot) ### Sentence Parser (`lib/language/catalan/sentence_parser.ex` - 281 lines) - `parse_sentences/2`: Sentence boundary detection and splitting - `parse_clause/2`: Subject and predicate extraction - Catalan subordinators: que, perquè, quan, on, si, encara, mentre, així, doncs, ja - Coordination: i, o, però, sinó, ni ### Dependency Extractor (`lib/language/catalan/dependency_extractor.ex` - 226 lines) - Extracts Universal Dependencies relations from parsed structures - Core relations: nsubj (nominal subject), obj (object), iobj (indirect object) - Modifiers: det (determiner), amod (adjectival modifier), advmod (adverbial modifier) - Function words: aux (auxiliary), case (case marking), mark (subordinating conjunction) - Coordination: cc (coordinating conjunction), conj (conjunct) ### Entity Recognizer (`lib/language/catalan/entity_recognizer.ex` - 285 lines) - Rule-based NER with 6 entity types - **PERSON**: Catalan titles (Sr., Sra., Dr., Dra., Don, Donya), capitalized name sequences - **LOCATION**: Catalan places (Barcelona, Catalunya, València, Girona, Tarragona, Lleida, Andorra) - **ORGANIZATION**: Indicators (banc, universitat, hospital, ajuntament, govern) - **DATE**: Catalan months and days (gener, febrer, març, dilluns, dimarts) - **MONEY**: Euro symbols (€, euros, dòlar, dòlars) - **PERCENT**: Percentage symbols (%, per cent) - Confidence scoring: 0.5-0.95 based on pattern strength ## Future Work (Phase 8 and Beyond) 1. **Summarizer**: Extractive and abstractive text summarization 2. **Coreference Resolution**: Link mentions across sentences 3. **Semantic Role Labeling**: Predicate-argument structure 4. **End-to-end Tests**: Integration tests for complete pipeline 5. **Advanced Entity Recognition**: ML-based NER with larger lexicons 6. **Question Answering**: Extractive QA for Catalan texts 7. **Text Classification**: Sentiment analysis, topic classification ## References - Universal Dependencies Catalan Treebank: [UD_Catalan-AnCora](https://github.com/UniversalDependencies/UD_Catalan-AnCora) - Catalan Grammar: Institut d'Estudis Catalans - Linguistic Patterns: Based on Central Catalan (Barcelona dialect) ## Language Code ISO 639-1: `ca` ISO 639-3: `cat` ## Contributing When enhancing Catalan support: 1. Maintain consistency with Spanish implementation patterns 2. Follow Universal Dependencies standards 3. Document Catalan-specific features 4. Add comprehensive tests for new functionality 5. Update this documentation