# Chunking Strategies

The `Rag.Chunker` behavior provides pluggable strategies for splitting text into chunks optimized for different use cases.

## Overview

```elixir
alias Rag.Chunker
alias Rag.Chunker.{Character, Sentence, Paragraph, Recursive}

chunker = %Recursive{max_chars: 500}
chunks = Chunker.chunk(chunker, text)
```

Each chunk is a `%Rag.Chunker.Chunk{}` struct:

```elixir
%Rag.Chunker.Chunk{
  content: String.t(),            # The chunk text
  start_byte: non_neg_integer(),
  end_byte: non_neg_integer(),
  index: non_neg_integer(),
  metadata: map()                 # Chunker-specific metadata
}
```

## Chunkers

### 1. Character (`Rag.Chunker.Character`)

Fixed-size chunks with smart boundary detection.

```elixir
chunker = %Character{max_chars: 500, overlap: 50}
Chunker.chunk(chunker, text)
```

**Options:**
- `max_chars` - Maximum characters per chunk (default: 500)
- `overlap` - Characters to overlap between chunks (default: 50)

**Behavior:**
1. Splits at sentence boundaries (`.!?`) when possible
2. Falls back to word boundaries
3. Falls back to hard split at max_chars
4. Creates overlap for context preservation

**Best for:**
- Consistent embedding sizes
- Unstructured text
- Predictable chunk sizes

### 2. Sentence (`Rag.Chunker.Sentence`)

Preserves complete sentences within chunks.

```elixir
chunker = %Sentence{max_chars: 500, min_chars: 100}
Chunker.chunk(chunker, text)
```

**Options:**
- `max_chars` - Maximum characters per chunk (default: 500)
- `min_chars` - Minimum characters before starting new chunk (optional)

**Behavior:**
1. Splits on sentence boundaries
2. Combines sentences up to max_chars
3. If min_chars specified, continues until reaching minimum
4. Falls back to character-based if a sentence exceeds max_chars

**Best for:**
- Q&A systems
- Well-structured prose
- Semantic coherence

### 3. Paragraph (`Rag.Chunker.Paragraph`)

Preserves paragraph structure and topic boundaries.

```elixir
chunker = %Paragraph{max_chars: 500, min_chars: 100}
Chunker.chunk(chunker, text)
```

**Options:**
- `max_chars` - Maximum characters per chunk (default: 500)
- `min_chars` - Minimum characters before starting new chunk (optional)

**Behavior:**
1. Splits on paragraph boundaries (double newlines)
2. Combines short paragraphs if under min_chars
3. Falls back to sentence-based if paragraph exceeds max_chars

**Best for:**
- Articles and blog posts
- Documentation
- Topic-organized content

### 4. Recursive (`Rag.Chunker.Recursive`)

Hierarchical splitting from paragraph to sentence to character.

```elixir
chunker = %Recursive{max_chars: 500, min_chars: 100}
Chunker.chunk(chunker, text)
```

**Options:**
- `max_chars` - Maximum characters per chunk (default: 500)
- `min_chars` - Minimum characters per chunk (optional)

**Metadata:**
```elixir
%{chunker: :recursive, hierarchy: :paragraph | :sentence | :character}
```

**Best for:**
- Mixed content structures
- Varying document formats
- Smart hierarchy preservation

### 5. Semantic (`Rag.Chunker.Semantic`)

Groups sentences by semantic similarity using embeddings.

```elixir
alias Rag.Router
alias Rag.Chunker.Semantic

{:ok, router} = Router.new(providers: [:gemini])

embedding_fn = fn text ->
  {:ok, [embedding], _} = Router.execute(router, :embeddings, [text], [])
  embedding
end

chunker = %Semantic{embedding_fn: embedding_fn, threshold: 0.8, max_chars: 500}
Chunker.chunk(chunker, text)
```

**Options:**
- `embedding_fn` - **Required** function to generate embeddings
- `threshold` - Similarity threshold for grouping (default: 0.8)
- `max_chars` - Maximum characters per chunk (default: 500)

**Behavior:**
1. Splits text into sentences
2. Generates embedding for each sentence
3. Groups sentences by cosine similarity
4. Continues adding while similarity >= threshold and under max_chars

**Best for:**
- Topic-focused chunks
- High-quality RAG systems
- When API cost is acceptable

### 6. Format-Aware (`Rag.Chunker.FormatAware`)

Format-aware chunking using TextChunker for code and markup formats.

```elixir
alias Rag.Chunker.FormatAware

chunker = %FormatAware{format: :markdown, chunk_size: 500}
Chunker.chunk(chunker, markdown_text)
```

**Options:**
- `format` - Document format (default: :plaintext)
- `chunk_size` - Maximum size in code points (default: 2000)
- `chunk_overlap` - Overlap between chunks (default: 200)
- `size_fn` - Custom size function `(String.t() -> integer())` (optional)

**Note:** This chunker requires TextChunker:

```elixir
{:text_chunker, "~> 0.5.2"}
```

## Strategy Comparison

| Strategy | Chunk Size | Structure | API Calls | Best For |
|----------|-----------|-----------|-----------|----------|
| Character | Consistent | May split thoughts | None | Predictable sizing |
| Sentence | Variable | Complete thoughts | None | Q&A systems |
| Paragraph | Variable | Topic boundaries | None | Structured docs |
| Recursive | Variable | Smart hierarchy | None | Mixed content |
| Semantic | Variable | Semantic groups | Yes | Topic coherence |
| FormatAware | Variable | Format-aware | None | Code and markup |

## Overlap Demonstration

```elixir
text = "First sentence. Second sentence. Third sentence. Fourth sentence."

# No overlap
Chunker.chunk(%Character{max_chars: 40, overlap: 0}, text)

# With overlap
Chunker.chunk(%Character{max_chars: 40, overlap: 20}, text)
```

Overlap helps:
- Preserve context between chunks
- Improve retrieval for information at chunk boundaries
- Reduce information loss during splitting

## Position Validation

```elixir
alias Rag.Chunker.Chunk

chunker = %Character{max_chars: 100}
chunks = Chunker.chunk(chunker, text)

Enum.all?(chunks, fn chunk -> Chunk.valid?(chunk, text) end)
```

## Complete Example

```elixir
alias Rag.Chunker
alias Rag.Chunker.{Character, Sentence, Paragraph, Recursive, Semantic}

# Load document
text = File.read!("document.md")

# Try different strategies
char_chunks = Chunker.chunk(%Character{max_chars: 500, overlap: 50}, text)
sent_chunks = Chunker.chunk(%Sentence{max_chars: 500}, text)
para_chunks = Chunker.chunk(%Paragraph{max_chars: 500}, text)
rec_chunks = Chunker.chunk(%Recursive{max_chars: 500}, text)

# Semantic chunking (requires embedding function)
# The router is created the same way as in the Semantic section above.
{:ok, router} = Rag.Router.new(providers: [:gemini])

embedding_fn = fn text ->
  {:ok, [embedding], _} = Rag.Router.execute(router, :embeddings, [text], [])
  embedding
end

sem_chunks = Chunker.chunk(%Semantic{embedding_fn: embedding_fn, threshold: 0.75}, text)

# Compare results
for {name, chunks} <- [
      {"Character", char_chunks},
      {"Sentence", sent_chunks},
      {"Paragraph", para_chunks},
      {"Recursive", rec_chunks},
      {"Semantic", sem_chunks}
    ] do
  # Average chunk size in characters; 0 when a strategy produced no chunks
  avg_size =
    if length(chunks) > 0 do
      total = Enum.reduce(chunks, 0, fn c, acc -> acc + String.length(c.content) end)
      round(total / length(chunks))
    else
      0
    end

  IO.puts("#{name}: #{length(chunks)} chunks, avg #{avg_size} chars")
end
```
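
To make the grouping rule described for the Semantic chunker more concrete (keep adding sentences while cosine similarity stays at or above `threshold` and the chunk stays under `max_chars`), here is a minimal, self-contained sketch. It is not the `Rag.Chunker.Semantic` implementation: the module name `SemanticGroupingSketch`, the choice to compare each sentence with the one immediately before it, and the toy embedding function are all assumptions made for illustration.

```elixir
defmodule SemanticGroupingSketch do
  # Hypothetical helper module; not part of the Rag library.

  # Cosine similarity between two equal-length lists of numbers.
  def cosine_similarity(a, b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm_a = :math.sqrt(Enum.reduce(a, 0.0, fn x, acc -> acc + x * x end))
    norm_b = :math.sqrt(Enum.reduce(b, 0.0, fn x, acc -> acc + x * x end))

    if norm_a == 0.0 or norm_b == 0.0, do: 0.0, else: dot / (norm_a * norm_b)
  end

  # Walks the sentences in order. A sentence joins the current group when it is
  # similar enough to the previous sentence (assumption) and the joined text,
  # counted with one space per join, stays within max_chars.
  def group(sentences, embedding_fn, threshold, max_chars) do
    sentences
    |> Enum.reduce([], fn sentence, groups ->
      emb = embedding_fn.(sentence)
      len = String.length(sentence)

      case groups do
        [{members, size, prev_emb} | rest] ->
          if cosine_similarity(prev_emb, emb) >= threshold and size + 1 + len <= max_chars do
            [{[sentence | members], size + 1 + len, emb} | rest]
          else
            [{[sentence], len, emb} | groups]
          end

        [] ->
          [{[sentence], len, emb}]
      end
    end)
    |> Enum.reverse()
    |> Enum.map(fn {members, _size, _emb} ->
      members |> Enum.reverse() |> Enum.join(" ")
    end)
  end
end

# Toy embedding (assumption): sentences mentioning "cat" point one way,
# everything else points the other way.
toy_embedding = fn sentence ->
  if String.contains?(sentence, "cat"), do: [1.0, 0.0], else: [0.0, 1.0]
end

SemanticGroupingSketch.group(
  ["The cat sat.", "The cat slept.", "Stocks fell.", "Bonds rose."],
  toy_embedding,
  0.8,
  500
)
# => ["The cat sat. The cat slept.", "Stocks fell. Bonds rose."]
```

In real use the embedding function would call a provider, as in the examples above, so every sentence costs one embedding request. That is the "API Calls: Yes" entry in the comparison table.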