Splits document text into chunks suitable for embedding and retrieval.
Uses a paragraph/section-based strategy with configurable target token count and overlap. Preserves section headings as context metadata.
Types
@type t() :: %Recollect.Pipeline.Chunker{
  content: String.t(),
  end_offset: non_neg_integer(),
  heading_context: String.t() | nil,
  sequence: non_neg_integer(),
  start_offset: non_neg_integer(),
  token_count: non_neg_integer()
}
Functions
Split text into chunks.
Options
:target_tokens - target token count per chunk (default: 512)
:overlap_tokens - token overlap between chunks (default: 50)
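To illustrate how a target size and an overlap might interact, here is a minimal sketch of paragraph-based chunking. It is not the library's implementation: the module name ChunkerSketch and its helpers are hypothetical, one token is assumed to be one whitespace-separated word, chunks are returned as plain maps rather than the documented struct, and the offset and heading-context fields are left out for brevity.

```elixir
defmodule ChunkerSketch do
  @moduledoc false

  # Hypothetical sketch: groups paragraphs until the target is exceeded,
  # then starts the next chunk with the trailing words of the previous one.
  def chunk(text, opts \\ []) do
    # Defaults mirror the documented option defaults.
    target = Keyword.get(opts, :target_tokens, 512)
    overlap = Keyword.get(opts, :overlap_tokens, 50)

    # Paragraphs are separated by one or more blank lines.
    paragraphs = String.split(text, ~r/\n\s*\n/, trim: true)

    {done, current} =
      Enum.reduce(paragraphs, {[], []}, fn para, {done, current} ->
        candidate = current ++ [para]

        if current != [] and count_tokens(Enum.join(candidate, "\n\n")) > target do
          # Close the current chunk and seed the next one with its overlap tail.
          finished = Enum.join(current, "\n\n")
          seed = Enum.reject([tail_words(finished, overlap), para], &(&1 == ""))
          {[finished | done], seed}
        else
          {done, candidate}
        end
      end)

    # Flush whatever is left after the last paragraph.
    done = if current == [], do: done, else: [Enum.join(current, "\n\n") | done]

    done
    |> Enum.reverse()
    |> Enum.with_index()
    |> Enum.map(fn {content, index} ->
      %{sequence: index, content: content, token_count: count_tokens(content)}
    end)
  end

  # Assumption: one token is roughly one whitespace-separated word.
  defp count_tokens(text), do: text |> String.split() |> length()

  # Returns the last n words of the text, joined by single spaces.
  defp tail_words(_text, 0), do: ""

  defp tail_words(text, n) do
    text |> String.split() |> Enum.take(-n) |> Enum.join(" ")
  end
end
```

With this sketch, ChunkerSketch.chunk("a b c\n\nd e f\n\ng h i", target_tokens: 6, overlap_tokens: 2) yields two chunks, and the second begins with the last two words of the first. A single paragraph larger than the target is kept whole here; a real chunker would likely split it further.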
@spec estimate_tokens(String.t()) :: non_neg_integer()
Estimate the token count for a string.
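The documentation does not say how the estimate is computed. A common heuristic, shown here purely as an assumption, is to treat roughly four characters as one token. The module name TokenEstimateSketch and the ratio are hypothetical and may differ from what the library does.

```elixir
defmodule TokenEstimateSketch do
  @moduledoc false

  # Assumed ratio: about 4 characters per token (a rough heuristic for
  # English text, not a value taken from the library).
  @chars_per_token 4

  @spec estimate_tokens(String.t()) :: non_neg_integer()
  def estimate_tokens(text) when is_binary(text) do
    # Integer ceiling division of the grapheme count by the assumed ratio.
    div(String.length(text) + @chars_per_token - 1, @chars_per_token)
  end
end
```

Under this assumption, an empty string estimates to 0 tokens and "hello world" (11 characters) estimates to 3. An estimate like this avoids loading a tokenizer, at the cost of accuracy for code, non-English text, or unusual punctuation.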