TextChunker (TextChunker v0.4.0)
Provides a high-level interface for text chunking, employing a configurable splitting strategy (defaults to recursive splitting). Manages options and coordinates the process, tracking chunk metadata.
Key Features
- Customizable Splitting: Allows the splitting strategy to be customized via the :strategy option.
- Size and Overlap Control: Provides options for :chunk_size and :chunk_overlap.
- Metadata Tracking: Generates Chunk structs containing byte range information.
Supported Options
- :chunk_size (positive integer, default: 2000) - Maximum size of each chunk, measured in tokens as counted by :get_chunk_size.
- :get_chunk_size (function, default: &String.length/1) - A function that returns the number of tokens in a chunk. The default, String.length/1, counts Unicode graphemes.
- :chunk_overlap (non-negative integer, default: 200) - Number of overlapping tokens between consecutive chunks to preserve context.
- :strategy (module, default: RecursiveChunk) - A module implementing the split function. Currently only RecursiveChunk is supported.
- :format (atom, default: :plaintext) - The format of the input text. Used to determine where to split the text in some strategies.
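Since :get_chunk_size decides what a "token" is, it may help to see how different counting functions disagree on the same string. The sketch below uses only Elixir's standard library; the TokenCounters module and its word_count/1 function are hypothetical names introduced for illustration and are not part of TextChunker.

```elixir
defmodule TokenCounters do
  # Hypothetical helper module, not part of TextChunker.

  # Counts whitespace-separated words: a rough stand-in for a real tokenizer.
  def word_count(text) when is_binary(text) do
    text |> String.split() |> length()
  end
end

# "e" followed by a combining acute accent (U+0301): one grapheme, two code points.
text = "cafe\u0301 au lait"

IO.inspect(String.length(text), label: "graphemes")              # graphemes: 12
IO.inspect(length(String.codepoints(text)), label: "code points") # code points: 13
IO.inspect(byte_size(text), label: "bytes")                       # bytes: 14
IO.inspect(TokenCounters.word_count(text), label: "words")        # words: 3
```

Assuming the option accepts any one-argument function that returns an integer, as the description above suggests, a counter like this could be passed as get_chunk_size: &TokenCounters.word_count/1, with :chunk_size and :chunk_overlap then interpreted in words rather than graphemes.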
Functions
Splits the provided text into a list of %Chunk{} structs.
Examples
iex> long_text = "This is a very long text that needs to be split into smaller pieces for easier handling."
iex> TextChunker.split(long_text)
# => [%Chunk{}, %Chunk{}, ...]
iex> TextChunker.split(long_text, chunk_size: 10, chunk_overlap: 3)
# => Generates many smaller chunks with significant overlap
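To make the effect of a small chunk size with overlap more concrete, here is a minimal sketch of fixed-size windows that share a few graphemes with their neighbours. It is not the RecursiveChunk strategy, which splits on separators and will generally produce different boundaries. The NaiveWindow module and its tuple format {start_byte, end_byte, text} (with end_byte exclusive) are assumptions made for this illustration only.

```elixir
defmodule NaiveWindow do
  # Hypothetical illustration only; not TextChunker's algorithm.

  # Returns {start_byte, end_byte, text} tuples, where end_byte is exclusive.
  # Sizes are measured in graphemes, matching String.length/1.
  def split("", _chunk_size, _chunk_overlap), do: []

  def split(text, chunk_size, chunk_overlap)
      when chunk_size > 0 and chunk_overlap >= 0 and chunk_overlap < chunk_size do
    step = chunk_size - chunk_overlap
    windows(String.graphemes(text), chunk_size, step, 0, [])
  end

  defp windows(graphemes, size, step, byte_offset, acc) do
    chunk_text = graphemes |> Enum.take(size) |> Enum.join()
    chunk = {byte_offset, byte_offset + byte_size(chunk_text), chunk_text}

    if length(graphemes) <= size do
      # This window reaches the end of the text, so stop here.
      Enum.reverse([chunk | acc])
    else
      # Advance by `step` graphemes and track how many bytes that skips.
      skipped = graphemes |> Enum.take(step) |> Enum.join() |> byte_size()
      windows(Enum.drop(graphemes, step), size, step, byte_offset + skipped, [chunk | acc])
    end
  end
end

# Each chunk repeats the last grapheme of the previous one (overlap of 1).
IO.inspect(NaiveWindow.split("abcdefghij", 4, 1))
# [{0, 4, "abcd"}, {3, 7, "defg"}, {6, 10, "ghij"}]
```

The byte offsets let a caller recover each chunk from the original binary with binary_part(text, start_byte, end_byte - start_byte), which is the kind of lookup that byte range metadata makes possible.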