Recollect.Pipeline.Chunker (recollect v0.5.1)


Splits document text into chunks suitable for embedding and retrieval.

Uses a paragraph- and section-based strategy with a configurable target token count and overlap. Section headings are preserved as context metadata in each chunk's heading_context field.

Summary

Functions

chunk(text, opts \\ [])

Split text into chunks.

estimate_tokens(text)

Estimate the token count for a string.

Types

t()

@type t() :: %Recollect.Pipeline.Chunker{
  content: String.t(),
  end_offset: non_neg_integer(),
  heading_context: String.t() | nil,
  sequence: non_neg_integer(),
  start_offset: non_neg_integer(),
  token_count: non_neg_integer()
}

Functions

chunk(text, opts \\ [])

@spec chunk(
  String.t(),
  keyword()
) :: [t()]

Splits text into a list of chunks, each returned as a %Recollect.Pipeline.Chunker{} struct.

Options

  • :target_tokens - target token count per chunk (default: 512)
  • :overlap_tokens - number of tokens shared between consecutive chunks (default: 50)
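
The following is a minimal usage sketch, not taken from the library's own documentation. It assumes the :recollect dependency is available in your project, and that Markdown-style headings in the sample text are what the chunker treats as section headings (the page above does not specify the heading format). Only the function, options, and struct fields documented here are used.

```elixir
# Assumes :recollect (v0.5.1) is already a dependency of the running project.
alias Recollect.Pipeline.Chunker

# Sample input; the Markdown-style headings are an assumption for illustration.
text = """
# Installation

Add the package to your dependencies and fetch it.

# Usage

Call the chunker on any document text before embedding it.
"""

# Override both documented options; omitting them falls back to 512 and 50.
chunks = Chunker.chunk(text, target_tokens: 256, overlap_tokens: 32)

# Each element is a %Recollect.Pipeline.Chunker{} struct with the fields from t().
for chunk <- chunks do
  IO.puts(
    "chunk #{chunk.sequence}: offsets #{chunk.start_offset}..#{chunk.end_offset}, " <>
      "#{chunk.token_count} tokens, heading: #{inspect(chunk.heading_context)}"
  )
end
```

Since heading_context is typed as String.t() | nil, inspect/1 is used so that a missing heading prints as nil instead of an empty string.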

estimate_tokens(text)

@spec estimate_tokens(String.t()) :: non_neg_integer()

Estimate the token count for a string.
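
A short sketch of how this function might be used to check text against a token budget before chunking. It assumes the :recollect dependency is available; the budget value and the check itself are hypothetical and only mirror the documented :target_tokens default. How the estimate is computed is not described on this page, so treat the result as approximate.

```elixir
# Assumes :recollect (v0.5.1) is already a dependency of the running project.
alias Recollect.Pipeline.Chunker

paragraph = "Chunking splits long documents into pieces that fit an embedding model."

# Per the @spec, the result is a non-negative integer.
estimated = Chunker.estimate_tokens(paragraph)

# Hypothetical budget check; 512 mirrors the documented :target_tokens default.
budget = 512

if estimated <= budget do
  IO.puts("Fits within the budget (about #{estimated} tokens)")
else
  IO.puts("Exceeds the budget (about #{estimated} tokens); consider chunking")
end
```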