UTF-8 safe truncation for streaming data.
When capturing previews of streamed responses, truncation may fall in the middle of a multi-byte UTF-8 character (e.g., "café" truncated at byte 4 splits the "é"). This module ensures truncated output remains valid UTF-8.
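The problem can be reproduced with a naive byte-level cut. This is a minimal sketch using only Elixir's standard library (`binary_part/3` and `String.valid?/1`); it does not call this module.

```elixir
# "café" is 5 bytes: "c", "a" and "f" take 1 byte each, "é" takes 2 (0xC3 0xA9).
word = "café"
IO.inspect(byte_size(word))       # prints 5

# A naive cut at 4 bytes keeps only the first half of "é".
naive = binary_part(word, 0, 4)
IO.inspect(naive)                 # prints <<99, 97, 102, 195>>

# The result is no longer valid UTF-8.
IO.inspect(String.valid?(naive))  # prints false
```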
Summary
Functions
Ensures a binary is valid UTF-8, truncating invalid trailing bytes.
Truncates a binary to at most max_bytes, ensuring valid UTF-8.
Validates that a binary is valid UTF-8.
Functions
Ensures a binary is valid UTF-8, truncating invalid trailing bytes.
If the binary ends with an incomplete UTF-8 sequence (common when truncating streaming data), those bytes are removed.
Returns {:ok, valid_binary} on success, or {:error, :invalid_utf8} if the binary contains invalid UTF-8 anywhere other than in its trailing bytes.
Examples
iex> Philter.UTF8.ensure_valid("hello")
{:ok, "hello"}
# Binary ending with incomplete UTF-8 sequence
iex> Philter.UTF8.ensure_valid("hello" <> <<0xC3>>)
{:ok, "hello"}
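One way such a check could work is sketched below. This is not Philter's actual source: the module name `EnsureValidSketch` is hypothetical, and the approach (retrying `String.valid?/1` after dropping up to three trailing bytes) is an assumption. Three bytes is the longest possible incomplete tail, since a UTF-8 character is at most four bytes long.

```elixir
defmodule EnsureValidSketch do
  # Hypothetical illustration; Philter.UTF8 may be implemented differently.

  @spec ensure_valid(binary()) :: {:ok, binary()} | {:error, :invalid_utf8}
  def ensure_valid(binary) when is_binary(binary) do
    # Never try to drop more bytes than the binary contains.
    max_drop = min(3, byte_size(binary))

    # Try the binary as-is first (drop = 0), then with 1, 2 and 3 bytes removed.
    # The first candidate that is valid UTF-8 wins; otherwise return the error.
    Enum.find_value(0..max_drop, {:error, :invalid_utf8}, fn drop ->
      candidate = binary_part(binary, 0, byte_size(binary) - drop)
      if String.valid?(candidate), do: {:ok, candidate}
    end)
  end
end
```

Note that this sketch removes up to three trailing bytes whenever doing so leaves valid UTF-8, without checking that they form a genuine prefix of a longer character. Invalid bytes earlier in the binary still produce `{:error, :invalid_utf8}`.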
@spec truncate(binary(), non_neg_integer()) :: binary()
Truncates a binary to at most max_bytes, ensuring valid UTF-8.
If the cut would fall in the middle of a multi-byte character, the truncation point is moved back to the end of the last complete character.
Returns the truncated binary.
Examples
iex> Philter.UTF8.truncate("hello", 10)
"hello"
iex> Philter.UTF8.truncate("hello", 3)
"hel"
# "é" is 2 bytes (C3 A9), so truncating at 5 bytes preserves it
iex> Philter.UTF8.truncate("café", 5)
"café"
# Truncating at 4 bytes would split "é", so we get "caf"
iex> Philter.UTF8.truncate("café", 4)
"caf"
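The backward move of the truncation point can be sketched as follows. This is a hypothetical illustration, not the module's actual code: the name `TruncateSketch` is invented, and the sketch assumes the input is already valid UTF-8. It relies on the fact that every continuation byte in UTF-8 has the bit pattern `10xxxxxx`, so a cut is safe only when the first byte being cut off is not a continuation byte.

```elixir
defmodule TruncateSketch do
  # Hypothetical illustration; assumes `binary` is valid UTF-8.

  @spec truncate(binary(), non_neg_integer()) :: binary()
  def truncate(binary, max_bytes) when byte_size(binary) <= max_bytes, do: binary

  def truncate(binary, max_bytes) do
    binary_part(binary, 0, boundary(binary, max_bytes))
  end

  # Position 0 is always a character boundary.
  defp boundary(_binary, 0), do: 0

  defp boundary(binary, pos) do
    # The byte at `pos` is the first byte that would be cut off.
    # If it is a continuation byte (10xxxxxx), the cut splits a character,
    # so step back one byte and check again.
    if Bitwise.band(:binary.at(binary, pos), 0xC0) == 0x80 do
      boundary(binary, pos - 1)
    else
      pos
    end
  end
end
```

With "café" and a limit of 4, the byte at position 4 is 0xA9 (a continuation byte), so the boundary moves back to position 3 and the result is "caf".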
Validates that a binary is valid UTF-8.
Returns true if the binary is valid UTF-8, false otherwise.
Examples
iex> Philter.UTF8.valid?("hello")
true
iex> Philter.UTF8.valid?(<<0xFF, 0xFE>>)
false
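For illustration, a check of this kind can be written with Elixir's `utf8` binary pattern, which matches only well-formed encoded code points. The module name `ValidSketch` is hypothetical and the real function may simply delegate to `String.valid?/1`; this sketch only shows what such validation involves.

```elixir
defmodule ValidSketch do
  # Hypothetical illustration; not Philter's actual source.

  @spec valid?(binary()) :: boolean()
  # An empty binary is trivially valid.
  def valid?(<<>>), do: true
  # Consume one well-formed UTF-8 code point and check the rest.
  def valid?(<<_codepoint::utf8, rest::binary>>), do: valid?(rest)
  # Anything else starts with a malformed or incomplete sequence.
  def valid?(binary) when is_binary(binary), do: false
end
```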