Philter.UTF8 (Philter v0.3.0)

UTF-8 safe truncation for streaming data.

When capturing previews of streamed responses, a byte-limit cut can land in the middle of a multi-byte UTF-8 character (e.g., "café" truncated at byte 4 splits the "é", which occupies bytes 4 and 5). This module ensures truncated output remains valid UTF-8.
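The problem described above can be reproduced with the standard library alone. The snippet below is only an illustration: the variable names `text` and `raw_cut` are hypothetical, and the "é" is spelled out as explicit bytes so the example does not depend on how the source file is encoded.

```elixir
# "café" with the "é" written as its two UTF-8 bytes (C3 A9).
text = <<"caf", 0xC3, 0xA9>>

# The string has 4 characters but occupies 5 bytes.
byte_size(text)

# A naive byte-level cut at 4 bytes keeps only the lead byte of "é".
raw_cut = binary_part(text, 0, 4)

# The result is no longer valid UTF-8.
String.valid?(raw_cut)
```

Here `byte_size(text)` is 5 and `String.valid?(raw_cut)` is `false`, which is the situation the functions in this module are meant to prevent.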

Summary

Functions

ensure_valid(binary)
Ensures a binary is valid UTF-8, truncating invalid trailing bytes.

truncate(binary, max_bytes)
Truncates a binary to at most max_bytes, ensuring valid UTF-8.

valid?(binary)
Validates that a binary is valid UTF-8.

Functions

ensure_valid(binary)

@spec ensure_valid(binary()) :: {:ok, binary()} | {:error, :invalid_utf8}

Ensures a binary is valid UTF-8, truncating invalid trailing bytes.

If the binary ends with an incomplete UTF-8 sequence (common when truncating streaming data), those bytes are removed.

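Returns {:ok, valid_binary} on success, where valid_binary is the input with any incomplete trailing sequence removed. Returns {:error, :invalid_utf8} if the binary contains invalid UTF-8 anywhere other than an incomplete trailing sequence.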

Examples

iex> Philter.UTF8.ensure_valid("hello")
{:ok, "hello"}

# Binary ending with incomplete UTF-8 sequence
iex> Philter.UTF8.ensure_valid("hello" <> <<0xC3>>)
{:ok, "hello"}
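One way to get this behaviour is to lean on Erlang's `:unicode.characters_to_binary/3`, which distinguishes input that ends in a truncated character (`:incomplete`) from input that is invalid elsewhere (`:error`). The following is a minimal sketch under that assumption, not necessarily how Philter implements it; the module name `EnsureValidSketch` is hypothetical.

```elixir
defmodule EnsureValidSketch do
  # Hypothetical stand-in for illustration; not Philter's actual code.
  @spec ensure_valid(binary()) :: {:ok, binary()} | {:error, :invalid_utf8}
  def ensure_valid(binary) when is_binary(binary) do
    case :unicode.characters_to_binary(binary, :utf8, :utf8) do
      # Fully valid input comes back as a plain binary.
      valid when is_binary(valid) ->
        {:ok, valid}

      # The input ends in a truncated character: keep the valid prefix.
      {:incomplete, valid_prefix, _truncated_tail} ->
        {:ok, valid_prefix}

      # Invalid bytes that are not just a truncated tail.
      {:error, _valid_prefix, _rest} ->
        {:error, :invalid_utf8}
    end
  end
end
```

In this sketch a binary such as `<<0xFF, 0xFE>>` yields `{:error, :invalid_utf8}`, because those bytes can never start a UTF-8 sequence. Whether Philter treats every such case the same way is an assumption.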

truncate(binary, max_bytes)

@spec truncate(binary(), non_neg_integer()) :: binary()

Truncates a binary to at most max_bytes, ensuring valid UTF-8.

If the cut at max_bytes would fall in the middle of a multi-byte character, the truncation point is moved back to the end of the last complete character, so the result may be shorter than max_bytes.

Returns the truncated binary.

Examples

iex> Philter.UTF8.truncate("hello", 10)
"hello"

iex> Philter.UTF8.truncate("hello", 3)
"hel"

# "é" is 2 bytes (C3 A9), truncating at 5 bytes preserves it
iex> Philter.UTF8.truncate("café", 5)
"café"

# Truncating at 4 bytes would split "é", so we get "caf"
iex> Philter.UTF8.truncate("café", 4)
"caf"
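The backward move of the truncation point can be sketched as follows. UTF-8 continuation bytes always fall in the range 0x80..0xBF, so a cut position is a character boundary exactly when the byte at that position is not a continuation byte. This sketch assumes the input is already valid UTF-8; the module name `TruncateSketch` and the helper `boundary/2` are hypothetical and may differ from Philter's implementation.

```elixir
defmodule TruncateSketch do
  # Hypothetical stand-in for illustration; assumes valid UTF-8 input.
  @spec truncate(binary(), non_neg_integer()) :: binary()
  def truncate(binary, max_bytes) when byte_size(binary) <= max_bytes, do: binary

  def truncate(binary, max_bytes) do
    binary_part(binary, 0, boundary(binary, max_bytes))
  end

  # Position 0 is always a boundary.
  defp boundary(_binary, 0), do: 0

  # Step back while the first byte that would be dropped is a continuation
  # byte, since that means the cut falls inside a multi-byte character.
  defp boundary(binary, pos) do
    if continuation?(:binary.at(binary, pos)) do
      boundary(binary, pos - 1)
    else
      pos
    end
  end

  defp continuation?(byte), do: byte in 0x80..0xBF
end
```

Because a UTF-8 character is at most 4 bytes long, the loop steps back at most 3 positions on valid input, so the cost of finding the boundary does not grow with the size of the binary.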

valid?(binary)

@spec valid?(binary()) :: boolean()

Validates that a binary is valid UTF-8.

Returns true if the binary is valid UTF-8, false otherwise.

Examples

iex> Philter.UTF8.valid?("hello")
true

iex> Philter.UTF8.valid?(<<0xFF, 0xFE>>)
false
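For intuition, validation of this kind can be written with Elixir's `utf8` binary segment type, which only matches a well-formed encoded code point. The sketch below is an assumption about how such a check could work, not Philter's actual code; the module name `ValidSketch` is hypothetical. In practice the standard library's `String.valid?/1` performs the same kind of check.

```elixir
defmodule ValidSketch do
  # Hypothetical stand-in for illustration; not Philter's actual code.
  @spec valid?(binary()) :: boolean()

  # Consume one well-formed UTF-8 code point and check the rest.
  def valid?(<<_codepoint::utf8, rest::binary>>), do: valid?(rest)

  # Every byte was consumed as part of a well-formed code point.
  def valid?(<<>>), do: true

  # The next bytes do not form a well-formed code point.
  def valid?(binary) when is_binary(binary), do: false
end
```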