Credence.Rule.UnnecessaryGraphemeChunking (credence v0.2.0)

Detects inefficient string transformation pipelines that:

  1. Convert a UTF-8 binary into graphemes or codepoints
  2. Perform chunking or grouping operations on the resulting list
  3. Immediately reconstruct strings from those chunks

This pattern often indicates unnecessary intermediate allocations: binary → list → list of lists → binaries.

While correct, this transformation is usually avoidable and can often be replaced with a more direct sliding-window or binary-based approach.

Why this is a problem

Elixir strings are UTF-8 binaries. Converting them into grapheme lists:

String.graphemes("café")
# => ["c", "a", "f", "é"]

creates a full intermediate structure in memory. If we then chunk and rebuild strings, we are effectively doing:

binary → list → list of lists → binaries

which increases:

  • memory usage (multiple allocations)
  • CPU cost (repeated traversal)
  • garbage collection pressure

Example (flagged)

string
|> String.graphemes()
|> Enum.chunk_every(3, 1, :discard)
|> Enum.map(&Enum.join/1)

This:

  • expands the entire string into a list
  • builds overlapping sublists
  • reconstructs each substring separately
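
To make the three steps concrete, the flagged pipeline can be traced on a small input. This is only an illustrative sketch; the variable names are chosen here and do not come from the rule itself.

```elixir
string = "café"

# Step 1: the whole string is expanded into a list of graphemes.
graphemes = String.graphemes(string)
# ["c", "a", "f", "é"]

# Step 2: overlapping windows are built as a list of lists.
chunks = Enum.chunk_every(graphemes, 3, 1, :discard)
# [["c", "a", "f"], ["a", "f", "é"]]

# Step 3: every window is joined back into a new binary.
trigrams = Enum.map(chunks, &Enum.join/1)
# ["caf", "afé"]
```

Each grapheme ends up in up to three window lists before being copied into an output binary.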

Better alternatives

1. Direct binary slicing (preferred when valid)

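# n is the window size. String.slice/3 counts graphemes, not bytes.
# The //1 step keeps the range empty when the string is shorter than n.
for i <- 0..(String.length(string) - n)//1 do
  String.slice(string, i, n)
end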
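
When every character is known to occupy exactly one byte (for example, plain ASCII input), the windows can be taken straight from the binary by byte offset. The following is a minimal sketch under that assumption; `ByteNGrams` and `byte_ngrams/2` are hypothetical names, not part of Credence.

```elixir
defmodule ByteNGrams do
  # Hypothetical helper. Assumes single-byte characters (e.g. ASCII);
  # multi-byte UTF-8 characters would be split mid-sequence.
  def byte_ngrams(string, n) when is_binary(string) and is_integer(n) and n > 0 do
    size = byte_size(string)

    # The //1 step keeps the range empty when the input is shorter than n.
    for i <- 0..(size - n)//1 do
      # binary_part/3 extracts n bytes starting at byte offset i.
      binary_part(string, i, n)
    end
  end
end
```

For example, `ByteNGrams.byte_ngrams("abcde", 3)` returns `["abc", "bcd", "cde"]`. No grapheme list or list of lists is built along the way.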

2. Single grapheme conversion (if Unicode safety is required)

graphemes = String.graphemes(string)

# The //1 step keeps the range empty when there are fewer than n graphemes.
for i <- 0..(length(graphemes) - n)//1 do
  graphemes
  |> Enum.slice(i, n)
  |> Enum.join()
end
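
Because `Enum.slice/3` on a list has to walk from the head to index `i` on every iteration, one possible refinement is to advance along the grapheme list once and take each window from the current position. The sketch below assumes that approach; `GraphemeWindows` and `ngrams/2` are hypothetical names used only for illustration.

```elixir
defmodule GraphemeWindows do
  # Hypothetical helper: overlapping n-grapheme windows, Unicode-safe.
  def ngrams(string, n) when is_binary(string) and is_integer(n) and n > 0 do
    string
    |> String.graphemes()
    |> windows(n, [])
  end

  # Take the next n graphemes from the current position, then step forward by one.
  defp windows(list, n, acc) do
    case Enum.take(list, n) do
      window when length(window) == n ->
        windows(tl(list), n, [IO.iodata_to_binary(window) | acc])

      _too_short ->
        # Fewer than n graphemes remain, so no further full window exists.
        Enum.reverse(acc)
    end
  end
end
```

For example, `GraphemeWindows.ngrams("café", 2)` returns `["ca", "af", "fé"]`.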

3. Algorithmic restructuring

In many cases, substring generation is not needed at all and can be replaced with a streaming or incremental computation.
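
As one illustration of such a restructuring, suppose the real goal is to find the longest run of identical consecutive graphemes. Rather than generating windows and comparing them, the string can be walked once with `String.next_grapheme/1`, carrying only the previous grapheme and two counters. The task and the names `RunLength` and `max_run/1` are assumptions made for this sketch.

```elixir
defmodule RunLength do
  # Hypothetical example: longest run of identical consecutive graphemes,
  # computed in one pass without building a grapheme list or any substrings.
  def max_run(string) when is_binary(string) do
    walk(String.next_grapheme(string), nil, 0, 0)
  end

  # End of input: the final run may be the longest one.
  defp walk(nil, _prev, run, best), do: max(run, best)

  # Same grapheme as the previous one: extend the current run.
  defp walk({g, rest}, g, run, best) do
    walk(String.next_grapheme(rest), g, run + 1, best)
  end

  # Different grapheme: close the current run and start a new one.
  defp walk({g, rest}, _prev, run, best) do
    walk(String.next_grapheme(rest), g, 1, max(run, best))
  end
end
```

For example, `RunLength.max_run("aabbbc")` returns `3`, and the only state kept between steps is the previous grapheme and two integers.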

When NOT to flag

  • Small input sizes where clarity is more important than performance
  • One-off transformations in scripts or tests
  • Cases where grapheme correctness is explicitly required and simplicity is preferred