dataprep/prep

dataprep/prep — infallible transformations on a single value.

Reach for this module when the operation always succeeds: trim, lowercase, collapse whitespace, replace substrings, fall back to a default. Compose with then / sequence.

For fallible checks ("is this non-empty?", "does this match a pattern?") use dataprep/validator. The two compose cleanly — see doc/architecture.md for the decision table, the canonical Prep → Validator pipeline recipe, and a worked end-to-end example.
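As a rough illustration of the composition described above, the sketch below chains three of this module's preps with sequence. It assumes the dataprep package is a dependency of the project; normalise_name is a hypothetical helper, not part of the library. The example is chosen so that the result does not depend on the order in which the three steps run.

```gleam
import dataprep/prep
import gleam/io

// Hypothetical helper (not part of dataprep): normalise a display name
// so it can be compared case-insensitively.
pub fn normalise_name(raw: String) -> String {
  let clean =
    prep.sequence([prep.trim(), prep.collapse_space(), prep.lowercase()])
  clean(raw)
}

pub fn main() {
  // Prints "ada lovelace".
  io.println(normalise_name("  Ada \t LOVELACE  "))
}
```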

Types

Prep(a) is an infallible transformation: fn(a) -> a. It always succeeds and never produces errors.

pub type Prep(a) =
  fn(a) -> a
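Because Prep(a) is only an alias for fn(a) -> a, any function of that shape should compose with the built-in preps. The sketch below assumes this; strip_at and clean_handle are hypothetical names introduced for illustration.

```gleam
import dataprep/prep.{type Prep}

// Hypothetical custom prep: drop one leading "@" from a handle.
// It has no failure case, so it fits the Prep contract.
pub fn strip_at() -> Prep(String) {
  fn(s) {
    case s {
      "@" <> rest -> rest
      _ -> s
    }
  }
}

// Trim first, then strip the "@" prefix.
pub fn clean_handle(raw: String) -> String {
  let p = prep.then(first: prep.trim(), next: strip_at())
  p(raw)
}
```

With these definitions, clean_handle("  @lucy ") evaluates to "lucy", and an input without the prefix is only trimmed.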

Values

pub fn collapse_space() -> fn(String) -> String

Collapse consecutive ASCII whitespace into a single space.

Matches the POSIX whitespace class [ \t\n\r\f\v] (space, tab, linefeed, carriage return, form feed, vertical tab). Unicode whitespace such as NO-BREAK SPACE (U+00A0) and IDEOGRAPHIC SPACE (U+3000) is preserved — for those use collapse_unicode_space (it matches the wider Unicode \s set and replaces every run with a single ASCII space).

This split avoids the silent CJK-destruction footgun of rewriting 姓　名 (U+3000 between the names) as 姓 名 (ASCII U+0020 between the names) when the caller only meant to normalise indentation.

Uses let assert for the regex compilation. The pattern is a fixed, known-valid regular expression, so compilation cannot fail at runtime. The assert is intentional and safe.
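A minimal sketch of the behaviour described above, assuming the dataprep package is available; fold_ascii is a hypothetical wrapper used only to make the call site readable.

```gleam
import dataprep/prep

// Hypothetical wrapper: fold runs of ASCII whitespace into one space.
pub fn fold_ascii(raw: String) -> String {
  let p = prep.collapse_space()
  p(raw)
}

pub fn examples() -> List(String) {
  [
    // Mixed space / tab / newline run becomes a single space: "a b".
    fold_ascii("a \t\n b"),
    // U+3000 is not ASCII whitespace, so the string is returned as-is.
    fold_ascii("姓\u{3000}名"),
  ]
}
```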

pub fn collapse_unicode_space() -> fn(String) -> String

Collapse consecutive Unicode whitespace into a single ASCII space.

Matches \s+ under the regex engine’s full Unicode rule, so it recognises NO-BREAK SPACE (U+00A0), IDEOGRAPHIC SPACE (U+3000), LINE / PARAGRAPH SEPARATOR (U+2028 / U+2029), the various EN/EM SPACEs (U+2000..U+200A), etc. Each run — even one made entirely of non-ASCII whitespace — is rewritten to a single ASCII U+0020.

Reach for collapse_space instead when the caller wants to keep CJK / typographic whitespace intact and only fold ASCII runs.

Uses let assert for the regex compilation. The pattern \s+ is a fixed, known-valid regular expression, so compilation cannot fail at runtime. The assert is intentional and safe.
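To contrast the two collapse functions on the same input, here is a small sketch. fold_all and fold_ascii_only are hypothetical wrappers; the results noted in the comments follow from the descriptions of the two functions in this document.

```gleam
import dataprep/prep

// Hypothetical wrapper: fold every Unicode whitespace run into one ASCII space.
pub fn fold_all(raw: String) -> String {
  let p = prep.collapse_unicode_space()
  p(raw)
}

// Hypothetical wrapper: fold ASCII whitespace runs only, for comparison.
pub fn fold_ascii_only(raw: String) -> String {
  let p = prep.collapse_space()
  p(raw)
}

pub fn compare() -> #(String, String) {
  let name = "姓\u{3000}名"
  // First element: "姓 名" with an ASCII space (U+3000 rewritten).
  // Second element: the input unchanged (U+3000 preserved).
  #(fold_all(name), fold_ascii_only(name))
}
```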

pub fn default(fallback: String) -> fn(String) -> String

Replace the value with fallback when the input is exactly the literal empty string "".

Whitespace-only inputs like " ", "\t", " \n " are passed through unchanged — only s == "" triggers the fallback. Reach for default_when_blank instead when you want the broader "missing or whitespace-only" check, or compose with trim:

prep.trim() |> prep.then(first: _, next: prep.default("N/A"))
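The exact-empty-string rule can be seen in a short sketch. nickname_or_na is a hypothetical helper, and "N/A" is just an example fallback.

```gleam
import dataprep/prep

// Hypothetical helper: show "N/A" when the nickname field is empty.
pub fn nickname_or_na(raw: String) -> String {
  let p = prep.default("N/A")
  p(raw)
}

pub fn examples() -> List(String) {
  [
    // Exactly "" triggers the fallback: "N/A".
    nickname_or_na(""),
    // A single space is not empty, so it passes through: " ".
    nickname_or_na(" "),
    // Ordinary values are untouched: "Lu".
    nickname_or_na("Lu"),
  ]
}
```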

pub fn default_when_blank(
  fallback: String,
) -> fn(String) -> String

Replace the value with fallback when the input is the literal empty string "" or consists only of whitespace (per string.trim).

Examples that fire the fallback: "", " ", "\t", "\r\n", " \n ". Examples that do not: "a", " a ", "\t hello".

Differs from prep.trim() |> prep.then(prep.default(fallback)) only on the non-blank path: that composition returns the trimmed value, whereas this helper returns the original, un-trimmed input. This matches the posture of default: only substitute, never edit. Use the explicit trim |> default composition when the trimmed form is the desired output.
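The difference between the dedicated helper and the trim-then-default composition only shows up for non-blank inputs with surrounding whitespace. The sketch below puts them side by side; keep_original and keep_trimmed are hypothetical names.

```gleam
import dataprep/prep

// Substitutes on blank input, otherwise returns the input untouched.
pub fn keep_original(raw: String) -> String {
  let p = prep.default_when_blank("N/A")
  p(raw)
}

// Trims first, then substitutes if the trimmed value is "".
pub fn keep_trimmed(raw: String) -> String {
  let p = prep.then(first: prep.trim(), next: prep.default("N/A"))
  p(raw)
}
```

Both functions return "N/A" for " \n ". For "  a " the first returns "  a " while the second returns "a".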

pub fn identity() -> fn(a) -> a

No-op prep. Returns the value unchanged.

pub fn lowercase() -> fn(String) -> String

Convert to lowercase.

pub fn replace(
  target target: String,
  replacement replacement: String,
) -> fn(String) -> String

Replace all occurrences of target with replacement.
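A brief sketch of replace using its labelled arguments. strip_dashes is a hypothetical helper, and the phone-number-style input is made up for illustration.

```gleam
import dataprep/prep

// Hypothetical helper: remove every "-" from a string.
pub fn strip_dashes(raw: String) -> String {
  let p = prep.replace(target: "-", replacement: "")
  p(raw)
}
```

Here strip_dashes("555-01-23") evaluates to "5550123", and a string with no dashes is returned unchanged.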

pub fn sequence(steps: List(fn(a) -> a)) -> fn(a) -> a

Compose a list of preps into a single prep.

identity() is the identity element of sequential composition, so sequence([]) returns a prep that leaves every input unchanged. This is a deliberate monoid law (see test/dataprep/laws_test.gleam) and lets callers build prep lists incrementally — for example via list.filter(all_preps, by_feature_flag) — without a special case when the resulting list happens to be empty.
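The incremental-build pattern mentioned above might look like the following sketch. The flag parameters and the build function are hypothetical; the point is that no special case is needed when both flags are off and the filtered list is empty.

```gleam
import dataprep/prep
import gleam/list

// Hypothetical builder: pair each prep with the flag that enables it,
// keep the enabled ones, and compose whatever is left.
pub fn build(lowercase_on: Bool, collapse_on: Bool) -> fn(String) -> String {
  [
    #(lowercase_on, prep.lowercase()),
    #(collapse_on, prep.collapse_space()),
  ]
  |> list.filter(fn(pair) { pair.0 })
  |> list.map(fn(pair) { pair.1 })
  |> prep.sequence
}
```

With both flags off, build(False, False) behaves like identity(): "  A  B " comes back unchanged. With both on, the same input becomes " a b ".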

pub fn then(
  first p1: fn(a) -> a,
  next p2: fn(a) -> a,
) -> fn(a) -> a

Sequential composition: apply p1, then apply p2 to the result.
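Order matters in sequential composition. The sketch below, with hypothetical function names, runs the same two preps in both orders; it assumes replace matches target literally and case-sensitively.

```gleam
import dataprep/prep

// Uppercase first, so every "a" has become "A" before replace runs.
pub fn upper_then_replace(raw: String) -> String {
  let p =
    prep.then(
      first: prep.uppercase(),
      next: prep.replace(target: "A", replacement: "4"),
    )
  p(raw)
}

// Replace first: the lowercase input contains no "A", so nothing is
// replaced and only the uppercase step has a visible effect.
pub fn replace_then_upper(raw: String) -> String {
  let p =
    prep.then(
      first: prep.replace(target: "A", replacement: "4"),
      next: prep.uppercase(),
    )
  p(raw)
}
```

For the input "banana", the first function yields "B4N4N4" and the second yields "BANANA".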

pub fn trim() -> fn(String) -> String

Trim leading and trailing whitespace.

pub fn uppercase() -> fn(String) -> String

Convert to uppercase.
