Snowball.Runtime (snowball v0.1.1)


Runtime state and primitive operations for Snowball stemmers.

This is the Elixir analogue of the canonical BaseStemmer class found in the Python and JavaScript snowball runtimes. Generated stemmer modules call into the functions on this module to manipulate cursor position, test character groupings, dispatch to suffix tables (among), and perform slice replacements.

State model

Snowball is conceptually mutable — every command moves the cursor or rewrites the buffer. In Elixir we thread an immutable %Runtime{} through every primitive. Each primitive returns either the updated struct (success) or the atom :fail (failure). Generated code uses pattern matches like:

case Runtime.eq_s(state, "ing") do
  :fail -> :fail
  state -> state
end

to compose primitives.
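As a rough illustration of this threading style, the sketch below chains two primitives with `with`, so the first :fail short-circuits the rest. It is not the library's code: `ThreadingSketch`, its `eq_s/2`, and `match_both/3` are hypothetical stand-ins, and a plain map is used in place of %Runtime{}.

```elixir
defmodule ThreadingSketch do
  # Hypothetical stand-in for a primitive: returns the updated state or :fail.
  def eq_s(%{current: buf, cursor: c, limit: limit} = state, string) do
    size = byte_size(string)

    if c + size <= limit and binary_part(buf, c, size) == string do
      %{state | cursor: c + size}
    else
      :fail
    end
  end

  # Sequence two primitives. If either returns :fail, the `%{} = ...` pattern
  # does not match and `with` returns :fail unchanged.
  def match_both(state, first, second) do
    with %{} = s1 <- eq_s(state, first),
         %{} = s2 <- eq_s(s1, second) do
      s2
    end
  end
end

state = %{current: "running", cursor: 0, limit: 7}
%{cursor: 7} = ThreadingSketch.match_both(state, "run", "ning")
:fail = ThreadingSketch.match_both(state, "run", "xyz")
```

Because a struct is also a map, the same `%{} = s <- ...` pattern would accept a struct-based state as well.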

Cursor units

All cursor positions and limits are byte offsets into the UTF-8 buffer. This matches the canonical -utf8 mode of the Snowball reference compiler and lets eq_s/2 perform a direct byte-level prefix match (UTF-8 is self-synchronising, so byte positions land on codepoint boundaries provided cursor moves are made via the public primitives).
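To make the byte-offset convention concrete, the snippet below uses only standard Elixir functions on a word containing one two-byte character. The word itself is just an illustrative choice.

```elixir
# "ï" is written as an escape so the encoding is unambiguous (U+00EF, two bytes).
word = "na\u00EFve"

# Six bytes for five codepoints.
6 = byte_size(word)

# A cursor of 2 sits just before "ï"; the next codepoint boundary is byte 4.
"\u00EF" = binary_part(word, 2, 2)

# Offset 3 falls inside the two-byte sequence, so the prefix is not valid UTF-8.
false = String.valid?(binary_part(word, 0, 3))
```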

State fields

  • current — the working buffer (UTF-8 binary).

  • cursor — the current scan position (byte offset).

  • limit — the forward limit (byte offset, exclusive).

  • limit_backward — the backward limit (byte offset, inclusive lower bound).

  • bra and ket — the slice marks set by the [ and ] Snowball commands. Replacement and deletion operate on the [bra, ket) range.


Types

among_entry()

@type among_entry() :: {binary(), integer(), integer(), nil | (t() -> result())}

An entry in an among table:

{string, substring_i, result, function_or_nil}

  • string — the literal to match (UTF-8 binary).
  • substring_i — index in the entries list of the longest entry that is a prefix of this one, or -1 if none.
  • result — non-zero integer returned on a successful match.
  • function_or_nil — optional filter function (t() -> t() | :fail).

fail()

@type fail() :: :fail

A failure return from a primitive. Generated code propagates :fail upward until a Snowball combinator (try, or, and similar) catches it.

result()

@type result() :: t() | fail()

The result of a Snowball primitive that may succeed or fail.

t()

@type t() :: %Snowball.Runtime{
  bra: non_neg_integer(),
  current: binary(),
  cursor: non_neg_integer(),
  ket: non_neg_integer(),
  limit: non_neg_integer(),
  limit_backward: non_neg_integer(),
  vars: %{optional(atom()) => term()}
}

Functions

assign_to(runtime)

@spec assign_to(t()) :: binary()

Return the buffer contents up to the current limit.

This is the canonical assign_to operation — used as the final result extractor at the end of a stem.

Arguments

  • runtime is a t/0 (the stemmer state).

Returns

  • A UTF-8 binary containing the buffer from byte 0 up to limit.

Examples

iex> "hello" |> Snowball.Runtime.new() |> Snowball.Runtime.assign_to()
"hello"

codepoint_length(string)

@spec codepoint_length(binary()) :: non_neg_integer()

Count the number of Unicode codepoints in a UTF-8 binary.

Snowball's len builtin counts codepoints, not grapheme clusters. Elixir's String.length/1 counts grapheme clusters, which gives a different result for text that combines base characters with combining marks (Tamil, Hindi, Arabic, etc.). This function counts codepoints by counting the bytes that begin a UTF-8 sequence: lead bytes and single-byte characters.

Arguments

  • string is a UTF-8 binary.

Returns

  • The number of Unicode codepoints in string.

Examples

iex> Snowball.Runtime.codepoint_length("hello")
5

iex> Snowball.Runtime.codepoint_length("ஞ்சா")
4
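One way to count codepoints in this manner is to skip UTF-8 continuation bytes (those of the form 0b10xxxxxx) and count everything else. The sketch below shows the idea. `CodepointCountSketch` is a hypothetical module, not the library's implementation.

```elixir
defmodule CodepointCountSketch do
  import Bitwise

  # Count every byte that is not a UTF-8 continuation byte.
  def count(binary), do: count(binary, 0)

  defp count(<<>>, acc), do: acc

  # Continuation bytes have the top two bits set to 10; they do not start a codepoint.
  defp count(<<byte, rest::binary>>, acc) when band(byte, 0xC0) == 0x80,
    do: count(rest, acc)

  # Anything else is an ASCII byte or a lead byte, so it starts a codepoint.
  defp count(<<_byte, rest::binary>>, acc), do: count(rest, acc + 1)
end

5 = CodepointCountSketch.count("hello")

# "e" followed by a combining acute accent: two codepoints, one grapheme cluster.
2 = CodepointCountSketch.count("e\u0301")
1 = String.length("e\u0301")
```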

eq_s(state, string)

@spec eq_s(t(), binary()) :: result()

Test whether the buffer at the cursor begins with string; on success advance the cursor past the match.

Mirrors eq_s in the canonical runtime. string is matched as raw UTF-8 bytes — this is sound because Snowball literals are always whole codepoint sequences.

Arguments

  • state is a t/0.

  • string is a UTF-8 binary to match at the cursor.

Returns

  • The updated state with cursor advanced by byte_size(string) on match.

  • :fail if the buffer at the cursor does not start with string, or if the match would cross the forward limit.

Examples

iex> state = Snowball.Runtime.new("running")
iex> %Snowball.Runtime{cursor: 3} = Snowball.Runtime.eq_s(state, "run")
iex> Snowball.Runtime.eq_s(state, "xyz")
:fail

eq_s_b(state, string)

@spec eq_s_b(t(), binary()) :: result()

Test whether the buffer immediately before the cursor ends with string; on success retreat the cursor by byte_size(string).

Mirrors eq_s_b in the canonical runtime — used inside backwards blocks where the scan moves right-to-left.

Arguments

  • state is a t/0.

  • string is a UTF-8 binary to match ending at the cursor.

Returns

  • The updated state with cursor retreated by byte_size(string) on match.

  • :fail if the bytes before the cursor do not equal string, or if the match would cross limit_backward.

Examples

iex> state = Snowball.Runtime.new("running")
iex> state = %{state | cursor: 7}
iex> %Snowball.Runtime{cursor: 4} = Snowball.Runtime.eq_s_b(state, "ing")
iex> Snowball.Runtime.eq_s_b(state, "xyz")
:fail
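The backward match can be pictured as a byte comparison on the range that ends at the cursor, guarded by limit_backward. A minimal sketch under that reading follows. `BackwardMatchSketch` is hypothetical and works on a plain map rather than %Runtime{}.

```elixir
defmodule BackwardMatchSketch do
  def eq_s_b(%{current: buf, cursor: c, limit_backward: lb} = state, string) do
    size = byte_size(string)
    start = c - size

    # The limit check comes first, so binary_part/3 is never called with a
    # start position below limit_backward (or below zero).
    if start >= lb and binary_part(buf, start, size) == string do
      %{state | cursor: start}
    else
      :fail
    end
  end
end

state = %{current: "running", cursor: 7, limit_backward: 0}
%{cursor: 4} = BackwardMatchSketch.eq_s_b(state, "ing")

# With the backward limit raised to 5, the same match would cross it.
:fail = BackwardMatchSketch.eq_s_b(%{state | limit_backward: 5}, "ing")
```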

find_among(state, entries)

@spec find_among(t(), [among_entry()]) :: {t(), integer()} | fail()

Forward among (...) dispatcher. Performs a binary search over entries looking for the longest match at the cursor; on success, advances the cursor past the match and returns {state, result} where result is the matched entry's result value.

Arguments

  • state is a t/0.

  • entries is a list of among_entry/0 tuples, sorted lexicographically by string.

Returns

  • {updated_state, result} on a successful match, with the cursor advanced past the matched string.

  • :fail if no entry matches, or if a matched entry's filter function fails and there is no substring_i fallback.

Examples

iex> entries = [{"ing", -1, 1, nil}, {"ly", -1, 2, nil}]
iex> state = Snowball.Runtime.new("running")
iex> state = %{state | cursor: 4}
iex> {%Snowball.Runtime{cursor: 7}, 1} = Snowball.Runtime.find_among(state, entries)
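The longest-match behaviour can be illustrated with a simplified dispatcher. Note the simplifications: the sketch below uses a linear scan instead of a binary search, and it ignores filter functions and the substring_i fallback. `AmongSketch` is a hypothetical module, and the two-entry table is made up for the example.

```elixir
defmodule AmongSketch do
  def find_among(%{current: buf, cursor: c, limit: limit} = state, entries) do
    # Keep only the entries whose string matches at the cursor within the limit.
    matches =
      Enum.filter(entries, fn {string, _substring_i, _result, _fun} ->
        size = byte_size(string)
        c + size <= limit and binary_part(buf, c, size) == string
      end)

    case matches do
      [] ->
        :fail

      _ ->
        # Among the candidates, the longest string wins.
        {string, _, result, _} = Enum.max_by(matches, fn {s, _, _, _} -> byte_size(s) end)
        {%{state | cursor: c + byte_size(string)}, result}
    end
  end
end

# "e" (index 0) is a prefix of "ed", so the second entry's substring_i is 0.
entries = [{"e", -1, 1, nil}, {"ed", 0, 2, nil}]
state = %{current: "edge", cursor: 0, limit: 4}

{%{cursor: 2}, 2} = AmongSketch.find_among(state, entries)
:fail = AmongSketch.find_among(%{state | cursor: 2}, entries)
```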

find_among_b(state, entries)

@spec find_among_b(t(), [among_entry()]) :: {t(), integer()} | fail()

Backward among (...) dispatcher.

Same shape as find_among/2 but searches before the cursor. On success, retreats the cursor by byte_size(matched_string).

Arguments

  • state is a t/0.

  • entries is a list of among_entry/0 tuples.

Returns

  • {updated_state, result} on success.

  • :fail on no match.

Examples

iex> entries = [{"ing", -1, 1, nil}, {"run", -1, 2, nil}]
iex> state = Snowball.Runtime.new("running")
iex> state = %{state | cursor: 7, limit_backward: 0}
iex> {%Snowball.Runtime{cursor: 4}, 1} = Snowball.Runtime.find_among_b(state, entries)

go_in_grouping(state, grouping)

@spec go_in_grouping(t(), {integer(), binary(), integer()}) :: result()

Scan forward while the codepoint at the cursor is in grouping.

Used for goto-style commands that consume a run of grouping members. Mirrors go_in_grouping in the canonical runtime.

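Arguments

  • state is a t/0.

  • grouping is a {min_codepoint, bits, max_codepoint} tuple, as described for in_grouping/2.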

Returns

  • The state with cursor advanced past the run, when at least one non-member codepoint is found before the limit.

  • :fail if the entire remainder up to limit is in the grouping.

Examples

iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("aeibc")
iex> match?(%Snowball.Runtime{cursor: 3}, Snowball.Runtime.go_in_grouping(state, g))
true
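The scan described above amounts to a loop that steps one codepoint at a time and stops at the first non-member. The sketch below follows that description. `ScanSketch` is hypothetical, and a plain predicate function stands in for the grouping tuple so that the example stays self-contained.

```elixir
defmodule ScanSketch do
  # member_fun is any function from a codepoint to a boolean.
  def go_in(%{current: buf, cursor: c, limit: limit} = state, member_fun) when c < limit do
    # Decode the codepoint that starts at byte offset c.
    <<_::binary-size(c), cp::utf8, _::binary>> = buf

    if member_fun.(cp) do
      # Step over the codepoint by its encoded size in bytes.
      go_in(%{state | cursor: c + byte_size(<<cp::utf8>>)}, member_fun)
    else
      state
    end
  end

  # Reaching the limit without meeting a non-member is a failure.
  def go_in(_state, _member_fun), do: :fail
end

vowel? = fn cp -> cp in [?a, ?e, ?i, ?o, ?u] end

%{cursor: 3} = ScanSketch.go_in(%{current: "aeibc", cursor: 0, limit: 5}, vowel?)
:fail = ScanSketch.go_in(%{current: "aei", cursor: 0, limit: 3}, vowel?)
```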

go_in_grouping_b(state, grouping)

@spec go_in_grouping_b(t(), {integer(), binary(), integer()}) :: result()

Backward variant of go_in_grouping/2: scan backward while the codepoint before the cursor is in grouping.

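Arguments

  • state is a t/0.

  • grouping is a {min_codepoint, bits, max_codepoint} tuple, as described for in_grouping/2.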

Returns

  • The state with cursor retreated past the run, when at least one non-member codepoint is found at or above limit_backward.

  • :fail if all codepoints back to limit_backward are in the grouping.

Examples

iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("bcaei")
iex> state = %{state | cursor: 5, limit_backward: 0}
iex> match?(%Snowball.Runtime{cursor: 2}, Snowball.Runtime.go_in_grouping_b(state, g))
true

go_out_grouping(state, grouping)

@spec go_out_grouping(t(), {integer(), binary(), integer()}) :: result()

Scan forward while the codepoint at the cursor is not in grouping.

Mirrors go_out_grouping in the canonical runtime — finds the next grouping member.

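Arguments

  • state is a t/0.

  • grouping is a {min_codepoint, bits, max_codepoint} tuple, as described for in_grouping/2.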

Returns

  • The state with cursor pointing at a grouping member, when one is found before the limit.

  • :fail if no grouping member is found up to the limit.

Examples

iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("bce")
iex> match?(%Snowball.Runtime{cursor: 2}, Snowball.Runtime.go_out_grouping(state, g))
true

go_out_grouping_b(state, grouping)

@spec go_out_grouping_b(t(), {integer(), binary(), integer()}) :: result()

Backward variant of go_out_grouping/2: scan backward while the codepoint before the cursor is not in grouping.

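Arguments

  • state is a t/0.

  • grouping is a {min_codepoint, bits, max_codepoint} tuple, as described for in_grouping/2.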

Returns

  • The state with cursor retreated so that it sits immediately after a grouping member, when one is found at or above limit_backward.

  • :fail if no grouping member is found.

Examples

iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("aeibc")
iex> state = %{state | cursor: 5, limit_backward: 0}
iex> match?(%Snowball.Runtime{cursor: 3}, Snowball.Runtime.go_out_grouping_b(state, g))
true

in_grouping(state, grouping)

@spec in_grouping(t(), {integer(), binary(), integer()}) :: result()

Test whether the codepoint at the cursor is a member of grouping; on success, advance the cursor past that codepoint.

Arguments

  • state is a t/0.

  • grouping is a {min_codepoint, bits, max_codepoint} tuple where bits is a binary bit-table indexed by codepoint - min_codepoint.

Returns

  • The state with cursor advanced past the codepoint on a successful match.

  • :fail if the cursor is at the limit, or if the codepoint at the cursor is outside the grouping range or has its bit unset.

Examples

iex> # Grouping {97, <<0b00000101>>, 100} = {a,c} (bits 0,2 set: 97,99)
iex> grouping = {97, <<0b00000101>>, 100}
iex> state = Snowball.Runtime.new("abc")
iex> %Snowball.Runtime{cursor: 1} = Snowball.Runtime.in_grouping(state, grouping)
iex> Snowball.Runtime.in_grouping(%{state | cursor: 1}, grouping)
:fail
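The membership test on a grouping tuple can be sketched as a bit lookup. The bit order used here (bit 0 of each byte is the lowest codepoint) is an assumption inferred from the example above, where <<0b00000101>> marks codepoints 97 and 99. `GroupingSketch` is a hypothetical module, not the library's code.

```elixir
defmodule GroupingSketch do
  import Bitwise

  # Codepoints inside [min, max] are looked up in the bit table.
  def member?({min, bits, max}, cp) when cp >= min and cp <= max do
    offset = cp - min
    index = div(offset, 8)

    # Assumed layout: byte `index`, bit `rem(offset, 8)` counted from the least significant bit.
    index < byte_size(bits) and band(:binary.at(bits, index), bsl(1, rem(offset, 8))) != 0
  end

  # Codepoints outside the range are never members.
  def member?(_grouping, _cp), do: false
end

grouping = {97, <<0b00000101>>, 100}

true = GroupingSketch.member?(grouping, ?a)
false = GroupingSketch.member?(grouping, ?b)
true = GroupingSketch.member?(grouping, ?c)
false = GroupingSketch.member?(grouping, ?z)
```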

in_grouping_b(state, grouping)

@spec in_grouping_b(t(), {integer(), binary(), integer()}) :: result()

Backward variant of in_grouping/2: test the codepoint immediately before the cursor; on success retreat past it.

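Arguments

  • state is a t/0.

  • grouping is a {min_codepoint, bits, max_codepoint} tuple, as described for in_grouping/2.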

Returns

  • The updated state with cursor retreated by the codepoint's byte size.

  • :fail if the codepoint is not in the grouping, or the cursor is at limit_backward.

Examples

iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("running")
iex> state = %{state | cursor: 2, limit_backward: 0}
iex> match?(%Snowball.Runtime{cursor: 1}, Snowball.Runtime.in_grouping_b(state, g))
true
iex> state2 = %{state | cursor: 7}
iex> Snowball.Runtime.in_grouping_b(state2, g)
:fail

insert(state, c_bra, c_ket, string)

@spec insert(t(), non_neg_integer(), non_neg_integer(), binary()) :: t()

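Insert string in place of the byte range [c_bra, c_ket), adjusting bra and ket if they fall after c_bra. When c_bra equals c_ket nothing is removed; otherwise the bytes in the range are replaced, as in the example below.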

Mirrors insert_s / insert in the canonical runtime. Used by insert and attach Snowball commands.

Arguments

  • state is a t/0.

  • c_bra is the inclusive start of the range (byte offset).

  • c_ket is the exclusive end of the range (byte offset).

  • string is the UTF-8 binary to insert.

Returns

  • The updated state.

Examples

iex> state = Snowball.Runtime.new("abcdef")
iex> %Snowball.Runtime{current: "abXYZdef"} = Snowball.Runtime.insert(state, 2, 3, "XYZ")

new(word)

@spec new(binary()) :: t()

Create a new stemmer state for the given input word.

Arguments

  • word is the input word as a UTF-8 binary.

Returns

  • A t/0 initialised with cursor at the start, limit at the end, limit_backward at the start, and bra / ket at cursor / limit respectively (matching BaseStemmer.set_current in canonical Snowball).

Examples

iex> Snowball.Runtime.new("running")
%Snowball.Runtime{current: "running", cursor: 0, limit: 7, limit_backward: 0, bra: 0, ket: 7, vars: %{}}

out_grouping(state, grouping)

@spec out_grouping(t(), {integer(), binary(), integer()}) :: result()

Test whether the codepoint at the cursor is not a member of grouping; on success, advance the cursor past that codepoint.

Arguments

  • state is a t/0.

  • grouping is a {min_codepoint, bits, max_codepoint} tuple.

Returns

  • The state with cursor advanced past the codepoint on a successful non-match.

  • :fail at limit or when the codepoint is in the grouping.

Examples

iex> grouping = {97, <<0b00010101>>, 101}
iex> state = Snowball.Runtime.new("bace")
iex> %Snowball.Runtime{cursor: 1} = Snowball.Runtime.out_grouping(state, grouping)

out_grouping_b(state, grouping)

@spec out_grouping_b(t(), {integer(), binary(), integer()}) :: result()

Backward variant of out_grouping/2: test the codepoint immediately before the cursor; on success retreat past it.

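Arguments

  • state is a t/0.

  • grouping is a {min_codepoint, bits, max_codepoint} tuple, as described for out_grouping/2.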

Returns

  • The updated state with cursor retreated by the codepoint's byte size.

  • :fail if the codepoint is in the grouping, or the cursor is at limit_backward.

Examples

iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("running")
iex> state = %{state | cursor: 7, limit_backward: 0}
iex> match?(%Snowball.Runtime{cursor: 6}, Snowball.Runtime.out_grouping_b(state, g))
true
iex> state2 = %{state | cursor: 2}
iex> Snowball.Runtime.out_grouping_b(state2, g)
:fail

replace_s(state, c_bra, c_ket, string)

@spec replace_s(t(), non_neg_integer(), non_neg_integer(), binary()) ::
  {t(), integer()}

Replace the byte range [c_bra, c_ket) in the buffer with string, adjusting cursor and limit accordingly.

Mirrors replace_s in the canonical runtime. Returns the updated state along with the size adjustment (positive if the replacement grew the buffer, negative if it shrank).

Arguments

  • state is a t/0.

  • c_bra is the inclusive start of the range to replace (byte offset).

  • c_ket is the exclusive end of the range (byte offset).

  • string is the UTF-8 binary replacement.

Returns

  • {state, adjustment} where adjustment is byte_size(string) - (c_ket - c_bra).

Examples

iex> state = Snowball.Runtime.new("abcdef")
iex> {%Snowball.Runtime{current: "abXYZdef", limit: 8}, 2} =
...>   Snowball.Runtime.replace_s(state, 2, 3, "XYZ")
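The splice and the size adjustment can be sketched with binary_part/3. The cursor rule below follows the one used by the Python BaseStemmer (shift the cursor when it is at or past c_ket, clamp it to c_bra when it falls inside the replaced range). Whether this module applies exactly the same rule is an assumption. `ReplaceSketch` is hypothetical and works on a plain map.

```elixir
defmodule ReplaceSketch do
  def replace_s(%{current: buf, cursor: cursor, limit: limit} = state, c_bra, c_ket, string) do
    adjustment = byte_size(string) - (c_ket - c_bra)

    # Everything before c_bra and everything from c_ket onward is kept.
    head = binary_part(buf, 0, c_bra)
    tail = binary_part(buf, c_ket, byte_size(buf) - c_ket)

    new_cursor =
      cond do
        cursor >= c_ket -> cursor + adjustment
        cursor > c_bra -> c_bra
        true -> cursor
      end

    new_state = %{
      state
      | current: head <> string <> tail,
        cursor: new_cursor,
        limit: limit + adjustment
    }

    {new_state, adjustment}
  end
end

state = %{current: "abcdef", cursor: 6, limit: 6}

{%{current: "abXYZdef", cursor: 8, limit: 8}, 2} =
  ReplaceSketch.replace_s(state, 2, 3, "XYZ")
```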

slice_del(state)

@spec slice_del(t()) :: t()

Delete the marked slice [bra, ket).

Equivalent to slice_from(state, ""). Mirrors slice_del in the canonical runtime.

Arguments

  • state is a t/0.

Returns

  • The updated state with [bra, ket) removed.

Examples

iex> state = Snowball.Runtime.new("running")
iex> state = %{state | bra: 3, ket: 7}
iex> %Snowball.Runtime{current: "run", limit: 3} = Snowball.Runtime.slice_del(state)

slice_from(state, string)

@spec slice_from(t(), binary()) :: t()

Replace the marked slice [bra, ket) with string.

Mirrors slice_from in the canonical runtime. After replacement, ket is moved to bra + byte_size(string).

Arguments

  • state is a t/0.

  • string is the UTF-8 binary replacement.

Returns

  • The updated state.

Examples

iex> state = Snowball.Runtime.new("running")
iex> state = %{state | bra: 4, ket: 7}
iex> %Snowball.Runtime{current: "runneed"} = Snowball.Runtime.slice_from(state, "eed")
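The relationship between the slice marks and the replacement can be sketched as follows. Only the buffer, ket and limit updates are modelled, and cursor adjustment is left out for brevity. `SliceSketch` is a hypothetical module operating on a plain map, not the library's implementation.

```elixir
defmodule SliceSketch do
  def slice_from(%{current: buf, bra: bra, ket: ket, limit: limit} = state, string) do
    head = binary_part(buf, 0, bra)
    tail = binary_part(buf, ket, byte_size(buf) - ket)
    adjustment = byte_size(string) - (ket - bra)

    # After the replacement, ket marks the end of the new text.
    %{
      state
      | current: head <> string <> tail,
        ket: bra + byte_size(string),
        limit: limit + adjustment
    }
  end

  # Deleting the slice is the same as replacing it with the empty string.
  def slice_del(state), do: slice_from(state, "")
end

state = %{current: "running", bra: 4, ket: 7, limit: 7}
%{current: "runneed", ket: 7, limit: 7} = SliceSketch.slice_from(state, "eed")
%{current: "run", ket: 3, limit: 3} = SliceSketch.slice_del(%{state | bra: 3})
```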

slice_to(runtime)

@spec slice_to(t()) :: binary()

Return the slice between bra and ket.

Arguments

  • runtime is a t/0 (the stemmer state).

Returns

  • A UTF-8 binary containing the bytes from bra (inclusive) to ket (exclusive).

Examples

iex> state = Snowball.Runtime.new("abcdef")
iex> state = %{state | bra: 1, ket: 4}
iex> Snowball.Runtime.slice_to(state)
"bcd"