Runtime state and primitive operations for Snowball stemmers.
This is the Elixir analogue of the canonical BaseStemmer class found in
the Python and JavaScript snowball runtimes. Generated stemmer modules
call into the functions on this module to manipulate cursor position,
test character groupings, dispatch to suffix tables (among), and
perform slice replacements.
State model
Snowball is conceptually mutable — every command moves the cursor or
rewrites the buffer. In Elixir we thread an immutable %Runtime{}
through every primitive. Each primitive returns either the updated
struct (success) or the atom :fail (failure). Generated code uses
pattern matches like:
    case Runtime.eq_s(state, "ing") do
      :fail -> :fail
      state -> state
    end

to compose primitives.
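The same idea extends to a chain of primitives. The following is a minimal sketch, not the library's code: `ThreadingSketch.eq_s/2` is a hypothetical stand-in that works on a plain map with `:current`, `:cursor` and `:limit` keys rather than the real %Runtime{} struct. It shows how `with` can thread the state forward and let the first :fail fall through unchanged.

```elixir
defmodule ThreadingSketch do
  # Hypothetical stand-in for Runtime.eq_s/2, operating on a plain map
  # instead of the real %Runtime{} struct.
  def eq_s(%{current: buf, cursor: c, limit: limit} = state, string) do
    size = byte_size(string)

    # `and` short-circuits, so binary_part/3 is only called when in range.
    if c + size <= limit and binary_part(buf, c, size) == string do
      %{state | cursor: c + size}
    else
      :fail
    end
  end

  # Match "run" and then "ning" in sequence. A :fail from either step does
  # not match the `%{} = state` pattern, so `with` returns it as-is.
  def run_then_ning(state) do
    with %{} = state <- eq_s(state, "run"),
         %{} = state <- eq_s(state, "ning") do
      state
    end
  end
end

state = %{current: "running", cursor: 0, limit: 7}
ThreadingSketch.run_then_ning(state)
# the cursor ends at 7 for "running"; a word such as "runs" yields :fail
```

Whether generated stemmers use `case` or `with` is a code-generation choice; both rely only on the "struct or :fail" return contract described above.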
Cursor units
All cursor positions and limits are byte offsets into the UTF-8
buffer. This matches the canonical -utf8 mode of the Snowball
reference compiler and lets eq_s/2 perform a direct byte-level
prefix match (UTF-8 is self-synchronising, so byte positions land on
codepoint boundaries provided cursor moves are made via the public
primitives).
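To make the distinction between byte offsets and character counts concrete, here is a small sketch using only standard Elixir functions. The word "café" is an illustrative choice and does not come from the library.

```elixir
# "café", written with an explicit escape so that "é" is the single
# codepoint U+00E9 (two bytes in UTF-8).
word = "caf\u00E9"

byte_size(word)      # 5 bytes
String.length(word)  # 4 graphemes

# A cursor placed after "caf" is byte offset 3, a codepoint boundary.
prefix = binary_part(word, 0, 3)
rest = binary_part(word, 3, byte_size(word) - 3)

# A direct byte-level prefix match, followed by decoding the next codepoint.
<<"caf", next::utf8, _tail::binary>> = word
next == 0xE9
```

A cursor at byte offset 4 would fall in the middle of "é", which is why cursor moves are meant to go through the public primitives.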
State fields
- current: the working buffer (UTF-8 binary).
- cursor: the current scan position (byte offset).
- limit: the forward limit (byte offset, exclusive).
- limit_backward: the backward limit (byte offset, inclusive lower bound).
- bra and ket: the slice marks set by the [ and ] Snowball commands. Replacement and deletion operate on the [bra, ket) range.
Types
An entry in an among table:
{string, substring_i, result, function_or_nil}
- string: the literal to match (UTF-8 binary).
- substring_i: index in the entries list of the longest entry that is a prefix of this one, or -1 if none.
- result: non-zero integer returned on a successful match.
- function_or_nil: optional filter function (t() -> t() | :fail).
@type fail() :: :fail
A failure return from a primitive. Generated code propagates :fail
upward until a combinator such as try or or catches it.
The result of a Snowball primitive that may succeed or fail.
@type t() :: %Snowball.Runtime{
  bra: non_neg_integer(),
  current: binary(),
  cursor: non_neg_integer(),
  ket: non_neg_integer(),
  limit: non_neg_integer(),
  limit_backward: non_neg_integer(),
  vars: %{optional(atom()) => term()}
}
Functions
Return the buffer contents up to the current limit.
This is the canonical assign_to operation — used as the final
result extractor at the end of a stem.
Arguments
- state is a t/0.
Returns
- A UTF-8 binary containing the buffer from byte 0 up to limit.
Examples
iex> "hello" |> Snowball.Runtime.new() |> Snowball.Runtime.assign_to()
"hello"
@spec codepoint_length(binary()) :: non_neg_integer()
Count the number of Unicode codepoints in a UTF-8 binary.
Snowball's len builtin counts codepoints, not grapheme clusters.
Elixir's String.length/1 counts grapheme clusters, which differs for
scripts that combine base characters with combining marks (Tamil,
Hindi, Arabic, etc.). This function correctly counts codepoints by
counting lead bytes and single-byte characters in the UTF-8 encoding.
Arguments
- string is a UTF-8 binary.
Returns
- The number of Unicode codepoints in string.
Examples
iex> Snowball.Runtime.codepoint_length("hello")
5
iex> Snowball.Runtime.codepoint_length("ஞ்சா")
4
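The counting rule described above can be sketched as follows. `CodepointCountSketch` is a hypothetical module written for illustration and is not the library's implementation. It assumes well-formed UTF-8 input and counts every byte that is not a continuation byte (bit pattern 10xxxxxx).

```elixir
defmodule CodepointCountSketch do
  # Each codepoint contributes exactly one byte that is not a continuation
  # byte, so counting those bytes counts codepoints in well-formed UTF-8.
  def count(binary) when is_binary(binary), do: count(binary, 0)

  defp count(<<>>, acc), do: acc

  # Continuation byte (top two bits are 10): skip without counting.
  defp count(<<0b10::2, _::6, rest::binary>>, acc), do: count(rest, acc)

  # Lead byte or single-byte character: count it.
  defp count(<<_byte, rest::binary>>, acc), do: count(rest, acc + 1)
end

CodepointCountSketch.count("hello")
# 5
```

For well-formed input this agrees with `length(String.codepoints(string))`, without building an intermediate list.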
Test whether the buffer at the cursor begins with string; on success
advance the cursor past the match.
Mirrors eq_s in the canonical runtime. string is matched as raw
UTF-8 bytes — this is sound because Snowball literals are always
whole codepoint sequences.
Arguments
- state is a t/0.
- string is a UTF-8 binary to match at the cursor.
Returns
- The updated state with cursor advanced by byte_size(string) on match.
- :fail if the buffer at the cursor does not start with string, or if the match would cross the forward limit.
Examples
iex> state = Snowball.Runtime.new("running")
iex> %Snowball.Runtime{cursor: 3} = Snowball.Runtime.eq_s(state, "run")
iex> Snowball.Runtime.eq_s(state, "xyz")
:fail
Test whether the buffer immediately before the cursor ends with
string; on success retreat the cursor by byte_size(string).
Mirrors eq_s_b in the canonical runtime — used inside backwards
blocks where the scan moves right-to-left.
Arguments
- state is a t/0.
- string is a UTF-8 binary to match ending at the cursor.
Returns
- The updated state with cursor retreated by byte_size(string) on match.
- :fail if the bytes before the cursor do not equal string, or if the match would cross limit_backward.
Examples
iex> state = Snowball.Runtime.new("running")
iex> state = %{state | cursor: 7}
iex> %Snowball.Runtime{cursor: 4} = Snowball.Runtime.eq_s_b(state, "ing")
iex> Snowball.Runtime.eq_s_b(state, "xyz")
:fail
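A backward match of this kind can be sketched in a few lines. `BackwardMatchSketch.eq_s_b/2` below is a hypothetical stand-in working on a plain map with `:current`, `:cursor` and `:limit_backward` keys. It is meant to illustrate the bounds check against limit_backward, not to reproduce the library's code.

```elixir
defmodule BackwardMatchSketch do
  # Hypothetical stand-in for a backward literal match on a plain map.
  def eq_s_b(%{current: buf, cursor: c, limit_backward: lb} = state, string) do
    size = byte_size(string)
    start = c - size

    # The candidate range [start, c) must not reach below limit_backward.
    # `and` short-circuits, so binary_part/3 never sees a negative start
    # as long as limit_backward is non-negative.
    if start >= lb and binary_part(buf, start, size) == string do
      %{state | cursor: start}
    else
      :fail
    end
  end
end

state = %{current: "running", cursor: 7, limit_backward: 0}
BackwardMatchSketch.eq_s_b(state, "ing")
# the cursor retreats from 7 to 4
```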
@spec find_among(t(), [among_entry()]) :: {t(), integer()} | fail()
Forward among (...) dispatcher. Performs a binary search over
entries looking for the longest match at the cursor; on success,
advances the cursor past the match and returns {state, result}
where result is the matched entry's result value.
Arguments
- state is a t/0.
- entries is a list of among_entry/0 tuples, sorted lexicographically by string.
Returns
- {updated_state, result} on a successful match (the cursor is advanced).
- :fail if no entry matches, or if a matched entry's filter function fails and there is no substring_i fallback.
Examples
iex> entries = [{"ing", -1, 1, nil}, {"ly", -1, 2, nil}]
iex> state = Snowball.Runtime.new("running")
iex> state = %{state | cursor: 4}
iex> {%Snowball.Runtime{cursor: 7}, 1} = Snowball.Runtime.find_among(state, entries)
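The longest-match behaviour can be illustrated with a deliberately simplified sketch. Note the substitutions: `AmongSketch.find_among/2` is hypothetical, it uses a linear scan rather than the binary search described above, it works on a plain map, and it ignores both substring_i and the optional filter function.

```elixir
defmodule AmongSketch do
  # Simplified longest-match lookup: linear scan, no substring_i fallback,
  # no filter functions. Entries use the documented 4-tuple shape.
  def find_among(%{current: buf, cursor: c, limit: limit} = state, entries) do
    best =
      Enum.reduce(entries, nil, fn {string, _sub, result, _fun}, best ->
        size = byte_size(string)
        matches? = c + size <= limit and binary_part(buf, c, size) == string

        cond do
          not matches? -> best
          best == nil -> {size, result}
          size > elem(best, 0) -> {size, result}
          true -> best
        end
      end)

    case best do
      nil -> :fail
      {size, result} -> {%{state | cursor: c + size}, result}
    end
  end
end

entries = [{"i", -1, 1, nil}, {"ing", 0, 2, nil}]
state = %{current: "running", cursor: 4, limit: 7}
AmongSketch.find_among(state, entries)
# both "i" and "ing" match at byte 4; the longer entry wins, so the
# result is 2 and the cursor moves to 7
```

A linear scan is enough to show the contract; a sorted table with binary search reaches the same answer in fewer comparisons on large tables.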
@spec find_among_b(t(), [among_entry()]) :: {t(), integer()} | fail()
Backward among (...) dispatcher.
Same shape as find_among/2 but searches before the cursor. On
success, retreats the cursor by byte_size(matched_string).
Arguments
- state is a t/0.
- entries is a sorted list of among_entry/0 tuples.
Returns
- {updated_state, result} on success.
- :fail on no match.
Examples
iex> entries = [{"ing", -1, 1, nil}, {"run", -1, 2, nil}]
iex> state = Snowball.Runtime.new("running")
iex> state = %{state | cursor: 7, limit_backward: 0}
iex> {%Snowball.Runtime{cursor: 4}, 1} = Snowball.Runtime.find_among_b(state, entries)
Scan forward while the codepoint at the cursor is in grouping.
Used for goto-style commands that consume a run of grouping
members. Mirrors go_in_grouping in the canonical runtime.
Arguments
- state is a t/0.
- grouping is a {min_cp, bits, max_cp} table from Snowball.Grouping.
Returns
- The state with cursor advanced past the run, when at least one non-member codepoint is found before the limit.
- :fail if the entire remainder up to limit is in the grouping.
Examples
iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("aeibc")
iex> match?(%Snowball.Runtime{cursor: 3}, Snowball.Runtime.go_in_grouping(state, g))
true
Backward variant of go_in_grouping/2: scan backward while the
codepoint before the cursor is in grouping.
Arguments
- state is a t/0.
- grouping is a {min_cp, bits, max_cp} table from Snowball.Grouping.
Returns
- The state with cursor retreated past the run, when at least one non-member codepoint is found at or above limit_backward.
- :fail if all codepoints back to limit_backward are in the grouping.
Examples
iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("bcaei")
iex> state = %{state | cursor: 5, limit_backward: 0}
iex> match?(%Snowball.Runtime{cursor: 2}, Snowball.Runtime.go_in_grouping_b(state, g))
true
Scan forward while the codepoint at the cursor is not in
grouping.
Mirrors go_out_grouping in the canonical runtime — finds the next
grouping member.
Arguments
- state is a t/0.
- grouping is a {min_cp, bits, max_cp} table from Snowball.Grouping.
Returns
- The state with cursor pointing at a grouping member, when one is found before the limit.
- :fail if no grouping member is found up to the limit.
Examples
iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("bce")
iex> match?(%Snowball.Runtime{cursor: 2}, Snowball.Runtime.go_out_grouping(state, g))
true
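The scan can be sketched as a codepoint-by-codepoint walk over the buffer. In this hypothetical `ScanSketch` module the grouping is replaced by an arbitrary one-argument predicate on codepoints, and the state is a plain map. Both are assumptions made to keep the example self-contained, and the sketch expects well-formed UTF-8 at the cursor.

```elixir
defmodule ScanSketch do
  # `member?` is any predicate on codepoints, standing in for a real
  # grouping table. Returns the state with the cursor at the first member,
  # or :fail if the limit is reached first.
  def go_out_grouping(%{current: buf, cursor: c, limit: limit} = state, member?) do
    scan(buf, c, limit, member?, state)
  end

  defp scan(_buf, c, limit, _member?, _state) when c >= limit, do: :fail

  defp scan(buf, c, limit, member?, state) do
    # Decode the codepoint that starts at byte offset c.
    <<_::binary-size(c), cp::utf8, _::binary>> = buf

    if member?.(cp) do
      %{state | cursor: c}
    else
      # Advance by the encoded width of the codepoint just inspected.
      scan(buf, c + byte_size(<<cp::utf8>>), limit, member?, state)
    end
  end
end

vowel? = fn cp -> cp in [?a, ?e, ?i, ?o, ?u] end
ScanSketch.go_out_grouping(%{current: "bce", cursor: 0, limit: 3}, vowel?)
# the cursor stops at byte 2, pointing at "e"
```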
Backward variant of go_out_grouping/2: scan backward while the
codepoint before the cursor is not in grouping.
Arguments
- state is a t/0.
- grouping is a {min_cp, bits, max_cp} table from Snowball.Grouping.
Returns
- The state with cursor retreated to point past a grouping member, when one is found at or above limit_backward.
- :fail if no grouping member is found.
Examples
iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("aeibc")
iex> state = %{state | cursor: 5, limit_backward: 0}
iex> match?(%Snowball.Runtime{cursor: 3}, Snowball.Runtime.go_out_grouping_b(state, g))
true
Test whether the codepoint at the cursor is a member of grouping;
on success, advance the cursor past that codepoint.
Arguments
- state is a t/0.
- grouping is a {min_codepoint, bits, max_codepoint} tuple where bits is a binary bit-table indexed by codepoint - min_codepoint.
Returns
- The state with cursor advanced past the codepoint on a successful match.
- :fail if the cursor is at the limit, or if the codepoint at the cursor is outside the grouping range or has its bit unset.
Examples
iex> # Grouping {97, <<0b00000101>>, 100} = {a,c} (bits 0,2 set: 97,99)
iex> grouping = {97, <<0b00000101>>, 100}
iex> state = Snowball.Runtime.new("abc")
iex> %Snowball.Runtime{cursor: 1} = Snowball.Runtime.in_grouping(state, grouping)
iex> Snowball.Runtime.in_grouping(%{state | cursor: 1}, grouping)
:fail
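The bit-table lookup itself can be sketched as below. `GroupingBitsSketch.member?/2` is a hypothetical helper. The bit order (least significant bit first within each byte) is inferred from the example above, where `<<0b00000101>>` with a minimum of 97 marks 97 and 99, and should be treated as an assumption about the real table layout.

```elixir
defmodule GroupingBitsSketch do
  import Bitwise

  # Membership test against a {min_cp, bits, max_cp} table. Bit i of the
  # table (LSB-first within each byte) corresponds to codepoint min_cp + i.
  def member?({min_cp, bits, max_cp}, cp) when cp >= min_cp and cp <= max_cp do
    index = cp - min_cp
    byte = :binary.at(bits, div(index, 8))
    ((byte >>> rem(index, 8)) &&& 1) == 1
  end

  # Codepoints outside [min_cp, max_cp] are never members.
  def member?(_grouping, _cp), do: false
end

grouping = {97, <<0b00000101>>, 100}
Enum.map(~c"abcd", &GroupingBitsSketch.member?(grouping, &1))
# [true, false, true, false]
```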
Backward variant of in_grouping/2: test the codepoint immediately
before the cursor; on success retreat past it.
Arguments
- state is a t/0.
- grouping is a {min_cp, bits, max_cp} table from Snowball.Grouping.
Returns
- The updated state with cursor retreated by the codepoint's byte size.
- :fail if the codepoint is not in the grouping, or the cursor is at limit_backward.
Examples
iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("running")
iex> state = %{state | cursor: 2, limit_backward: 0}
iex> match?(%Snowball.Runtime{cursor: 1}, Snowball.Runtime.in_grouping_b(state, g))
true
iex> state2 = %{state | cursor: 7}
iex> Snowball.Runtime.in_grouping_b(state2, g)
:fail
@spec insert(t(), non_neg_integer(), non_neg_integer(), binary()) :: t()
Insert string at byte range [c_bra, c_ket), adjusting bra and
ket if they fall after c_bra.
Mirrors insert_s / insert in the canonical runtime. Used by
insert and attach Snowball commands.
Arguments
- state is a t/0.
- c_bra is the inclusive start of the range (byte offset).
- c_ket is the exclusive end of the range (byte offset).
- string is the UTF-8 binary to insert.
Returns
- The updated state.
Examples
iex> state = Snowball.Runtime.new("abcdef")
iex> %Snowball.Runtime{current: "abXYZdef"} = Snowball.Runtime.insert(state, 2, 3, "XYZ")
Create a new stemmer state for the given input word.
Arguments
- word is the input word as a UTF-8 binary.
Returns
- A t/0 initialised with cursor at the start, limit at the end, limit_backward at the start, and bra/ket at cursor / limit respectively (matching BaseStemmer.set_current in canonical Snowball).
Examples
iex> Snowball.Runtime.new("running")
%Snowball.Runtime{current: "running", cursor: 0, limit: 7, limit_backward: 0, bra: 0, ket: 7, vars: %{}}
Test whether the codepoint at the cursor is not a member of
grouping; on success, advance the cursor past that codepoint.
Arguments
- state is a t/0.
- grouping is a {min_codepoint, bits, max_codepoint} tuple.
Returns
- The state with cursor advanced past the codepoint on a successful non-match.
- :fail at limit or when the codepoint is in the grouping.
Examples
iex> grouping = {97, <<0b00010101>>, 101}
iex> state = Snowball.Runtime.new("bace")
iex> %Snowball.Runtime{cursor: 1} = Snowball.Runtime.out_grouping(state, grouping)
Backward variant of out_grouping/2: test the codepoint immediately
before the cursor; on success retreat past it.
Arguments
- state is a t/0.
- grouping is a {min_cp, bits, max_cp} table from Snowball.Grouping.
Returns
- The updated state with cursor retreated by the codepoint's byte size.
- :fail if the codepoint is in the grouping, or the cursor is at limit_backward.
Examples
iex> g = Snowball.Grouping.from_string("aeiou")
iex> state = Snowball.Runtime.new("running")
iex> state = %{state | cursor: 7, limit_backward: 0}
iex> match?(%Snowball.Runtime{cursor: 6}, Snowball.Runtime.out_grouping_b(state, g))
true
iex> state2 = %{state | cursor: 2}
iex> Snowball.Runtime.out_grouping_b(state2, g)
:fail
@spec replace_s(t(), non_neg_integer(), non_neg_integer(), binary()) :: {t(), integer()}
Replace the byte range [c_bra, c_ket) in the buffer with string,
adjusting cursor and limit accordingly.
Mirrors replace_s in the canonical runtime. Returns the updated
state along with the size adjustment (positive if the replacement
grew the buffer, negative if it shrank).
Arguments
- state is a t/0.
- c_bra is the inclusive start of the range to replace (byte offset).
- c_ket is the exclusive end of the range (byte offset).
- string is the UTF-8 binary replacement.
Returns
- {state, adjustment} where adjustment is byte_size(string) - (c_ket - c_bra).
Examples
iex> state = Snowball.Runtime.new("abcdef")
iex> {%Snowball.Runtime{current: "abXYZdef", limit: 8}, 2} =
...> Snowball.Runtime.replace_s(state, 2, 3, "XYZ")
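The adjustment arithmetic can be sketched on a plain map. `ReplaceSketch.replace_s/4` is hypothetical: the buffer splice and the adjustment formula follow the description above, while the cursor rule (shift the cursor if it sits at or after c_ket, clamp it to c_bra if it sits inside the range) is an assumption modelled on the canonical runtimes and may differ from this module's behaviour. The sketch leaves bra and ket out entirely.

```elixir
defmodule ReplaceSketch do
  # Replace bytes [c_bra, c_ket) with `string` and report the size change.
  def replace_s(%{current: buf, cursor: cursor, limit: limit} = state, c_bra, c_ket, string) do
    adjustment = byte_size(string) - (c_ket - c_bra)

    head = binary_part(buf, 0, c_bra)
    tail = binary_part(buf, c_ket, byte_size(buf) - c_ket)

    # Assumed cursor rule, modelled on the canonical runtimes.
    cursor =
      cond do
        cursor >= c_ket -> cursor + adjustment
        cursor > c_bra -> c_bra
        true -> cursor
      end

    new_state = %{state | current: head <> string <> tail, cursor: cursor, limit: limit + adjustment}
    {new_state, adjustment}
  end
end

state = %{current: "abcdef", cursor: 6, limit: 6}
ReplaceSketch.replace_s(state, 2, 3, "XYZ")
# one byte is replaced by three, so the adjustment is 2, the buffer becomes
# "abXYZdef", and both limit and the end-of-buffer cursor move from 6 to 8
```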
Delete the marked slice [bra, ket).
Equivalent to slice_from(state, ""). Mirrors slice_del in the
canonical runtime.
Arguments
- state is a t/0.
Returns
- The updated state with [bra, ket) removed.
Examples
iex> state = Snowball.Runtime.new("running")
iex> state = %{state | bra: 3, ket: 7}
iex> %Snowball.Runtime{current: "run", limit: 3} = Snowball.Runtime.slice_del(state)
Replace the marked slice [bra, ket) with string.
Mirrors slice_from in the canonical runtime. After replacement,
ket is moved to bra + byte_size(string).
Arguments
- state is a t/0.
- string is the UTF-8 binary replacement.
Returns
- The updated state.
Examples
iex> state = Snowball.Runtime.new("running")
iex> state = %{state | bra: 4, ket: 7}
iex> %Snowball.Runtime{current: "runneed"} = Snowball.Runtime.slice_from(state, "eed")
Return the slice between bra and ket.
Arguments
- state is a t/0.
Returns
- A UTF-8 binary containing the bytes from bra (inclusive) to ket (exclusive).
Examples
iex> state = Snowball.Runtime.new("abcdef")
iex> state = %{state | bra: 1, ket: 4}
iex> Snowball.Runtime.slice_to(state)
"bcd"