Pure, stateless tokenizer for VT/ANSI escape sequences.
Takes a binary and returns a list of typed tokens plus a leftover binary
containing any trailing bytes that form an incomplete sequence. The leftover
is meant to be prepended to the next chunk of input (see Vtex.Input.Stream).
The tokenizer implements Paul Williams' ANSI parser state machine as faithfully
as a stateless tokenizer can. Within a control sequence it
reproduces the csi_entry / csi_param / csi_intermediate / csi_ignore
sub-states — parameter and intermediate bytes are collected into separate
buffers, and a malformed ordering drops into csi_ignore, which swallows the
rest of the sequence and emits nothing, exactly as the diagram specifies. It
also honours the diagram's "anywhere" transitions: an ESC restarts parsing,
CAN/SUB abort the sequence, other C0 controls are passed through (executed
in place) while the sequence continues, and DEL is ignored. It has no state
and no side effects, so it is fully testable with raw strings.
A handful of details still differ from the reference parser — see "Deviations from the reference parser" below.
Token types
{:text, binary()} # run of printable / control bytes
{:csi, params, intermediates, final} # ESC [ <params> <intermediates> <final>
{:ss3, byte()} # ESC O X
{:osc, payload :: binary()} # ESC ] ... ST
{:esc, byte()} # ESC <other>
{:invalid, binary()} # failed / rejected sequence
For :csi, params and intermediates are binaries and final is a byte.
For example ESC [ ? 2 5 h tokenizes to {:csi, "25", "?", ?h} — the private
marker ? lands in intermediates, the digits in params.
Truncated sequences are never emitted as tokens; they are returned as the leftover binary so the caller can buffer them until more bytes arrive.
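As a rough illustration of how a caller might consume these token shapes, the sketch below pattern-matches on each tuple. The module name TokenDemo, its functions, and the result tuples it returns are hypothetical and not part of Vtex; only the token shapes are taken from the list above.

```elixir
defmodule TokenDemo do
  # Hypothetical helper for illustration only; not part of Vtex.

  # Split a CSI params binary such as "1;5" into integers.
  # Empty fields become nil so the caller can apply its own defaults.
  def parse_params(""), do: []

  def parse_params(params) do
    params
    |> String.split(";")
    |> Enum.map(fn
      "" -> nil
      digits -> String.to_integer(digits)
    end)
  end

  # One clause per token shape listed above.
  def handle({:text, bin}), do: {:print, bin}

  def handle({:csi, params, intermediates, final}),
    do: {:control, <<final>>, intermediates, parse_params(params)}

  def handle({:ss3, byte}), do: {:ss3_key, <<byte>>}
  def handle({:osc, payload}), do: {:osc, payload}
  def handle({:esc, byte}), do: {:escape, <<byte>>}
  def handle({:invalid, raw}), do: {:ignored, byte_size(raw)}
end
```

With these definitions, TokenDemo.handle({:csi, "25", "?", ?h}) returns {:control, "h", "?", [25]}.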
Byte ranges
- CSI parameter bytes: 0x30..0x39 and ; (collected into params)
- CSI intermediate / private-marker bytes: 0x20..0x2F and 0x3C..0x3F (collected into intermediates)
- CSI final byte: 0x40..0x7E
- SS3: ESC O <byte> (always exactly three bytes)
- OSC: ESC ] <payload> ST, where ST is ESC \ or BEL (0x07)
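The CSI byte ranges listed above can be written as guard clauses. This is a minimal sketch for illustration, not the tokenizer's actual implementation; the module name CsiByteClass and the :other catch-all are assumptions.

```elixir
defmodule CsiByteClass do
  # Hypothetical helper for illustration only; not part of Vtex.

  # Digits 0x30..0x39 and the ';' separator are parameter bytes.
  def classify(b) when b in 0x30..0x39 or b == ?;, do: :param

  # 0x20..0x2F plus the private markers 0x3C..0x3F go to intermediates.
  def classify(b) when b in 0x20..0x2F or b in 0x3C..0x3F, do: :intermediate

  # 0x40..0x7E terminates the sequence.
  def classify(b) when b in 0x40..0x7E, do: :final

  # Everything else (including the colon, 0x3A) falls outside these ranges.
  def classify(_), do: :other
end
```

For the body of ESC [ ? 2 5 h, `for <<b <- "?25h">>, do: CsiByteClass.classify(b)` evaluates to [:intermediate, :param, :param, :final].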
DCS, APC, PM and SOS strings are recognised and immediately rejected
as {:invalid, ...} — no game/BBS context needs them, and their unbounded
payloads are a denial-of-service vector.
Deviations from the reference parser
This is a tokenizer, not a terminal, so a few parts of the diagram are intentionally absent or differ:
- The "anywhere" transitions are honoured only inside a CSI. ESC, CAN, SUB and C0 controls get no special treatment inside OSC/string scanning or in the escape state itself: e.g. ESC ESC is emitted as a lone {:esc, 0x1B} rather than re-entering the escape state, and a stray control inside an OSC payload is copied verbatim.
- The diagram is 7-bit; colon is reserved. Following the original diagram, a colon (0x3A) inside a CSI drops the sequence into csi_ignore. Terminals that use the ITU colon syntax for sub-parameters (e.g. ESC [ 38 : 2 : r : g : b m) will therefore have those sequences discarded; use the semicolon forms instead.
- No 8-bit C1 controls. Only the 7-bit ESC-prefixed forms are recognised; bytes 0x80..0x9F are treated as ordinary text, which is what UTF-8 continuation bytes require anyway. A 0x80..0xFF byte appearing mid-CSI drops the sequence into csi_ignore.
- DCS, APC, PM and SOS are rejected, not parsed (see above).
- OSC also accepts a BEL terminator, an xterm extension the strict diagram does not include.
- Truncation defers C0 execution. Because incomplete sequences are buffered as leftover, a C0 control arriving mid-sequence is not emitted until the sequence completes; the reference parser would execute it immediately.
Functions
Tokenize a binary into a list of tokens and a leftover binary.
The leftover is the unconsumed tail when input ends mid-sequence. Pass it back prepended to the next chunk to resume parsing.
Examples
iex> Vtex.Input.Tokenizer.tokenize("hi")
{[{:text, "hi"}], ""}
iex> Vtex.Input.Tokenizer.tokenize(<<0x1B, ?[, ?A>>)
{[{:csi, "", "", ?A}], ""}
iex> Vtex.Input.Tokenizer.tokenize(<<0x1B, ?[, "?25h">>)
{[{:csi, "25", "?", ?h}], ""}
iex> Vtex.Input.Tokenizer.tokenize(<<?a, 0x1B, ?[>>)
{[{:text, "a"}], <<0x1B, ?[>>}
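The leftover round trip can be sketched as a small reducer over incoming chunks. This is an illustrative assumption about how a caller might wire it up, not the API of Vtex.Input.Stream, which is not shown here. ChunkFeeder is a hypothetical name, and the tokenizer is passed in as a 1-arity function so the sketch stays self-contained.

```elixir
defmodule ChunkFeeder do
  # Hypothetical helper for illustration only; not part of Vtex.
  # `tokenize` is any 1-arity function honouring the contract described
  # above: binary in, {tokens, leftover} out.
  def feed(chunks, tokenize) when is_function(tokenize, 1) do
    {batches, leftover} =
      Enum.reduce(chunks, {[], ""}, fn chunk, {batches, leftover} ->
        # Prepend the previous leftover before tokenizing the new chunk.
        {tokens, rest} = tokenize.(leftover <> chunk)
        {[tokens | batches], rest}
      end)

    # Batches were accumulated newest-first; restore arrival order.
    {batches |> Enum.reverse() |> Enum.concat(), leftover}
  end
end
```

Going by the examples above, ChunkFeeder.feed([<<?a, 0x1B, ?[>>, "A"], &Vtex.Input.Tokenizer.tokenize/1) should return {[{:text, "a"}, {:csi, "", "", ?A}], ""}: the truncated ESC [ from the first chunk is held back and completed by the second.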