Vtex.Input.Tokenizer (Vtex v0.1.0)

Pure, stateless tokenizer for VT/ANSI escape sequences.

Takes a binary and returns a list of typed tokens plus a leftover binary containing any trailing bytes that form an incomplete sequence. The leftover is meant to be prepended to the next chunk of input (see Vtex.Input.Stream).

The tokenizer implements Paul Williams' ANSI parser state machine as faithfully as a stateless tokenizer can. Within a control sequence it reproduces the csi_entry / csi_param / csi_intermediate / csi_ignore sub-states: parameter and intermediate bytes are collected into separate buffers, and a malformed ordering drops into csi_ignore, which swallows the rest of the sequence and emits nothing, exactly as the diagram specifies. Inside a control sequence it also honours the diagram's "anywhere" transitions: ESC restarts parsing, CAN and SUB abort the sequence, other C0 controls are passed through (executed in place) while the sequence continues, and DEL is ignored. The tokenizer keeps no state between calls and has no side effects, so it can be tested with raw binaries alone.

A handful of details still differ from the reference parser — see "Deviations from the reference parser" below.

Token types

{:text, binary()}                          # run of printable / control bytes
{:csi, params, intermediates, final}       # ESC [ <params> <intermediates> <final>
{:ss3, byte()}                             # ESC O X
{:osc, payload :: binary()}                # ESC ] ... ST
{:esc, byte()}                             # ESC <other>
{:invalid, binary()}                       # failed / rejected sequence

For :csi, params and intermediates are binaries and final is a byte. For example ESC [ ? 2 5 h tokenizes to {:csi, "25", "?", ?h} — the private marker ? lands in intermediates, the digits in params.

Truncated sequences are never emitted as tokens; they are returned as the leftover binary so the caller can buffer them until more bytes arrive.
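
As an illustration of how these token shapes might be consumed, here is a minimal sketch. The TokenDemo module and its function names are hypothetical and not part of Vtex; only the tuple shapes come from the list above. Reading CSI A / CSI B as cursor keys and h / l as set / reset follows common terminal convention and is an assumption about the caller's needs.

```elixir
defmodule TokenDemo do
  # Hypothetical consumer of tokenizer output; not part of Vtex.

  # Split a CSI params binary such as "1;5" into integers.
  # An empty field (as in "1;;5") becomes nil, meaning "use the default".
  def parse_params(""), do: []

  def parse_params(params) when is_binary(params) do
    params
    |> String.split(";")
    |> Enum.map(fn
      "" -> nil
      digits -> String.to_integer(digits)
    end)
  end

  # Map each documented token shape to a caller-friendly term.
  def describe({:text, text}), do: {:text, text}
  def describe({:csi, "", "", ?A}), do: {:key, :up}
  def describe({:csi, "", "", ?B}), do: {:key, :down}

  # Private-marker sequences such as ESC [ ? 25 h (assumed: h sets, l resets).
  def describe({:csi, params, "?", final}) when final in [?h, ?l] do
    {:private_mode, parse_params(params), final == ?h}
  end

  def describe({:csi, params, intermediates, final}) do
    {:csi, parse_params(params), intermediates, <<final>>}
  end

  def describe({:ss3, byte}), do: {:ss3, <<byte>>}
  def describe({:osc, payload}), do: {:osc, payload}
  def describe({:esc, byte}), do: {:esc, <<byte>>}
  def describe({:invalid, raw}), do: {:invalid, byte_size(raw)}
end
```

Because params is documented as containing only digits and ;, String.to_integer/1 should not raise on well-formed tokens.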

Byte ranges

  • CSI parameter bytes: 0x30..0x39 and ; — collected into params
  • CSI intermediate / private-marker bytes: 0x20..0x2F and 0x3C..0x3F — collected into intermediates
  • CSI final byte: 0x40..0x7E
  • SS3: ESC O <byte> — always exactly three bytes
  • OSC: ESC ] <payload> terminated by ST (ESC \) or by BEL (0x07)
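
The CSI byte ranges above can be written as guard clauses. The following is only a sketch that mirrors the documented ranges; the CsiBytes module is hypothetical and is not the tokenizer's actual implementation.

```elixir
defmodule CsiBytes do
  # Hypothetical classifier mirroring the documented CSI byte ranges.

  # Digits 0-9 and the ; separator are collected into params.
  def classify(b) when b in 0x30..0x39 or b == ?;, do: :param

  # 0x20..0x2F and the private markers < = > ? are collected into intermediates.
  def classify(b) when b in 0x20..0x2F or b in 0x3C..0x3F, do: :intermediate

  # A final byte ends the sequence.
  def classify(b) when b in 0x40..0x7E, do: :final

  # Colon (0x3A) is reserved and sends the sequence to csi_ignore.
  def classify(0x3A), do: :ignore

  # C0 controls, DEL and 8-bit bytes fall outside these ranges.
  def classify(b) when b in 0..255, do: :other
end
```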

DCS, APC, PM and SOS strings are recognised and immediately rejected as {:invalid, ...} — no game/BBS context needs them, and their unbounded payloads are a denial-of-service vector.

Deviations from the reference parser

This is a tokenizer, not a terminal, so a few parts of the diagram are intentionally absent or differ:

  • The "anywhere" transitions are honoured only inside a CSI. ESC, CAN, SUB and C0 controls get no special treatment inside OSC/string scanning or in the escape state itself — e.g. ESC ESC is emitted as a lone {:esc, 0x1B} rather than re-entering the escape state, and a stray control inside an OSC payload is copied verbatim.
  • Colon is reserved. Following the original diagram, a colon (0x3A) inside a CSI drops the sequence into csi_ignore. Sequences that use the ITU colon syntax for sub-parameters (e.g. ESC [ 38 : 2 : r : g : b m) are therefore discarded; use the semicolon forms instead.
  • No 8-bit C1 controls. Only the 7-bit ESC-prefixed forms are recognised. Outside a sequence, bytes 0x80..0x9F are treated as ordinary text, which UTF-8 input requires anyway because those values occur as continuation bytes. A 0x80..0xFF byte appearing mid-CSI drops the sequence into csi_ignore.
  • DCS, APC, PM and SOS are rejected, not parsed (see above).
  • OSC also accepts a BEL terminator, an xterm extension the strict diagram does not include.
  • Truncation defers C0 execution. Because incomplete sequences are buffered as leftover, a C0 control arriving mid-sequence is not emitted until the sequence completes; the reference parser would execute it immediately.

Types

token()

@type token() ::
  {:text, binary()}
  | {:csi, binary(), binary(), byte()}
  | {:ss3, byte()}
  | {:osc, binary()}
  | {:esc, byte()}
  | {:invalid, binary()}

Functions

tokenize(data)

@spec tokenize(binary()) :: {[token()], binary()}

Tokenize a binary into a list of tokens and a leftover binary.

The leftover is the unconsumed tail when input ends mid-sequence. Pass it back prepended to the next chunk to resume parsing.

Examples

iex> Vtex.Input.Tokenizer.tokenize("hi")
{[{:text, "hi"}], ""}

iex> Vtex.Input.Tokenizer.tokenize(<<0x1B, ?[, ?A>>)
{[{:csi, "", "", ?A}], ""}

iex> Vtex.Input.Tokenizer.tokenize(<<0x1B, ?[, "?25h">>)
{[{:csi, "25", "?", ?h}], ""}

iex> Vtex.Input.Tokenizer.tokenize(<<?a, 0x1B, ?[>>)
{[{:text, "a"}], <<0x1B, ?[>>}
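
The leftover contract can be sketched as a small fold over incoming chunks. ChunkFeeder is a hypothetical helper and not part of Vtex (Vtex.Input.Stream is the module the documentation points to for this job). The tokenizer is passed in as a function, so the sketch relies only on the documented {tokens, leftover} return shape.

```elixir
defmodule ChunkFeeder do
  # Hypothetical helper; shows how a caller might thread the leftover
  # binary between successive tokenizer calls.

  # `tokenize` is any function with the shape binary -> {tokens, leftover},
  # for example &Vtex.Input.Tokenizer.tokenize/1.
  def feed_all(chunks, tokenize) when is_list(chunks) and is_function(tokenize, 1) do
    Enum.reduce(chunks, {[], ""}, fn chunk, {acc, leftover} ->
      # Prepend the unconsumed tail of the previous chunk before tokenizing.
      {tokens, rest} = tokenize.(leftover <> chunk)
      {acc ++ tokens, rest}
    end)
  end
end
```

A call such as ChunkFeeder.feed_all([<<?a, 0x1B>>, "[A"], &Vtex.Input.Tokenizer.tokenize/1) would tokenize a sequence that is split across two reads. Any leftover still present after the last chunk is returned to the caller, which has to decide whether to keep buffering or discard it.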