Linx.NFT.Tokenizer (Linx v0.1.0)

Char-by-char lexer for the ~NFT sigil and .nft files.

Mirrors the architecture of Phoenix.LiveView.TagEngine.Tokenizer and of nft's own src/scanner.l: an explicit stack of start conditions (lex states) lets context-sensitive constructs add a new state without disturbing the rest of the lexer.

The conditions in play:

  • :default — top-level lexing of keywords, identifiers, literals, operators, punctuation, statement separators.
  • :line_comment scans # to end of line.
  • :block_comment scans /* ... */ and supports nesting (nft itself doesn't, but supporting nesting costs ~5 lines and prevents a real footgun on hand-edited files).
  • :string scans "..." with \\, \", \n, \t, \r, and \0 escapes. (String-internal Elixir interpolation is not yet supported; when added, it will push :elixir_expr from :string, with no other change required.)
  • :elixir_expr — only enterable when the :interpolation? option is true. Scans an Elixir expression up to the matching }, skipping } characters that appear inside strings/charlists/comments inside the expression.
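
As a rough illustration of the stack discipline described above, the following sketch strips comments from a string while pushing and popping start conditions. It is not the Linx implementation: the module name StackSketch and its function names are hypothetical, only three of the conditions are modeled, and no tokens are produced.

```elixir
defmodule StackSketch do
  # Hypothetical illustration, not Linx code: strips comments from a source
  # string while tracking an explicit stack of start conditions.

  def strip_comments(source), do: step(source, [:default], [])

  # End of input. A line comment may end at EOF; a block comment may not.
  defp step(<<>>, [:default], acc), do: {:ok, finish(acc)}
  defp step(<<>>, [:line_comment | stack], acc), do: step(<<>>, stack, acc)
  defp step(<<>>, [:block_comment | _], _acc), do: {:error, :unterminated_block_comment}

  # :default pushes a new condition when it sees a comment opener.
  defp step("#" <> rest, [:default | _] = stack, acc),
    do: step(rest, [:line_comment | stack], acc)

  defp step("/*" <> rest, [:default | _] = stack, acc),
    do: step(rest, [:block_comment | stack], acc)

  defp step(<<byte, rest::binary>>, [:default | _] = stack, acc),
    do: step(rest, stack, [byte | acc])

  # :line_comment pops at the newline and leaves it for the state below.
  defp step("\n" <> _ = source, [:line_comment | stack], acc),
    do: step(source, stack, acc)

  defp step(<<_byte, rest::binary>>, [:line_comment | _] = stack, acc),
    do: step(rest, stack, acc)

  # :block_comment nests by pushing itself again and pops on "*/".
  defp step("/*" <> rest, [:block_comment | _] = stack, acc),
    do: step(rest, [:block_comment | stack], acc)

  defp step("*/" <> rest, [:block_comment | stack], acc),
    do: step(rest, stack, acc)

  defp step(<<_byte, rest::binary>>, [:block_comment | _] = stack, acc),
    do: step(rest, stack, acc)

  # The accumulator is a reversed list of bytes.
  defp finish(acc), do: acc |> Enum.reverse() |> IO.iodata_to_binary()
end
```

In this sketch, StackSketch.strip_comments("a /* x /* y */ z */ b # c\nd") returns {:ok, "a  b \nd"}. Supporting a further condition would mean adding one push clause in :default plus that condition's own clauses, leaving the existing ones untouched.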

Token shape

Each token is a 2- or 3-tuple:

{:kind, meta}                # punctuation with no payload
{:kind, value, meta}         # everything else

where meta is %{line: pos_integer(), column: pos_integer()} pointing at the start of the token.

Identifiers are emitted as {:identifier, "name", meta} — the parser decides which names are keywords. (Pattern-matching on binaries is ergonomic in Elixir; this avoids a 200-entry keyword table here.)
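
To make the division of labor concrete, here is a minimal sketch of how a consumer might decide which identifiers are keywords by matching on the binary payload. The module KeywordSketch, the result tags (:keyword, :name, :punct, :literal), and the punctuation kind :lbrace are all hypothetical; only the 2- and 3-tuple token shapes come from the description above.

```elixir
defmodule KeywordSketch do
  # Hypothetical parser-side helper. The tokenizer emits every name as
  # {:identifier, name, meta}; the consumer decides what counts as a keyword.

  # Keywords are recognized by matching on the binary payload.
  def classify({:identifier, "table", meta}), do: {:keyword, :table, meta}
  def classify({:identifier, "chain", meta}), do: {:keyword, :chain, meta}

  # Any other identifier is treated as a user-chosen name.
  def classify({:identifier, name, meta}) when is_binary(name), do: {:name, name, meta}

  # 2-tuples (punctuation, :stmt_sep) carry no payload.
  def classify({kind, meta}) when is_atom(kind) and is_map(meta), do: {:punct, kind, meta}

  # Remaining 3-tuples are value-carrying literals.
  def classify({kind, value, meta}) when is_atom(kind), do: {:literal, kind, value, meta}
end
```

Adding a keyword under this scheme is one more function clause rather than an entry in a lexer-side table.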

Statement separators

In nft syntax, statements inside a { ... } body are separated by either ; or a newline. To keep parsing simple, the tokenizer emits a single :stmt_sep token for every ; and for every (possibly multi-line) run of newlines, collapsing consecutive separators into one. Newlines that appear inside brackets are still emitted — the parser ignores spurious separators in positions where they're not meaningful.

Line continuations (\\\n) are consumed silently.
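
The collapsing rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual tokenizer: the module SepSketch is hypothetical, every non-separator run is returned as a plain string, position metadata is omitted, and strings, comments, and brackets are not handled.

```elixir
defmodule SepSketch do
  # Hypothetical sketch: splits source into words and collapsed :stmt_sep
  # markers. Positions (meta) and all other token kinds are omitted.

  def scan(source), do: source |> do_scan([]) |> Enum.reverse()

  defp do_scan(<<>>, acc), do: acc

  # Backslash-newline is a line continuation: consumed, nothing emitted.
  defp do_scan("\\\n" <> rest, acc), do: do_scan(rest, acc)

  # ';' or newline directly after a separator: collapse into the previous one.
  defp do_scan(<<c, rest::binary>>, [:stmt_sep | _] = acc) when c in [?;, ?\n],
    do: do_scan(rest, acc)

  # First ';' or newline of a run: emit a single :stmt_sep.
  defp do_scan(<<c, rest::binary>>, acc) when c in [?;, ?\n],
    do: do_scan(rest, [:stmt_sep | acc])

  # Other whitespace is skipped and does not break up a run of separators.
  defp do_scan(<<c, rest::binary>>, acc) when c in [?\s, ?\t, ?\r, ?\v, ?\f],
    do: do_scan(rest, acc)

  # Anything else: take a run of non-space, non-separator bytes as a word
  # (a lone backslash becomes a one-byte word).
  defp do_scan(source, acc) do
    [word] = Regex.run(~r/\A(?:[^\s;\\]+|\\)/, source)
    n = byte_size(word)
    <<_::binary-size(n), rest::binary>> = source
    do_scan(rest, [word | acc])
  end
end
```

For example, SepSketch.scan("a;\n\n b \\\n c\n;d") returns ["a", :stmt_sep, "b", "c", :stmt_sep, "d"]: the run ";\n\n" collapses to one separator and the continuation between b and c emits nothing.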

Numeric / address literals

Network primitives need a small lookahead to disambiguate:

  • 0x... / 0X... — hex integer.
  • 0b... / 0B... — binary integer.
  • \d+ not followed by ., :, or / — plain decimal integer.
  • \d+\.\d+\.\d+\.\d+ — IPv4 literal (optional /N CIDR).
  • IPv6: any run starting with hex chars that contains : and whose contents are valid IPv6 syntax.
  • MAC: six 2-char hex octets joined by :.

Identifiers that happen to begin with hex letters (e.g. eth0 or even fe80) are still tagged as identifiers when not followed by :. If the identifier is all-hex and followed by : plus a hex char, the lexer rewinds and re-scans as an IPv6/MAC literal.
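
The classification rules above can be approximated with a whole-lexeme classifier. This is only a sketch under simplifying assumptions: the real lexer decides with character lookahead and rewinding rather than regexes over a pre-split lexeme, the module LiteralSketch is hypothetical, position metadata and :cidr_v6 are omitted, and IPv4 octet ranges are not validated. IPv6 validity is delegated to Erlang's :inet.parse_address/1.

```elixir
defmodule LiteralSketch do
  # Hypothetical whole-lexeme classifier; the real lexer works char-by-char
  # with lookahead. Token meta is omitted for brevity.

  def classify(lexeme) when is_binary(lexeme) do
    cond do
      Regex.match?(~r/\A0[xX][0-9a-fA-F]+\z/, lexeme) ->
        {:integer, lexeme |> strip_prefix() |> String.to_integer(16)}

      Regex.match?(~r/\A0[bB][01]+\z/, lexeme) ->
        {:integer, lexeme |> strip_prefix() |> String.to_integer(2)}

      Regex.match?(~r/\A\d+\z/, lexeme) ->
        {:integer, String.to_integer(lexeme)}

      Regex.match?(~r/\A\d+\.\d+\.\d+\.\d+\/\d+\z/, lexeme) ->
        {:cidr_v4, lexeme}

      Regex.match?(~r/\A\d+\.\d+\.\d+\.\d+\z/, lexeme) ->
        {:ipv4, lexeme}

      # MAC is checked before IPv6 because both are colon-separated hex.
      Regex.match?(~r/\A[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}\z/, lexeme) ->
        {:mac, lexeme}

      ipv6?(lexeme) ->
        {:ipv6, lexeme}

      # All-hex names such as "fe80" fall through here when no ':' follows.
      true ->
        {:identifier, lexeme}
    end
  end

  # Drops the two-character "0x" / "0b" prefix.
  defp strip_prefix(lexeme), do: binary_part(lexeme, 2, byte_size(lexeme) - 2)

  # Requires a ':' and only address characters, then asks :inet to validate.
  defp ipv6?(lexeme) do
    String.contains?(lexeme, ":") and
      Regex.match?(~r/\A[0-9a-fA-F:.]+\z/, lexeme) and
      match?(
        {:ok, {_, _, _, _, _, _, _, _}},
        :inet.parse_address(String.to_charlist(lexeme))
      )
  end
end
```

With this sketch, "fe80" and "eth0" classify as identifiers, while "fe80::1" classifies as {:ipv6, "fe80::1"}, which mirrors the rewind rule described above.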

Errors

Anything the tokenizer can't classify raises a Linx.NFT.ParseError with {file, line, column} and the offending source line. The caller (sigil macro, parse/1, parse_file/1) catches and either re-raises (compile-time) or returns {:error, %ParseError{}}.
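
The raise-then-convert pattern described above might look like the following sketch. Everything here is a hypothetical stand-in: SketchError substitutes for Linx.NFT.ParseError (whose actual fields are not shown in this document), lex!/1 is a toy lexer that rejects "@", and the line number is hard-coded to 1.

```elixir
defmodule SketchError do
  # Hypothetical stand-in for Linx.NFT.ParseError; field names are assumptions.
  defexception [:file, :line, :column, :message]
end

defmodule ErrorSketch do
  # Runtime-style entry point: converts the raise into a tagged tuple.
  def parse(source) do
    {:ok, lex!(source)}
  rescue
    e in SketchError -> {:error, e}
  end

  # Compile-time-style entry point: lets the exception propagate, as a
  # sigil macro would so the compiler reports the location.
  def parse!(source), do: lex!(source)

  # Toy lexer: splits on whitespace and rejects any "@".
  # The line is fixed at 1, so this is only meaningful for one-line input.
  defp lex!(source) do
    case :binary.match(source, "@") do
      :nomatch ->
        String.split(source)

      {offset, _length} ->
        raise SketchError,
          file: "nofile",
          line: 1,
          column: offset + 1,
          message: "unexpected @"
    end
  end
end
```

Keeping the internal lexer raise-only and converting at the boundary means the lexing code never has to thread error tuples through every clause.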

Extensibility

The architecture was chosen to support incremental extension: the supported grammar is the common ~85% subset, and the long tail of nft constructs (synproxy, secmark, osf, fib, jhash, advanced ct, dup/fwd, tproxy, xfrm, tunnel) will be added per-construct over time. Each addition becomes:

  1. (Optional) a new start condition pushed from somewhere in :default — add a clause and a step function.
  2. (Optional) a new token kind — extend the @type token union and the parser's pattern matches.

The stack discipline means none of these touch existing conditions.

Types

token()

@type token() ::
  {:identifier, String.t(), token_meta()}
  | {:integer, integer(), token_meta()}
  | {:string, String.t(), token_meta()}
  | {:ipv4, String.t(), token_meta()}
  | {:ipv6, String.t(), token_meta()}
  | {:mac, String.t(), token_meta()}
  | {:cidr_v4, String.t(), token_meta()}
  | {:cidr_v6, String.t(), token_meta()}
  | {:elixir_expr, String.t(), token_meta()}
  | {:stmt_sep, token_meta()}
  | {atom(), token_meta()}

token_meta()

@type token_meta() :: %{line: pos_integer(), column: pos_integer()}

Functions

tokenize(source, opts \\ [])

@spec tokenize(
  String.t(),
  keyword()
) :: {:ok, [token()]} | {:error, Linx.NFT.ParseError.t()}

Tokenizes source into a flat list of tokens.

Options

  • :file — source filename for error messages (default "nofile").
  • :line — starting line number (default 1); useful when called from a ~NFT sigil with __CALLER__.line to make error locations line up with the surrounding .ex source.
  • :column — starting column number (default 1).
  • :interpolation? — whether to recognize #{...} Elixir interpolation (default false). The sigil sets this to true; parse/1 / parse_file/1 leave it false.

Returns {:ok, tokens} or {:error, %Linx.NFT.ParseError{}}.