Grammar (Grammar v0.4.0)

This module exposes functions and macros to create parsers of structured inputs. Parsers are defined as LL(1) grammars.

A grammar must be LL(1), i.e. it must be unambiguous and parsable with a single token of lookahead.

One can create parsers at runtime, using the Grammar functions, or define parsers at compile time using the Grammar macros.

The grammar is defined by a set of rules, each rule being a set of clauses. Clauses must be understood as disjoint paths in the rule resolution, or as the or operator in classic grammar notation.
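
For instance, clauses of the same rule that share the same first token are not disjoint, and grammar preparation reports them (a sketch using the runtime API described below; the exact error payload is elided):

iex> {:error, :ambiguities_found, _details} =
...>   Grammar.new()
...>   |> Grammar.add_clause(:start, ["a"], fn [a] -> a end)
...>   |> Grammar.add_clause(:start, ["a", "b"], fn [a, b] -> a <> b end) # same first token "a"
...>   |> Grammar.prepare()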

The tokenization process relies on the TokenExtractor protocol, which is used to extract tokens from the input string. This protocol is implemented for BitString and Regex, and can be extended to custom token types.

Handling spaces and line breaks

By default, the tokenizer will drop spaces and line breaks.

If you want to keep them, you can pass the drop_spaces: false option to the use Grammar macro.

When using the Tokenizer directly, you must pass false as the second parameter to Tokenizer.new/2 to keep spaces and line breaks.

In this case, you are fully responsible for handling spaces and line breaks in your rules.
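
For example, a sketch of driving a grammar with a space-preserving tokenizer (the grammar itself is illustrative):

iex> g = Grammar.new()
...> |> Grammar.add_clause(:start, ["hello", ~r/[\s]+/, "world"], fn [hello, _spaces, world] -> "#{hello} #{world}" end)
...> |> Grammar.prepare!()
...> |> Grammar.start(:start)
iex> Grammar.loop(g, Grammar.Tokenizer.new("hello world", false))
{:ok, "hello world"}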

Bit level pattern matching

TODO:

Using the API

Creating a new grammar is done by calling Grammar.new/0, then adding clauses to it using Grammar.add_clause/5.

Once all the rules are defined, the grammar must be prepared using Grammar.prepare/1.

The grammar is then ready to be used for parsing, by calling Grammar.start/2 with the starting rule name, and then Grammar.loop/2 with the input string wrapped in a Tokenizer.

Example

iex> g = Grammar.new()
...> |> Grammar.add_clause(:start, ["hello", :what], fn ["hello", what] -> "hello #{what} !" end)
...> |> Grammar.add_clause(:what, [~r/[a-zA-Z]+/], fn [ident] -> ident end)
...> |> Grammar.prepare!()
...> |> Grammar.start(:start)
iex> Grammar.loop(g, Grammar.Tokenizer.new("hello world"))
{:ok, "hello world !"}

Using the DSL

To declare a parser module, just use Grammar in your module, and define your rules using the rule/2 and rule?/2 macros. The newly defined module will expose a parse/1 function that parses an input string, returning {:ok, result} on success, or {:error, {line, column}, reason} on failure, as shown in the examples below.

See rule/2 for a full example.
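
For a quick taste, a minimal module might look like this (a sketch; the module and rule names are illustrative):

iex> defmodule HelloParser do
...>   use Grammar
...>
...>   rule start("hello", ~r/[a-z]+/) do
...>     [_hello, name] = params
...>     "hello #{name}"
...>   end
...> end
iex> HelloParser.parse("hello world")
{:ok, "hello world"}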

Options

  • drop_spaces: true (default): if set to false, the tokenizer will not drop spaces and line breaks.
  • sub_byte_matching: false (default): if set to true, the tokenizer will match tokens at binary-level.

Example

In the following MyModuleKO module, the start rule doesn't handle spaces and line breaks, so it will fail if the input contains them.

iex> defmodule MyModuleKO do
...>   use Grammar, drop_spaces: false
...>
...>   # spaces and linebreaks not handled
...>
...>   rule start("hello", "world") do
...>     [_hello, _world] = params
...>     "hello world"
...>   end
...> end
iex> MyModuleKO.parse("helloworld")
{:ok, "hello world"}
iex> MyModuleKO.parse("hello world")
{:error, {1, 6}, :no_token}

But in the MyModuleOK module, the start rule explicitly handles spaces and line breaks between "hello" and "world".

Moreover, since that rule definition requires at least one space between "hello" and "world", parsing fails if no space is found.

iex> defmodule MyModuleOK do
...>   use Grammar, drop_spaces: false
...>
...>   # spaces and linebreaks handled explicitly in the rule
...>
...>   rule start("hello", ~r/[\s]+/, "world") do
...>     [_hello, _spaces, _world] = params
...>     "hello world"
...>   end
...> end
iex> MyModuleOK.parse("helloworld")
{:error, {1, 6}, :no_token}
iex> MyModuleOK.parse(~s/hello  \t world/)
{:ok, "hello world"}

Bit level pattern matching

Please see the dedicated livebook for more information.

Summary

Functions

Process the input tokenizer, until completion or error.

Create a new empty Grammar.

Use this function after defining all the rules of the grammar to prepare the grammar for parsing.

Same as prepare/1 but raises a RuntimeError if an error is found.

Use this macro to define rules of your grammar.

Same as rule/2 but relaxed: if the rule cannot be matched, it is valued as nil.

Reset the grammar to its initial state, and set the starting rule name.

Step once in the input Tokenizer, using grammar.

Types

@type error() :: {error_type(), term()}
@type error_type() :: :no_clause_matched | :no_token
@type first() :: term()
@type rules() :: %{required(Grammar.Rule.name()) => Grammar.Rule.t()}
@type t() :: %Grammar{
  firsts: %{required(Grammar.Rule.name()) => [first()]},
  heap: list(),
  rules: rules(),
  stack: list()
}

Functions

add_clause(grammar, name, substitution, function, epsilon? \\ false)

Add a clause to the grammar.

A clause is defined by a name, a substitution, a function to execute when the clause is matched, and an epsilon flag.

Clauses sharing the same name are considered as clauses of a single rule.

The substitution is a list of terms which declares the steps to match the clause, from left to right, first to last.

Each term can be either

  • a rule name, which is an atom,
  • a value for which there is a TokenExtractor protocol implementation.

TokenExtractor implementations are provided for BitString and Regex, and can be extended to custom token types.

The function is a callback executed when the clause is fully matched. It is given a list as its only parameter; each element of that list is the value produced by the matching of the corresponding term of the substitution.

The epsilon flag indicates whether the rule is optional.

⚠️ The epsilon flag is taken into account only when the first clause of a rule is added; it is ignored for subsequent clauses, even when explicitly set.

Example

The second clause of :start is marked as epsilon, but the flag is ignored.

iex> g = Grammar.new()
...> |> Grammar.add_clause(:start, ["hello"], fn [value] -> value end)
...> |> Grammar.add_clause(:start, ["world"], fn [value] -> value end, true) # `true` is ignored!
...> |> Grammar.prepare!()
...> |> Grammar.start(:start)
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("hello"))
{:ok, "hello"}
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("world"))
{:ok, "world"}
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("pouet"))
{:error, {1, 1}, :no_clause_matched}

The first clause of :start is marked as epsilon!

iex> g = Grammar.new()
...> |> Grammar.add_clause(:start, ["hello"], fn [value] -> value end, true) # `true` is used!
...> |> Grammar.add_clause(:start, ["world"], fn [value] -> value end, false) # `false` is ignored!
...> |> Grammar.prepare!()
...> |> Grammar.start(:start)
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("hello"))
{:ok, "hello"}
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("world"))
{:ok, "world"}
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("pouet"))
{:ok, nil}

loop(grammar, tokenizer)

@spec loop(t(), Grammar.Tokenizer.t()) :: {:ok, any()} | {:error, error()}

Process the input tokenizer, until completion or error.
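
A minimal sketch, mirroring the module-level example:

iex> g = Grammar.new()
...> |> Grammar.add_clause(:start, [~r/[0-9]+/], fn [number] -> String.to_integer(number) end)
...> |> Grammar.prepare!()
...> |> Grammar.start(:start)
iex> Grammar.loop(g, Grammar.Tokenizer.new("42"))
{:ok, 42}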

@spec new() :: t()

Create a new empty Grammar.

@spec prepare(t()) ::
  {:ok, t()}
  | {:error, :missing_rules, [Grammar.RulesChecker.miss()]}
  | {:error, :cycles_found, [Grammar.RulesChecker.path()]}
  | {:error, :ambiguities_found, [Grammar.RulesChecker.ambiguity()]}

Use this function after defining all the rules of the grammar to prepare the grammar for parsing.

This function validates the grammar and returns any errors found.

⚠️ A grammar that is not prepared cannot be used for parsing, so step/2 and loop/2 will behave unexpectedly.
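
For example (a sketch; the exact shape of the error details is elided):

iex> {:ok, _g} =
...>   Grammar.new()
...>   |> Grammar.add_clause(:start, ["hello"], fn [greeting] -> greeting end)
...>   |> Grammar.prepare()
iex> {:error, :missing_rules, _misses} =
...>   Grammar.new()
...>   |> Grammar.add_clause(:start, ["hello", :missing], fn [greeting, _] -> greeting end) # :missing is never defined
...>   |> Grammar.prepare()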

@spec prepare!(t()) :: t()

Same as prepare/1 but raises a RuntimeError if an error is found.

rule(arg, list) (macro)

Use this macro to define rules of your grammar.

The first rule defined will be the entry rule of the grammar.

Calls to this macro sharing the same name are grouped together, as they define the same rule; each call is a possible path in the rule resolution.

Let's call a single invocation of rule a clause. All clauses of a rule must be disjoint, i.e. they must not share the same first token. They can be understood as the or operator in a rule.

Each rule clause is defined by

  • a name, which is an atom
  • a definition, which is a list of atoms or token prototypes
  • a block, which is the code to execute when the clause is fully matched

When executed, the code block is provided with a params binding, which is a list of the results of the clause's steps.

In the case where a rule cannot be matched, a RuntimeError is raised (see rule?/2 for a relaxed version).

Example

iex> defmodule NumberOfNameListParser do
...>   use Grammar
...>
...>   rule start("[", :list_or_empty_list) do
...>     [_, list] = params
...>     list || []
...>   end
...>
...>   rule? list_or_empty_list(:item, :list_tail, "]") do
...>     [item, list_tail, _] = params
...>     [item | (list_tail || [])]
...>   end
...>
...>   rule? list_tail(",", :item, :list_tail) do
...>     [_, item, list_tail] = params
...>     [item | (list_tail || [])]
...>   end
...>
...>   rule item(~r/[0-9]+/) do
...>     [number] = params
...>     String.to_integer(number)
...>   end
...>
...>   rule item(~r/[a-zA-Z]+/) do
...>     [string] = params
...>     string
...>   end
...> end
iex> NumberOfNameListParser.parse("[1, toto, 23]")
{:ok, [1, "toto", 23]}

rule?(arg, list) (macro)

Same as rule/2 but relaxed: if the rule cannot be matched, it is valued as nil.

Useful for optional or recursive rules.

See example in rule/2.
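
A minimal sketch of an optional rule (module and rule names are illustrative):

iex> defmodule GreeterParser do
...>   use Grammar
...>
...>   rule start("hello", :name?) do
...>     [_hello, name] = params
...>     "hello #{name || "stranger"}"
...>   end
...>
...>   # optional: when no name is present, it is valued as nil
...>   rule? name?(~r/[a-z]+/) do
...>     [name] = params
...>     name
...>   end
...> end
iex> GreeterParser.parse("hello world")
{:ok, "hello world"}
iex> GreeterParser.parse("hello")
{:ok, "hello stranger"}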

start(grammar, rule_name)
@spec start(t(), Grammar.Rule.name()) :: t()

Reset the grammar to its initial state, and set the starting rule name.

Usually this function is called once before loop/2 to process some input.

Another usage is to start from different rules within the same grammar, mainly for testing during development.

Example

iex> g = Grammar.new()
...> |> Grammar.add_clause(:start, ["<", :lists?, ">"], fn [_, ls, _] -> ls || [] end)
...> |> Grammar.add_clause(:lists?, [:list, :lists?], fn [l, ls] -> [l | ls || []] end, true)
...> |> Grammar.add_clause(:list, ["[", :elements?, "]"], fn [_, es, _] -> es || [] end)
...> |> Grammar.add_clause(:elements?, [:element, :elements?], fn [e, es] -> [e | es || []] end, true)
...> |> Grammar.add_clause(:element, [~r/[a-z]+/], fn [value] -> value end)
...> |> Grammar.prepare!()
...>
iex> g
...> |> Grammar.start(:start)
...> |> Grammar.loop(Grammar.Tokenizer.new("<[a b c] [dd ee ff]>"))
{:ok, [["a", "b", "c"], ["dd", "ee", "ff"]]}
...>
iex> g = Grammar.start(g, :element)
...> Grammar.loop(g, Grammar.Tokenizer.new("<[a b c] [dd ee ff]>"))
{:error, {1, 1}, :no_clause_matched}
iex> Grammar.loop(g, Grammar.Tokenizer.new("ff"))
{:ok, "ff"}
iex> g = Grammar.start(g, :list)
...> Grammar.loop(g, Grammar.Tokenizer.new("<[a b c] [dd ee ff]>"))
{:error, {1, 1}, :no_clause_matched}
iex> Grammar.loop(g, Grammar.Tokenizer.new("[a b c]"))
{:ok, ["a", "b", "c"]}

step(grammar, tokenizer)

@spec step(t(), Grammar.Tokenizer.t()) ::
  {:cont, t(), Grammar.Tokenizer.t()}
  | {:halt, :eof, t(), Grammar.Tokenizer.t()}
  | {:halt, error(), t(), Grammar.Tokenizer.t()}

Step once in the input Tokenizer, using grammar.

This function returns a tuple whose first term is either :cont or :halt.

  • :cont: the second term is the updated grammar, and the third term is the updated tokenizer.
  • :halt: the second term is either :eof or an error tuple, the third term is the grammar, and the fourth term is the tokenizer.

This function is used internally by loop/2, but it can also be used to step manually through the parsing process.

Example

It takes 5 steps to reach the first element of the input list.

iex> grammar = Grammar.new()
...> |> Grammar.add_clause(:start, [:loop?], &Enum.at(&1, 0, []))
...> |> Grammar.add_clause(:loop?, [:element, :loop?], fn [head, tail] -> [head | tail || []] end, true)
...> |> Grammar.add_clause(:element, [:ident], &Enum.at(&1, 0))
...> |> Grammar.add_clause(:ident, [~r/[a-z]+/], fn [ident] ->  ident end)
...> |> Grammar.prepare!()
...> |> Grammar.start(:start)
iex> tokenizer = Grammar.Tokenizer.new("a b c d")
iex> {:cont, grammar, tokenizer} = Grammar.step(grammar, tokenizer)
iex> {:cont, grammar, tokenizer} = Grammar.step(grammar, tokenizer)
iex> {:cont, grammar, tokenizer} = Grammar.step(grammar, tokenizer)
iex> {:cont, grammar, tokenizer} = Grammar.step(grammar, tokenizer)
iex> {:cont, grammar, _tokenizer} = Grammar.step(grammar, tokenizer)
iex> [{_callback, 1, ["a"]} | _] = grammar.heap