Grammar (Grammar v0.3.0)
This module exposes functions and macros to create parsers of structured inputs. Parsers are defined as LL(1) grammars: a grammar must be unambiguous and resolvable with a single token of lookahead.
One can create parsers at runtime using the Grammar functions, or define parsers at compile time using the Grammar macros.
A grammar is defined by a set of rules, each rule being a set of clauses. Clauses must be understood as disjoint paths in the rule resolution, or as the "or" operator in the classic notation.
The tokenization process relies on the TokenExtractor protocol, which is used to extract tokens from the input string. This protocol is implemented for BitString and Regex, and can be extended to custom token types.
Handling of spaces and line breaks
By default, the tokenizer will drop spaces and line breaks.
If you want to keep them, pass the drop_spaces: false option to the use Grammar macro.
When using the Tokenizer directly, you must pass false as the second parameter to Tokenizer.new/2 to keep spaces and line breaks. In this case, you are fully responsible for handling spaces and line breaks in your rules.
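For example, a minimal sketch of building the tokenizer by hand while keeping spaces (the input string is illustrative):

# Passing false as the second argument keeps spaces and line breaks,
# which your rules must then match explicitly.
tokenizer = Grammar.Tokenizer.new("hello \t world", false)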
Using the API
Creating a new grammar is done by calling Grammar.new/0, then adding clauses to it using Grammar.add_clause/5.
Once all the rules are defined, the grammar must be prepared using Grammar.prepare/1.
The grammar is then ready to be used for parsing, by calling Grammar.start/2 with the starting rule name, and then Grammar.loop/2 with the input string wrapped in a Tokenizer.
Example
iex> g = Grammar.new()
...> |> Grammar.add_clause(:start, ["hello", :what], fn ["hello", what] -> "hello #{what} !" end)
...> |> Grammar.add_clause(:what, [~r/[a-zA-Z]+/], fn [ident] -> ident end)
...> |> Grammar.prepare!()
...> |> Grammar.start(:start)
iex> Grammar.loop(g, Grammar.Tokenizer.new("hello world"))
{:ok, "hello world !"}
Using the DSL
To declare a parser module, just use the Grammar module in your module, and define your rules using the rule/2 and rule?/2 macros.
The newly defined module will expose a parse/1 function that parses the input string and returns either {:ok, result} or an error tuple, as shown in the examples below.
See rule/2 for a full example.
Options
- drop_spaces: true (default): if set to false, the tokenizer will not drop spaces and line breaks.
Example
In the following MyModuleKO module, the start rule doesn't handle spaces and line breaks, so it will fail if the input contains them.
iex> defmodule MyModuleKO do
...> use Grammar, drop_spaces: false
...>
...> # spaces and linebreaks not handled
...>
...> rule start("hello", "world") do
...> [_hello, _world] = params
...> "hello world"
...> end
...> end
iex> MyModuleKO.parse("helloworld")
{:ok, "hello world"}
iex> MyModuleKO.parse("hello world")
{:error, {1, 6}, :no_token}
But in the MyModuleOK module, the start rule explicitly handles spaces and line breaks between "hello" and "world".
Moreover, since the rule definition requires at least one space between "hello" and "world", parsing fails if no space is found.
iex> defmodule MyModuleOK do
...> use Grammar, drop_spaces: false
...>
...> # spaces and linebreaks handled explicitly in the rule
...>
...> rule start("hello", ~r/[\s]+/, "world") do
...> [_hello, _spaces, _world] = params
...> "hello world"
...> end
...> end
iex> MyModuleOK.parse("helloworld")
{:error, {1, 6}, :no_token}
iex> MyModuleOK.parse(~s/hello \t world/)
{:ok, "hello world"}
Summary
Functions
- add_clause/5: Add a clause to the grammar.
- loop/2: Process the input tokenizer, until completion or error.
- new/0: Create a new empty Grammar.
- prepare/1: Use this function after defining all the rules of the grammar to prepare the grammar for parsing.
- prepare!/1: Same as prepare/1 but raises a RuntimeError if an error is found.
- rule/2: Use this macro to define rules of your grammar.
- rule?/2: Same as rule/2 but relaxed: if the rule cannot be matched, it will be valued as nil.
- start/2: Reset the grammar to its initial state, and set the starting rule name.
- step/2: Step once in the input Tokenizer, using grammar.
Types
@type error() :: {error_type(), term()}
@type error_type() :: :no_clause_matched | :no_token
@type first() :: term()
@type rules() :: %{required(Grammar.Rule.name()) => Grammar.Rule.t()}
@type t() :: %Grammar{
  firsts: %{required(Grammar.Rule.name()) => [first()]},
  heap: list(),
  rules: rules(),
  stack: list()
}
Functions
add_clause(grammar, name, substitution, function, epsilon? \\ false)
@spec add_clause(
  t(),
  Grammar.Rule.name(),
  Grammar.Clause.substitution(),
  Grammar.Clause.callback(),
  boolean()
) :: t()
Add a clause to the grammar.
A clause is defined by a name, a substitution, a function to execute when the clause is matched, and an epsilon flag.
Clauses sharing the same name are considered as clauses of a single rule.
The substitution is a list of terms which declares the steps to match the clause, from left to right, first to last.
Each term can be either
- a rule name, which is an atom,
- a value for which there is a TokenExtractor protocol implementation.
TokenExtractor implementations are provided for BitString and Regex, and can be extended to custom token types.
The function is a callback to execute when the clause is fully matched. It is given a list as parameter, each element of that list is the value produced by the substitution of each term.
The epsilon flag indicates if the rule is mandatory or not.
⚠️ The epsilon flag is used only when the first clause of a rule is added; on subsequent clauses it is ignored, even though it defaults to false.
Example
The second clause of :start is marked as epsilon, but it is ignored.
iex> g = Grammar.new()
...> |> Grammar.add_clause(:start, ["hello"], fn [value] -> value end)
...> |> Grammar.add_clause(:start, ["world"], fn [value] -> value end, true) # `true` is ignored !
...> |> Grammar.prepare!()
...> |> Grammar.start(:start)
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("hello"))
{:ok, "hello"}
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("world"))
{:ok, "world"}
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("pouet"))
{:error, {1, 1}, :no_clause_matched}
The first clause of :start is marked as epsilon!
iex> g = Grammar.new()
...> |> Grammar.add_clause(:start, ["hello"], fn [value] -> value end, true) # `true` is used !
...> |> Grammar.add_clause(:start, ["world"], fn [value] -> value end, false) # `false` is ignored !
...> |> Grammar.prepare!()
...> |> Grammar.start(:start)
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("hello"))
{:ok, "hello"}
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("world"))
{:ok, "world"}
...>
iex> Grammar.loop(g, Grammar.Tokenizer.new("pouet"))
{:ok, nil}
loop(grammar, tokenizer)
@spec loop(t(), Grammar.Tokenizer.t()) :: {:ok, any()} | {:error, error()}
Process the input tokenizer, until completion or error.
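For example, a minimal sketch of branching on the outcome (grammar is assumed to be a prepared grammar and input an input string; the error shape follows the doctests on this page):

case Grammar.loop(grammar, Grammar.Tokenizer.new(input)) do
  {:ok, result} ->
    result

  {:error, {line, column}, reason} ->
    # e.g. {:error, {1, 1}, :no_clause_matched}, as in the doctests above
    raise "parse error at #{line}:#{column} (#{inspect(reason)})"
end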
new()
@spec new() :: t()
Create a new empty Grammar.
prepare(grammar)
@spec prepare(t()) ::
  {:ok, t()}
  | {:error, :missing_rules, [Grammar.RulesChecker.miss()]}
  | {:error, :cycles_found, [Grammar.RulesChecker.path()]}
  | {:error, :ambiguities_found, [Grammar.RulesChecker.ambiguity()]}
Use this function after defining all the rules of the grammar to prepare the grammar for parsing.
This function will proceed to grammar validation, and returns errors if any are found.
⚠️ A grammar that is not prepared cannot be used for parsing: step/2 and loop/2 will behave unexpectedly.
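As a sketch, one might handle each validation outcome explicitly; the rules below are illustrative, and the error shapes come from the @spec above:

grammar =
  Grammar.new()
  |> Grammar.add_clause(:start, ["hello", :what], fn ["hello", what] -> "hello #{what} !" end)
  |> Grammar.add_clause(:what, [~r/[a-zA-Z]+/], fn [ident] -> ident end)

case Grammar.prepare(grammar) do
  {:ok, grammar} ->
    # Ready for parsing: set the start rule, then loop over the input.
    grammar
    |> Grammar.start(:start)
    |> Grammar.loop(Grammar.Tokenizer.new("hello world"))

  {:error, :missing_rules, misses} ->
    {:error, {:missing_rules, misses}}

  {:error, :cycles_found, paths} ->
    {:error, {:cycles_found, paths}}

  {:error, :ambiguities_found, ambiguities} ->
    {:error, {:ambiguities_found, ambiguities}}
end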
prepare!(grammar)
Same as prepare/1 but raises a RuntimeError if an error is found.
Use this macro to define rules of your grammar.
The first rule defined will be the entry rule of the grammar.
Calls to this macro sharing the same name are grouped together, as they define the same rule; each call is a possible path in the rule resolution. Let's name a single call to rule a clause.
All clauses must be disjoint, i.e. they must not share the same first token. They can be understood as the "or" operator in a rule.
Each rule clause is defined by
- a name, which is an atom
- a definition, which is a list of atoms or token prototypes
- a block, which is the code to execute when the clause is fully matched
When executed, the code block is provided with a params binding, which is a list of the results of the clause steps.
In the case where a rule cannot be matched, a RuntimeError is raised (see rule?/2 for a relaxed version).
Example
iex> defmodule NumberOfNameListParser do
...> use Grammar
...>
...> rule start("[", :list_or_empty_list) do
...> [_, list] = params
...> list || []
...> end
...>
...> rule? list_or_empty_list(:item, :list_tail, "]") do
...> [item, list_tail, _] = params
...> [item | (list_tail || [])]
...> end
...>
...> rule? list_tail(",", :item, :list_tail) do
...> [_, item, list_tail] = params
...> [item | (list_tail || [])]
...> end
...>
...> rule item(~r/[0-9]+/) do
...> [number] = params
...> String.to_integer(number)
...> end
...>
...> rule item(~r/[a-zA-Z]+/) do
...> [string] = params
...> string
...> end
...> end
iex> GrammarTest.NumberOfNameListParser.parse("[1, toto, 23]")
{:ok, [1, "toto", 23]}
Same as rule/2 but relaxed: if the rule cannot be matched, it will be valued as nil.
Useful for optional or recursive rules.
See example in rule/2.
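In addition to the rule/2 example, here is a small hedged sketch (module and rule names are made up) of an optional trailing rule, valued as nil when absent:

defmodule GreeterParser do
  use Grammar

  rule start("hello", :who) do
    [_hello, who] = params
    # :who is declared with rule?, so it is nil when nothing matched
    "hello #{who || "stranger"}"
  end

  rule? who(~r/[a-zA-Z]+/) do
    [name] = params
    name
  end
end

# Expected under these assumptions:
# GreeterParser.parse("hello alice") #=> {:ok, "hello alice"}
# GreeterParser.parse("hello")       #=> {:ok, "hello stranger"}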
start(grammar, name)
@spec start(t(), Grammar.Rule.name()) :: t()
Reset the grammar to its initial state, and set the starting rule name.
Usually this function is called once before loop/2 to process some input.
It can also be used to start parsing at different rules within the same grammar, mainly for testing during development.
Example
iex> g = Grammar.new()
...> |> Grammar.add_clause(:start, ["<", :lists?, ">"], fn [_, ls, _] -> ls || [] end)
...> |> Grammar.add_clause(:lists?, [:list, :lists?], fn [l, ls] -> [l | ls || []] end, true)
...> |> Grammar.add_clause(:list, ["[", :elements?, "]"], fn [_, es, _] -> es || [] end)
...> |> Grammar.add_clause(:elements?, [:element, :elements?], fn [e, es] -> [e | es || []] end, true)
...> |> Grammar.add_clause(:element, [~r/[a-z]+/], fn [value] -> value end)
...> |> Grammar.prepare!()
...>
iex> g
...> |> Grammar.start(:start)
...> |> Grammar.loop(Grammar.Tokenizer.new("<[a b c] [dd ee ff]>"))
{:ok, [["a", "b", "c"], ["dd", "ee", "ff"]]}
...>
iex> g = Grammar.start(g, :element)
...> Grammar.loop(g, Grammar.Tokenizer.new("<[a b c] [dd ee ff]>"))
{:error, {1, 1}, :no_clause_matched}
iex> Grammar.loop(g, Grammar.Tokenizer.new("ff"))
{:ok, "ff"}
iex> g = Grammar.start(g, :list)
...> Grammar.loop(g, Grammar.Tokenizer.new("<[a b c] [dd ee ff]>"))
{:error, {1, 1}, :no_clause_matched}
iex> Grammar.loop(g, Grammar.Tokenizer.new("[a b c]"))
{:ok, ["a", "b", "c"]}
step(grammar, tokenizer)
@spec step(t(), Grammar.Tokenizer.t()) ::
  {:cont, t(), Grammar.Tokenizer.t()}
  | {:halt, :eof, t(), Grammar.Tokenizer.t()}
  | {:halt, error(), t(), Grammar.Tokenizer.t()}
Step once in the input Tokenizer, using grammar.
This function returns a tuple whose first term is either :cont or :halt.
- :cont: the second term is the updated grammar and the third term is the updated tokenizer.
- :halt: the second term is either :eof or an error tuple, the third term is the grammar and the fourth term is the tokenizer (see the @spec above).
This function is used internally by loop/2, though it can be used to manually step through the parsing process.
Example
It takes 5 steps to reach the first element of the input list.
iex> grammar = Grammar.new()
...> |> Grammar.add_clause(:start, [:loop?], &Enum.at(&1, 0, []))
...> |> Grammar.add_clause(:loop?, [:element, :loop?], fn [head, tail] -> [head | tail || []] end, true)
...> |> Grammar.add_clause(:element, [:ident], &Enum.at(&1, 0))
...> |> Grammar.add_clause(:ident, [~r/[a-z]+/], fn [ident] -> ident end)
...> |> Grammar.prepare!()
...> |> Grammar.start(:start)
iex> tokenizer = Grammar.Tokenizer.new("a b c d")
iex> {:cont, grammar, tokenizer} = Grammar.step(grammar, tokenizer)
iex> {:cont, grammar, tokenizer} = Grammar.step(grammar, tokenizer)
iex> {:cont, grammar, tokenizer} = Grammar.step(grammar, tokenizer)
iex> {:cont, grammar, tokenizer} = Grammar.step(grammar, tokenizer)
iex> {:cont, grammar, tokenizer} = Grammar.step(grammar, tokenizer)
iex> [{_callback, 1, ["a"]} | _] = grammar.heap
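Building on this, a hand-rolled driver in terms of step/2 might look like the sketch below. It is illustrative only, not the actual implementation of loop/2; in particular, what to extract from the grammar at :eof is an assumption (the raw heap is returned here).

defmodule ManualLoop do
  # Drive Grammar.step/2 until it halts, following the return shapes of
  # the @spec above.
  def run(grammar, tokenizer) do
    case Grammar.step(grammar, tokenizer) do
      {:cont, grammar, tokenizer} ->
        run(grammar, tokenizer)

      {:halt, :eof, grammar, _tokenizer} ->
        # Assumption: at end of input, the parse result lives on the heap.
        {:ok, grammar.heap}

      {:halt, error, _grammar, _tokenizer} ->
        {:error, error}
    end
  end
end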