Minipeg (Minipeg v0.5.0)
Minipeg is a minimal Parsing Expression Grammar (PEG) library.
Here is a first taste of how to use it:
iex(1)> an_a_parser = char_parser("a")
...(1)> parse_string(an_a_parser, "a")
{:ok, "a"}
iex(2)> an_a_parser = char_parser("a")
...(2)> parse_string(an_a_parser, "b")
{:error, "b not member of \"a\" in char_parser(\"a\") <binary>:1,1"}
The first thing to note is that these doctests import functions from Minipeg as follows; see the moduledocs of the corresponding modules for details:
import Minipeg.Parser, only: [parse_string: 2]
import Minipeg.{Combinators, Parsers}
Basic Usage
A rather small subset of predefined parsers and combinators would suffice to parse any context-free language. However, many of the patterns used to parse programming languages, little languages or small languages that exceed the practicality of regular expressions are quite verbose when written that way.
Therefore Minipeg predefines parsers that can be easily parametrized. These parsers will be described in the utility parsers section.
Parsing Single Characters
By a character we do indeed mean a UTF-8 encoded code point.
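The distinction between code points and user-perceived characters (graphemes) can be seen with the standard library alone. The sketch below does not use Minipeg; the module name `CodePointDemo` is made up for illustration, and how char_parser treats combining sequences is not stated here.

```elixir
defmodule CodePointDemo do
  # "é" written as a base letter followed by a combining acute accent.
  def decomposed, do: "e\u0301"

  # The same letter as a single precomposed code point (U+00E9).
  def precomposed, do: "\u00E9"

  # Counts graphemes and code points of a string.
  def counts(string) do
    %{
      graphemes: length(String.graphemes(string)),
      codepoints: length(String.codepoints(string))
    }
  end
end

# CodePointDemo.counts(CodePointDemo.decomposed())
# => %{graphemes: 1, codepoints: 2}
# CodePointDemo.counts(CodePointDemo.precomposed())
# => %{graphemes: 1, codepoints: 1}
```

If the input may contain decomposed sequences, normalizing it first (for example with `String.normalize/2`) is one way to make both forms behave alike.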
The most basic parser is the...
char_parser
... which parses any character
iex(3)> parse_string(char_parser(), "h")
{:ok, "h"}
iex(4)> parse_string(char_parser(), "é")
{:ok, "é"}
iex(5)> parse_string(char_parser(), "✓")
{:ok, "✓"}
iex(6)> parse_string(char_parser(), "")
{:error, "encountered end of input in char_parser() <binary>:1,1"}
It can however be parametrized to parse only characters of a given set, this set can be provided as
a String
or an Enumerable
iex(7)> parse_string(char_parser("ab"), "b")
{:ok, "b"}
iex(8)> parse_string(char_parser(["b", "c"]), "b")
{:ok, "b"}
iex(9)> parse_string(char_parser(["b", "c"]), "a")
{:error, "a not member of \"bc\" in char_parser([\"b\", \"c\"]) <binary>:1,1"}
Frequently used character sets might be too large to spell out, and therefore some more specialised parsers have been defined:
Parsing POSIX character classes
If instead of a string or list we pass an atom into char_parser
it only parses a character if it matches a character class as defined in POSIX regular expressions,
which are also described in the docs of the Regex
module, here are the currently supported values:
:alnum | :alpha | :blank | :cntrl | :digit | :graph | :lower | :print | :punct | :space | :upper | :word | :xdigit
iex(10)> parser = char_parser(:alnum)
...(10)> "aD7_%"
...(10)> |> String.graphemes
...(10)> |> Enum.map(&parse_string(parser, &1))
[
ok: "a",
ok: "D",
ok: "7",
error: "~r{\\A[[:alnum:]]}u does not match at 1,1 in char_parser(:alnum) <binary>:1,1",
error: "~r{\\A[[:alnum:]]}u does not match at 1,1 in char_parser(:alnum) <binary>:1,1"
]
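The error messages above show the pattern `~r{\A[[:alnum:]]}u`. As a rough illustration of what such a class check amounts to, here is a sketch built only on the standard `Regex` module. The module and function names are hypothetical and this is not Minipeg's implementation.

```elixir
defmodule PosixClassDemo do
  # Returns true when `input` starts with a character of the given POSIX class.
  def first_char_in_class?(input, class) when is_atom(class) do
    # Builds an anchored pattern such as \A[[:alnum:]] with the unicode option.
    regex = Regex.compile!("\\A[[:#{class}:]]", [:unicode])
    Regex.match?(regex, input)
  end
end

# "aD7_%"
# |> String.graphemes()
# |> Enum.map(&PosixClassDemo.first_char_in_class?(&1, :alnum))
# => [true, true, true, false, false]
```

Note that `_` is not part of `:alnum`; it is included in `:word`.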
escaped_char_parser
This parser helps to parse escaped characters. One could do this quite easily as in the following example, but note
that in order to get just the escaped character two combinators, map
and sequence
are needed
iex(11)> escaped_quote_parser = sequence([
...(11)> char_parser("\\"), char_parser()])
...(11)> |> map(&Enum.at(&1, 1))
...(11)> { parse_string(escaped_quote_parser, "\\\""), parse_string(escaped_quote_parser, "\\'") }
{ {:ok, "\""}, {:ok, "'"} }
Compare this to the provided escaped_char_parser
:
iex(12)> parse_string(escaped_char_parser(), "\\a")
{:ok, "a"}
We can also change the escape character
iex(13)> parse_string(escaped_char_parser("%"), "%a")
{:ok, "a"}
iex(14)> parse_string(escaped_char_parser("%"), "\\a")
{:error, "\\ not member of \"%\" in char_parser(\"%\") in escaped_char_parser(%) <binary>:1,1"}
Furthermore, we can restrict the set of characters that are allowed to be escaped
iex(15)> parser = escaped_char_parser("\\", "escape only \\", "\\")
...(15)> { parse_string(parser, "\\\\"), parse_string(parser, "\\a") }
{ {:ok, "\\"}, {:error, "a not member of \"\\\\\" in char_parser(\"\\\\\") in escape only \\ <binary>:1,1"} }
Parsing sequences of characters
Keywords: keywords_parser
Does pretty much what is expected ;)
iex(16)> kwd_parser = keywords_parser(["do", "else", "if"])
...(16)> ["do", "if", "for"]
...(16)> |> Enum.map(&parse_string(kwd_parser, &1))
[
ok: "do",
ok: "if",
error: "no alternative could be parsed in keywords_parser([\"do\", \"else\", \"if\"]) <binary>:1,1"
]
Identifiers: ident_parser
An identifier is defined by a character class for its first character and a character class for its subsequent characters, so one could define it roughly as
sequence([
first_char_parser,
many(second_char_parser)])
And that is how the ident_parser
is actually defined
iex(17)> parse_string(ident_parser(), "hello_42")
{:ok, "hello_42"}
iex(18)> parse_string(ident_parser(), "42hello_world")
{:error, "~r{\\A[[:alpha:]]}u does not match at 1,1 in char_parser(:alpha) in sequence <binary>:1,1"}
In Lisp we prefer -
to _
, no problem
iex(19)> parse_string(ident_parser("Lisp Style", additional_chars: "-"), "hello-42")
{:ok, "hello-42"}
But even the parser for the first character and the following characters can be defined
iex(20)> register_parser = ident_parser(
...(20)> "Uppercase and digit",
...(20)> first_char_parser: char_parser(:upper),
...(20)> rest_char_parser: char_parser(:digit),
...(20)> additional_chars: nil,
...(20)> max_len: 2,
...(20)> min_len: 2)
...(20)> [
...(20)> parse_string(register_parser, "R2"),
...(20)> parse_string(register_parser, "X_"),
...(20)> parse_string(register_parser, "R12"),
...(20)> parse_string(register_parser, "a2"),
...(20)> parse_string(register_parser, "ab")
...(20)> ]
[
ok: "R2",
error: "string \"X\" length 1 under required minimum 2 <binary>:1,1",
error: "string \"R12\" length 3 exceeds allowed 2 <binary>:1,1",
error: "~r{\\A[[:upper:]]}u does not match at 1,1 in char_parser(:upper) in sequence <binary>:1,1",
error: "~r{\\A[[:upper:]]}u does not match at 1,1 in char_parser(:upper) in sequence <binary>:1,1"
]
In some environments we would like to restrict the length of an identifier
iex(21)> dos_name_parser = ident_parser("dos name parser", max_len: 8)
...(21)> [
...(21)> parse_string(dos_name_parser, "dosok"),
...(21)> parse_string(dos_name_parser, "way_too_long")
...(21)> ]
[
ok: "dosok",
error: "string \"way_too_long\" length 12 exceeds allowed 8 <binary>:1,1"
]
Regex Parsers
Sometimes it is cumbersome to specify a parser that can be expressed simply with a regular expression.
A good example of this would be the ident_parser
from above.
A Regex Parser will always create an anchored regex which will be matched against the start of the input. As long as you avoid backtracking or unbounded lookahead the performance should be at the same level as writing a parser "by hand".
The Regex
that will be used in the parser will be compiled as follows from the string parameter specifying
it:
Regex.compile!("\\A" <> param, [:unicode])
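To see the effect of the `\A` anchor in isolation, the following sketch compiles a pattern in the same way and splits an input into the matched prefix and the rest. It uses only the standard `Regex` module; `AnchoredRegexDemo` and its functions are hypothetical names, and the return shape is an assumption chosen for the example, not Minipeg's Success or Failure structs.

```elixir
defmodule AnchoredRegexDemo do
  # Compiles `param` anchored at the start of the input, always with :unicode.
  def compile(param, extra_opts \\ []) do
    Regex.compile!("\\A" <> param, [:unicode | extra_opts])
  end

  # Returns {:ok, match, rest} if the input starts with a match, else :error.
  def match_prefix(param, input) do
    case Regex.run(compile(param), input) do
      [match | _captures] ->
        consumed = byte_size(match)
        {:ok, match, binary_part(input, consumed, byte_size(input) - consumed)}

      nil ->
        :error
    end
  end
end

# AnchoredRegexDemo.match_prefix(":[[:alpha:]][[:word:]]*", ":atom_42 rest")
# => {:ok, ":atom_42", " rest"}
# AnchoredRegexDemo.match_prefix(":[[:alpha:]][[:word:]]*", "x :atom_42")
# => :error
```

Because of the anchor, a match that only occurs later in the input is treated as a failure, which is what a parser working on the current input position needs.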
rgx_parser
This is the basic parser that creates a regular expression as described above and if it parses puts the result of
Regex.run(compiled_rgx, input.input)
into the ast field of the Success
structure, of course the matching string is
removed from the returned input
iex(22)> atom_parser = rgx_parser(":[[:alpha:]][[:word:]]*", "rgx based atom parser")
...(22)> Parser.parse(atom_parser, Input.new(":atom_42"), %Cache{})
%Success{ast: [":atom_42"], cache: %Cache{}, rest: %Input{col: 9, context: %{}, input: "", lnb: 1}}
iex(23)> atom_parser = rgx_parser(":[[:alpha:]][[:word:]]*", "rgx based atom parser")
...(23)> parse_string(atom_parser, "hello")
{:error, "~r{\\A:[[:alpha:]][[:word:]]*}u does not match at 1,1 in rgx based atom parser <binary>:1,1"}
rgx_match_parser
Oftentimes we will only want the whole match and do not care about captures; enter rgx_match_parser
iex(24)> atom_parser = rgx_match_parser(":[[:alpha:]][[:word:]]*", "rgx based atom parser")
...(24)> Parser.parse(atom_parser, Input.new(":atom_42"), %Cache{})
%Success{ast: ":atom_42", cache: %Cache{}, rest: %Input{col: 9, context: %{}, input: "", lnb: 1}}
The :unicode
will always be used in the compiled regex, however one can add other options
A nice addition is the :extended
option
iex(25)> a_list_parser = rgx_match_parser(" a (?: , a)+ ", nil, [:extended])
...(25)> parse_string(a_list_parser, "a,a,abc")
{:ok, "a,a,a"}
rgx_capture_parser
Also quite often we are only interested in one capture
iex(26)> number_parser = rgx_capture_parser("\\s*(\\d+)") |> map(&String.to_integer/1)
...(26)> parse_string(number_parser, " 42")
{:ok, 42}
Essential Combinators
many
Parses an input by applying a parser zero or more times to it. This means that many
can parse
an empty input unless a minimum count is specified:
iex(27)> a_parser = many(char_parser("a"))
...(27)> assert parse_string(a_parser, "") == {:ok, []}
...(27)> assert parse_string(a_parser, "aaa") == {:ok, ~W[a a a]}
But if we specify a min_count
...
iex(28)> a_parser = many(char_parser("a"), "at least one", 1)
...(28)> assert parse_string(a_parser, "") == {:error, "Missing 1 parses in many in at least one <binary>:1,1"}
...(28)> assert parse_string(a_parser, "aaa") == {:ok, ~W[a a a]}
sequence
Takes a list of Parsers
, and only parses if all of them parse one after another on the given
input; it then returns a list of the results of each parser.
Let us reimplement the keywords parser
iex(29)> if_parser = sequence([char_parser("i"), char_parser("f")])
...(29)> |> map(&Enum.join/1)
...(29)> ~w[if else]
...(29)> |> Enum.map(&parse_string(if_parser, &1))
[
ok: "if",
error: "e not member of \"i\" in char_parser(\"i\") in sequence <binary>:1,1"
]
This leads us directly to
map
map takes a parser and a mapping function. It returns a new parser that fails with exactly the same error message as its input parser, but succeeds with the result mapped by the mapping function.
iex(30)> list_parser = many(char_parser()) |> map(&Enum.join(&1, ", "))
...(30)> parse_string(list_parser, "abc")
{:ok, "a, b, c"}
Oftentimes we will want the position of a parsed string to be included in the ast. An obvious use case is identifying where in the source a semantic error has occurred.
Enter mapp
This example also demonstrates the ignore
combinator, whose result is left out of the list returned by sequence
iex(31)> a_parser = sequence([ws_parser() |> ignore(), char_parser("a") |> mapp(&{&1, &2})])
...(31)> [parse_string(a_parser, " a"), parse_string(a_parser, "")]
[
ok: [{"a", {3, 1}}],
error: "encountered end of input in char_parser(\"a\") in sequence <binary>:1,1"
]
ignore
does of course not mean that the input does not need to parse
iex(32)> a_parser = sequence([char_parser("b") |> ignore(), char_parser("a") |> mapp(&{&1, &2})])
...(32)> parse_string(a_parser, "a")
{
:error, "a not member of \"b\" in char_parser(\"b\") in sequence <binary>:1,1"
}
select
Oftentimes select
is (perhaps better) named choice;
we have therefore defined an alias for it
iex(33)> vowel_parser = select(~W[a e i o u y] |> Enum.map(&char_parser/1), "vowel_parser")
...(33)> ~W[a u y x] |> Enum.map(&parse_string(vowel_parser, &1))
[
ok: "a",
ok: "u",
ok: "y",
error: "no alternative could be parsed in vowel_parser <binary>:1,1"
]
And the aliased option
iex(34)> parser = option([char_parser("a"), char_parser("b")], "option_parser")
...(34)> ~W[b x] |> Enum.map(&parse_string(parser, &1))
[
ok: "b",
error: "no alternative could be parsed in option_parser <binary>:1,1"
]
many_sel
Just a shortcut for many(select(...
iex(35)> parser = many_sel([char_parser("a"), char_parser("b")])
...(35)> parse_string(parser, "abba")
{:ok, ~W[a b b a]}
many_seq
Just a shortcut for many(sequence(...
iex(36)> parser = many_seq([char_parser("a"), char_parser("b")])
...(36)> parse_string(parser, "abab")
{:ok, [~W[a b], ~W[a b]]}
In the result we can see that it is important to keep in mind that these are indeed two nested parsers, and one will often do something like the following
iex(37)> parser = many_seq([char_parser("a"), char_parser("b")]) |> map(&IO.chardata_to_string/1)
...(37)> parse_string(parser, "abab")
{:ok, "abab"}
Convenience Combinators
What about whitespace
Oftentimes whitespace shall be ignored in the resulting ast, and sometimes in the input too. To be more precise, when whitespace stops the parsing of, for example, a keyword, the subsequent parser is often not interested in the leftover whitespace preceding its new input.
Enter ignore_ws
Here is the form that does not ignore newlines, which is the default:
iex(38)> next_char_parser = ignore_ws(char_parser())
...(38)> parse_string(next_char_parser, " \ta")
{:ok, "a"}
iex(39)> next_a_parser = ignore_ws(char_parser("a"))
...(39)> parse_string(next_a_parser, " \na")
{:error, "\n not member of \"a\" in char_parser(\"a\") <binary>:1,2"}
But we can also use the newline allowing version
iex(40)> next_a_parser = ignore_ws(char_parser("a"), "skip newlines", true)
...(40)> parse_string(next_a_parser, " \na")
{:ok, "a"}
upto_parser_parser
Oftentimes parsing algorithms become more readable and maintainable when we reparse a part of the
input stream with a different parser. In order to be able to do this we can just parse up to
a part of the input stream defined by a parser and return the input stream up to that point
as a String
N.B. This convenience comes with a price, the parser
will try to match for every position
in the input stream until it succeeds or fails on an empty input. Hence use with care.
iex(41)> upto_end_parser = upto_parser_parser(keywords_parser(~W[end]))
...(41)> parse_string(upto_end_parser, "up to end")
{:ok, "up to "}
If however parser
never succeeds the upto_parser_parser
fails.
iex(42)> upto_end_parser = upto_parser_parser(keywords_parser(~W[end]))
...(42)> parse_string(upto_end_parser, "up to en")
{:error, "encountered end of input in upto_parser_parser(keywords_parser([\"end\"]), keep) <binary>:1,9"}
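The position-by-position search described in the note above can be sketched in plain Elixir. This is only an illustration of why the cost grows with the distance to the first match; parsers are modelled here as hypothetical functions returning `{:ok, ast, rest}` or `:error`, which is not how Minipeg represents them.

```elixir
defmodule UptoSketch do
  # Tries `parser` at every position until it succeeds or the input is empty.
  # On success returns the skipped text and the input starting at the match.
  def upto(parser, input), do: scan(parser, input, "")

  # Nothing left to try: the search fails.
  defp scan(_parser, "", _skipped), do: :error

  defp scan(parser, input, skipped) do
    case parser.(input) do
      {:ok, _ast, _rest} ->
        {:ok, skipped, input}

      :error ->
        # Move one code point forward and try again.
        {char, rest} = String.next_codepoint(input)
        scan(parser, rest, skipped <> char)
    end
  end
end

# A toy parser that succeeds only if the input starts with "end".
# end_parser = fn
#   "end" <> rest -> {:ok, "end", rest}
#   _ -> :error
# end
#
# UptoSketch.upto(end_parser, "up to end") # => {:ok, "up to ", "end"}
# UptoSketch.upto(end_parser, "up to en")  # => :error
```

In this sketch the inner parser is run once per input position, so an expensive inner parser on a long input multiplies the work.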
We can also ask to include the ast from the parser
into the result
iex(43)> upto_end_parser = upto_parser_parser(keywords_parser(~W[end]), "my parser", :include)
...(43)> parse_string(upto_end_parser, "up to end")
{:ok, {"up to ", "end"}}
Or to discard it, which is not the default case (which is :keep
)
iex(44)> keep_parser = upto_parser_parser(keywords_parser(~W[end]), "my parser", :keep)
...(44)> discard_parser = upto_parser_parser(keywords_parser(~W[end]), "my other parser", :discard)
...(44)> [ parse(keep_parser, "up to end"), parse(discard_parser, "up to end")]
[
%Minipeg.Success{ast: "up to ", cache: %Minipeg.Cache{cache: %{}}, parsed_at: {7, 1}, rest: %Minipeg.Input{input: "end", col: 7, lnb: 1}},
%Minipeg.Success{ast: "up to ", cache: %Minipeg.Cache{cache: %{}}, parsed_at: {7, 1}, rest: %Minipeg.Input{input: "", col: 10, lnb: 1}}
]
Definitions
Parser
A Parser
is a struct that parses an Input
struct (with the parse
function) and either returns a Success
or Failure
struct
In order to abstract the internal representations of input and results the parse_string
function is provided as shown in the examples above.
The Success
struct contains the resulting Abstract Syntax Tree and the rest of the input as an Input
struct.
The Failure
struct contains the original Input
struct and an error message
Internally a Cache
is already returned (and passed into subsequent parse
calls of the Parser
module) but unless
you are extending Minipeg
itself by defining parsers by hand instead of using Combinators
you can ignore this.
Combinator
A Combinator
is a function that takes a Parser
and optionally some arguments, and returns a new Parser
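The two definitions can be made concrete with a deliberately simplified model. In the sketch below a parser is just a function from an input string to `{:ok, ast, rest}` or `{:error, message}`, and a combinator is a function that takes such parsers and returns a new one. All names are hypothetical; Minipeg itself uses Parser, Input, Success and Failure structs, whose internals are not reproduced here.

```elixir
defmodule CombinatorSketch do
  # A parser for one code point that must be a member of `set`.
  def char(set) do
    fn input ->
      case String.next_codepoint(input) do
        nil ->
          {:error, "encountered end of input"}

        {c, rest} ->
          if String.contains?(set, c) do
            {:ok, c, rest}
          else
            {:error, "#{c} not member of #{inspect(set)}"}
          end
      end
    end
  end

  # Combinator: same failures as `parser`, successes transformed by `fun`.
  def map(parser, fun) do
    fn input ->
      case parser.(input) do
        {:ok, ast, rest} -> {:ok, fun.(ast), rest}
        error -> error
      end
    end
  end

  # Combinator: applies `parser` zero or more times and collects the results.
  # Assumes `parser` consumes input on success, otherwise this would loop.
  def many(parser) do
    fn input -> collect(parser, input, []) end
  end

  defp collect(parser, input, acc) do
    case parser.(input) do
      {:ok, ast, rest} -> collect(parser, rest, [ast | acc])
      {:error, _message} -> {:ok, Enum.reverse(acc), input}
    end
  end
end

# import CombinatorSketch
# word = many(char("ab")) |> map(&Enum.join/1)
# word.("abba!") # => {:ok, "abba", "!"}
```

The point of the model is that `map` and `many` never inspect the input themselves; they only wrap other parsers, which is what makes combinators composable.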