Minipeg (Minipeg v0.6.5)

Minipeg is a minimal Parsing Expression Grammar (PEG) library.

Here is a first taste of how to use it:

iex(1)> an_a_parser = char_parser("a")
...(1)> parse_string(an_a_parser, "a")
{:ok, "a"}

iex(2)> an_a_parser = char_parser("a")
...(2)> parse_string(an_a_parser, "b")
{:error, "b not member of \"a\" in char_parser(\"a\") (in char_parser(\"a\")) in <binary>:1,1"}

The first thing to note here is that in these doctests we have imported functions from Minipeg as follows (see the moduledocs of the corresponding modules for details):

    import Minipeg.Parser, only: [parse_string: 2]
    import Minipeg.{Combinators, Parsers}

Basic Usage

Quite a small subset of predefined parsers and combinators would suffice to parse any context free language. However, many of the patterns used to parse programming languages, little languages or small languages that exceed the practicality of regular expressions are quite verbose.

Therefore Minipeg predefines parsers that can be easily parametrized. These parsers will be described in the utility parsers section.

Parsing Single Characters

By a character we do indeed mean a UTF-8 code point.
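To make the terminology concrete, here is a small sketch in plain Elixir (no Minipeg functions involved) that contrasts code points with graphemes. How char_parser treats combining sequences is not stated here, so take this only as an illustration of the two terms.

```elixir
# "e" followed by U+0301 (combining acute accent): one grapheme, two code points.
decomposed = "e\u0301"

IO.inspect(decomposed |> String.codepoints() |> length())   # 2
IO.inspect(decomposed |> String.graphemes() |> length())    # 1

# The precomposed form U+00E9 is a single code point.
IO.inspect("\u00E9" |> String.codepoints() |> length())     # 1
```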

The most basic parser is the...

char_parser

... which parses any character

iex(3)> parse_string(char_parser(), "h")
{:ok, "h"}


iex(4)> parse_string(char_parser(), "é")
{:ok, "é"}


iex(5)> parse_string(char_parser(), "✓")
{:ok, "✓"}

iex(6)> parse_string(char_parser(), "")
{:error, "encountered end of input (in char_parser()) in <binary>:1,1"}

It can, however, be parametrized to parse only characters of a given set. This set can be provided as a String or an Enumerable.

iex(7)> parse_string(char_parser("ab"), "b")
{:ok, "b"}

iex(8)> parse_string(char_parser(["b", "c"]), "b")
{:ok, "b"}


iex(9)> parse_string(char_parser(["b", "c"]), "a")
{:error, "a not member of \"bc\" in char_parser([\"b\", \"c\"]) (in char_parser([\"b\", \"c\"])) in <binary>:1,1"}

Frequently used character sets can be too large to spell out, and therefore some more specialised parsers have been defined:

Parsing POSIX character classes

If, instead of a string or list, we pass an atom into char_parser, it only parses a character that matches the corresponding character class as defined in POSIX regular expressions (these are also described in the docs of the Regex module). Here are the currently supported values:

    :alnum | :alpha | :blank | :cntrl | :digit | :graph | :lower | :print | :punct | :space | :upper | :word | :xdigit

iex(10)> parser = char_parser(:alnum)
...(10)> "aD7_%"
...(10)> |> String.graphemes
...(10)> |> Enum.map(&parse_string(parser, &1))
[
ok: "a",
ok: "D",
ok: "7",
error: "~r{\\A[[:alnum:]]}u does not match at 1,1 in char_parser(:alnum) (in char_parser(:alnum)) in <binary>:1,1",
error: "~r{\\A[[:alnum:]]}u does not match at 1,1 in char_parser(:alnum) (in char_parser(:alnum)) in <binary>:1,1"
]

escaped_char_parser

This parser helps to parse escaped characters. While one could do this quite easily as in the following example, note that just to get an escaped character two combinators, map and sequence, are needed:

iex(11)> escaped_quote_parser = sequence([
...(11)> char_parser("\\"), char_parser()])
...(11)> |> map(&Enum.at(&1, 1))
...(11)> { parse_string(escaped_quote_parser, "\\\""), parse_string(escaped_quote_parser, "\\'") }
{ {:ok, "\""}, {:ok, "'"} }

Compare this to the provided escaped_char_parser:

iex(12)> parse_string(escaped_char_parser(), "\\a")
{:ok, "a"}

We can also change the escape character

iex(13)> parse_string(escaped_char_parser("%"), "%a")
{:ok, "a"}

iex(14)> parse_string(escaped_char_parser("%"), "\\a")
{:error, "\\ not member of \"%\" in char_parser(\"%\") (in escaped_char_parser) in <binary>:1,1"}

Furthermore, we can restrict the set of characters that are allowed to be escaped:

iex(15)> parser = escaped_char_parser("\\", "escape only \\", "\\")
...(15)> { parse_string(parser, "\\\\"), parse_string(parser, "\\a") }
{ {:ok, "\\"}, {:error, "a not member of \"\\\\\" in char_parser(\"\\\\\") (in escaped_char_parser) in <binary>:1,1"} }

Parsing sequences of characters

Keywords: keywords_parser

Does pretty much what is expected ;)

iex(16)> kwd_parser = keywords_parser(["do", "else", "if"])
...(16)> ["do", "if", "for"]
...(16)> |> Enum.map(&parse_string(kwd_parser, &1))
[
ok: "do",
ok: "if",
error: "no alternative could be parsed in keywords_parser([\"do\", \"else\", \"if\"]) (in keywords_parser([\"do\", \"else\", \"if\"])) in <binary>:1,1"
]

Identifiers: ident_parser

An identifier is defined by a character class for its first character and a character class for its subsequent characters, so one could define it roughly as

    sequence([
      first_char_parser,
      many(second_char_parser)])

And that is how the ident_parser is actually defined

iex(17)> parse_string(ident_parser(), "hello_42")
{:ok, "hello_42"}

iex(18)> parse_string(ident_parser(), "42hello_world")
{:error, "~r{\\A[[:alpha:]]}u does not match at 1,1 in char_parser(:alpha) (in char_parser(:alpha)) in <binary>:1,1"}

In Lisp we prefer - to _, no problem

iex(19)> parse_string(ident_parser("Lisp Style", additional_chars: "-"), "hello-42")
{:ok, "hello-42"}

Even the parsers for the first character and for the following characters can be defined:

iex(20)> register_parser = ident_parser(
...(20)>   "Uppercase and digit",
...(20)>   first_char_parser: char_parser(:upper),
...(20)>   rest_char_parser: char_parser(:digit),
...(20)>   additional_chars: nil,
...(20)>   max_len: 2,
...(20)>   min_len: 2)
...(20)> [
...(20)>   parse_string(register_parser, "R2"),
...(20)>   parse_string(register_parser, "X_"),
...(20)>   parse_string(register_parser, "R12"),
...(20)>   parse_string(register_parser, "a2"),
...(20)>   parse_string(register_parser, "ab")
...(20)> ]
[
ok: "R2",
error: "string \"X\" length 1 under required minimum 2 (in Uppercase and digit) in <binary>:1,1",
error: "string \"R12\" length 3 exceeds allowed 2 (in Uppercase and digit) in <binary>:1,1",
error: "~r{\\A[[:upper:]]}u does not match at 1,1 in char_parser(:upper) (in char_parser(:upper)) in <binary>:1,1",
error: "~r{\\A[[:upper:]]}u does not match at 1,1 in char_parser(:upper) (in char_parser(:upper)) in <binary>:1,1"
]

In some environments we would like to restrict the length of an identifier

iex(21)> dos_name_parser = ident_parser("dos name parser", max_len: 8)
...(21)> [
...(21)>   parse_string(dos_name_parser, "dosok"),
...(21)>   parse_string(dos_name_parser, "way_too_long")
...(21)> ]
[
ok: "dosok",
error: "string \"way_too_long\" length 12 exceeds allowed 8 (in dos name parser) in <binary>:1,1"
]


Regex Parsers

Sometimes it is cumbersome to specify a parser that can be expressed simply with a regular expression. A good example of this would be the ident_parser from above.

A Regex Parser will always create an anchored regex which will be matched against the start of the input. As long as you avoid backtracking or unbounded lookahead, the performance should be at the same level as writing a parser "by hand".

The Regex that will be used in the parser is compiled as follows from the string parameter specifying it:

    Regex.compile!("\\A" <> param, [:unicode])
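What the anchoring means in practice can be seen with the Regex module alone. The following sketch uses only standard library calls; param is just an example value (the pattern of the atom parser shown further down), and no Minipeg function is involved.

```elixir
param = ":[[:alpha:]][[:word:]]*"
rgx = Regex.compile!("\\A" <> param, [:unicode])

# The regex matches only at the very start of the input ...
IO.inspect(Regex.run(rgx, ":atom_42 rest"))   # [":atom_42"]

# ... and never further inside it.
IO.inspect(Regex.run(rgx, "x :atom_42"))      # nil
```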

rgx_parser

This is the basic parser that creates a regular expression as described above. If it parses, it puts the result of Regex.run(compiled_rgx, input.input) into the ast field of the Success structure; the matching string is, of course, removed from the returned input.

iex(22)> atom_parser = rgx_parser(":[[:alpha:]][[:word:]]*", "rgx based atom parser")
...(22)> Parser.parse(atom_parser, Input.new(":atom_42"), %Cache{})
%Success{ast: [":atom_42"], cache: %Cache{}, parsed_by: "rgx based atom parser", rest: %Input{col: 9, context: %{}, input: "", lnb: 1}}

iex(23)> atom_parser = rgx_parser(":[[:alpha:]][[:word:]]*", "rgx based atom parser")
...(23)> parse_string(atom_parser, "hello")
{:error, "~r{\\A:[[:alpha:]][[:word:]]*}u does not match at 1,1 in rgx based atom parser (in rgx based atom parser) in <binary>:1,1"}

rgx_match_parser

Oftentimes we will only want the whole match and do not care about captures; enter rgx_match_parser:

iex(24)> atom_parser = rgx_match_parser(":[[:alpha:]][[:word:]]*", "rgx based atom parser")
...(24)> Parser.parse(atom_parser, Input.new(":atom_42"), %Cache{})
%Success{ast: ":atom_42", cache: %Cache{}, parsed_by: "rgx based atom parser", rest: %Input{col: 9, context: %{}, input: "", lnb: 1}}

The :unicode option will always be used in the compiled regex; however, one can add other options. A nice addition is the :extended option:

iex(25)> a_list_parser = rgx_match_parser(" a (?: , a)+ ", nil, [:extended])
...(25)> parse_string(a_list_parser, "a,a,abc")
{:ok, "a,a,a"}

rgx_capture_parser

Also quite often we are only interested in one capture

iex(26)> number_parser = rgx_capture_parser("\\s*(\\d+)") |> map(&String.to_integer/1)
...(26)> parse_string(number_parser, "  42")
{:ok, 42}

Token Parser, defining Tokens with Regular Expressions

Let us start with a simple example that demonstrates the concept

iex(27)> tokens = [
...(27)>   {:number, "\\d+"},
...(27)>   {:number, "\\+(\\d+)"} ]
...(27)> parser = token_parser(tokens)
...(27)> assert parse_string(parser, "42") == {:ok, {:number, ["42"]}}
...(27)> assert parse_string(parser, "+42") == {:ok, {:number,  ["+42", "42"]}}

Postprocessing

We have all captures in the ast because, of course, they might be needed. In our case this is not desired, and as we want to convert the values anyway, this can be achieved with post-processing, as follows:

iex(28)> tokens = [
...(28)>   {:number, "\\d+", fn [n] -> {:number, String.to_integer(n)} end},
...(28)>   {:number, "\\+(\\d+)", fn [_, n] -> {:number, String.to_integer(n)} end} ]
...(28)> parser = token_parser(tokens)
...(28)> assert parse_string(parser, "42") == {:ok, {:number, 42}}
...(28)> assert parse_string(parser, "+42") == {:ok, {:number, 42}}

There are many use cases in which simple grammars can be implemented concisely this way, especially if they are not recursive. Here is a real-world example, used by colorize.

Note, however, that defining a dedicated parser for colors might be the better alternative if a stricter parser is wanted, e.g. one restricted to certain colors.

iex(29)> tokens = [
...(29)>     {:verb, "\\$\\$", fn _ -> {:verb, "$"} end},
...(29)>     {:reset, "\\$"},
...(29)>     {:reset, "<reset>"},
...(29)>     {:verb, "<<", fn _ -> {:verb, "<"} end},
...(29)>     {:color, "<([^,]+),([^,>]+)>", fn [_, col, style] ->  {:color, col, style } end},
...(29)>     {:color, "<([^,>]+)>"},
...(29)>     {:verb, "[^<\\$]+"} ]
...(29)>
...(29)> color_parser = many(token_parser(tokens, flatten_ast: true))
...(29)> {
...(29)>  parse_string(color_parser, "$$"),
...(29)>  parse_string(color_parser, "<red>$$$")
...(29)> }
{
{:ok, [verb: "$"]},
{:ok, [color: "red", verb: "$", reset: "$"]}
}

Flatten the AST

If however all we need in the AST is the first capture or the whole match, the flatten_ast option can be used:

iex(30)> tokens = [
...(30)>   {:number, "\\s*(\\d+)"},
...(30)>   {:name, "\\s*(\\w+)"},
...(30)>   {:any, ".+"} ]
...(30)> parser = many(token_parser(tokens, flatten_ast: true))
...(30)> parse_string(parser, " 42 hello ,x")
{:ok, [number: "42", name: "hello", any: " ,x"]}

Using the skip option

In many cases the above pattern of ignoring whitespace before tokens repeats itself. We can express this simply by passing a regular expression or string to the skip: option.

iex(31)> tokens = [
...(31)>   {:number, "\\d+"},
...(31)>   {:name, "\\w+"},
...(31)>   {:any, ".+"} ]
...(31)> parser = many(token_parser(tokens, flatten_ast: true, skip: "\\s+"))
...(31)> # "\\s*"  would work here but zero width matches are just so dangerous in parsing
...(31)> parse_string(parser, " 42 hello ,x")
{:ok, [number: "42", name: "hello", any: ",x"]}

N.B. Now the whitespace is also removed from the :any token

Mixing Regular Expressions and Parsers

If we want to structure a parser around token_parser even if not everything can be expressed (or is desired to be expressed) in a regular expression, we can simply replace the regular expression with another parser...

iex(32)> ab_parser = rgx_parser("(a+)(b+)") |> satisfy(fn [_, as, bs] -> String.length(as) == String.length(bs) end)
...(32)> tokens = [
...(32)>    abs: ab_parser,
...(32)>    other: ".*" ]
...(32)> parser = token_parser(tokens)
...(32)> assert parse_string(parser, "aabb") == {:ok, {:abs, ["aabb", "aa", "bb"]}}
...(32)> assert parse_string(parser, "aaabb") == {:ok, {:other, ["aaabb"]}}
...(32)> assert parse_string(parser, "aabbb") == {:ok, {:other, ["aabbb"]}}

Note on Performance and Style: For long inputs ab_parser should perhaps have been written as shown below. However, this recursion would have had to be expressed with the Y-Combinator inside a doctest, which I do not think would have made for a readable doctest. Also, for many practical purposes a regular expression with a satisfy clause might be at least as performant as a complicated grammar.

    def ab_parser, do: select([sequence([char_parser("a"), lazy(fn -> ab_parser() end), char_parser("b")]), empty_parser()])

Essential Combinators

many

Parses an input if a parser can be applied many times to it. This means that many can parse an empty input unless a minimum count is specified:

iex(33)> a_parser = many(char_parser("a"))
...(33)> assert parse_string(a_parser, "") == {:ok, []}
...(33)> assert parse_string(a_parser, "aaa") == {:ok, ~W[a a a]}

But if we specify a min_count...

iex(34)> a_parser = many(char_parser("a"), "at least one", 1)
...(34)> assert parse_string(a_parser, "") == {:error, "Missing 1 parses in many (in at least one) in <binary>:1,1"}
...(34)> assert parse_string(a_parser, "aaa") == {:ok, ~W[a a a]}

Oftentimes many will be mapped to a string, like this:

    many(some_parser) |> map(&IO.chardata_to_string/1)

This can just be abbreviated to many_as_string(some_parser)

iex(35)> a_parser = many_as_string(char_parser("a"))
...(35)> parse_string(a_parser, "aaa")
{:ok, "aaa"}

But as IO.chardata_to_string is used, we can convert deeper structures too:

iex(36)> ab_parser = many_as_string(sequence([ 
...(36)>   many(char_parser("a"), nil, 1),
...(36)>   many(char_parser("b"), nil, 1)
...(36)> ]))
...(36)> parse_string(ab_parser, "aabbb")
{:ok, "aabbb"}

Be careful with nil values in your ast, as IO.chardata_to_string does not support them. Use maybe_as_empty for these cases:

iex(37)> one_or_two = many_as_string(sequence([ 
...(37)> char_parser("a"), maybe_as_empty(char_parser("a"))]))
...(37)> assert parse_string(one_or_two, "aa") == {:ok, "aa"}
...(37)> assert parse_string(one_or_two, "a") == {:ok, "a"}
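The reason for this caveat can be checked with IO.chardata_to_string directly. This sketch uses only the standard library; the nested lists merely imitate the shape of an ast and are not produced by Minipeg here. That an empty string takes the place of nil is an assumption based on the name maybe_as_empty.

```elixir
# Nested lists of strings are valid chardata and flatten to one string.
IO.inspect(IO.chardata_to_string([["a", "a"], ["b", "b", "b"]]))   # "aabbb"

# An empty string is harmless ...
IO.inspect(IO.chardata_to_string(["a", ""]))                       # "a"

# ... whereas nil is not chardata and raises an ArgumentError.
result =
  try do
    IO.chardata_to_string(["a", nil])
  rescue
    e in ArgumentError -> {:raised, e.__struct__}
  end

IO.inspect(result)                                                 # {:raised, ArgumentError}
```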

satisfy

This creates parsers with constraints by applying a validation function to the ast of a successful parser invocation (failures are passed through, of course).

The validation function can either return a tuple {:ok, new_ast}|{:error, :reason} or simply a truth value. In the latter case the original ast is kept on success, and a generic error message is generated when the validation function returns false or nil.

As a consequence, the tuple form will very often be the preferred result of the validation function in production code.

iex(38)> vowel_parser = char_parser()
...(38)> |> satisfy(&Enum.member?(~W[a], &1)) # The famous Restricted Vowel Set ;)
...(38)> assert parse_string(vowel_parser, "a") == {:ok, "a"}
...(38)> assert parse_string(vowel_parser, "b") == {:error, "satisfier char_parser() returned false (in char_parser()) in <binary>:1,1"}

It might be preferable to be clearer

iex(39)> vowel_parser = char_parser()
...(39)> |> satisfy(
...(39)> fn letter -> if Enum.member?(~W[a], letter), do: {:ok, :a}, else: {:error, "Not an A"} end,
...(39)> "restricted vowel parser")
...(39)> assert parse_string(vowel_parser, "a") == {:ok, :a}
...(39)> assert parse_string(vowel_parser, "b") == {:error, "Not an A (in restricted vowel parser) in <binary>:1,1"}

sequence

Takes a list of Parsers and only parses if all of them parse one after the other on the given input. It returns a list of the results of each parser.

Let us reimplement the keywords parser:

iex(40)> if_parser = sequence([char_parser("i"), char_parser("f")])
...(40)> |> map(&Enum.join/1)
...(40)> ~w[if else]
...(40)> |> Enum.map(&parse_string(if_parser, &1))
[
ok: "if",
error: "e not member of \"i\" in char_parser(\"i\") (in char_parser(\"i\")) in <binary>:1,1"
]

This leads us directly to

map

map takes a parser and a mapping function. It returns a new parser that fails with exactly the same error message as its input parser, but succeeds with the result mapped by the mapping function.

iex(41)> list_parser = many(char_parser()) |> map(&Enum.join(&1, ", "))
...(41)> parse_string(list_parser, "abc")
{:ok, "a, b, c"}

Oftentimes we will want the position of a parsed string to be included in the ast. An obvious use case is to identify where in the source a semantic error has occurred.

Enter ...

mapp

This example also demonstrates the ignore combinator, whose result will be ignored in sequence:

iex(42)> a_parser = sequence([ws_parser() |> ignore(), char_parser("a") |> mapp(&{&1, &2})])
...(42)> [parse_string(a_parser, "  a"), parse_string(a_parser, "")]
[
ok: [{"a", {3, 1}}],
error: "encountered end of input (in char_parser(\"a\")) in <binary>:1,1"
]

Of course, ignore does not mean that the input does not need to parse:

iex(43)> a_parser = sequence([char_parser("b") |> ignore(), char_parser("a") |> mapp(&{&1, &2})])
...(43)> parse_string(a_parser, "a")
{
:error, "a not member of \"b\" in char_parser(\"b\") (in char_parser(\"b\")) in <binary>:1,1"
}

It is also quite normal to just append the position to the ast, so the above can be written more simply with ...

with_pos

iex(44)> a_parser = sequence([ws_parser() |> ignore(), char_parser("a") |> with_pos()])
...(44)> [parse_string(a_parser, "  a"), parse_string(a_parser, "")]
[
ok: [{"a", {3, 1}}],
error: "encountered end of input (in char_parser(\"a\")) in <binary>:1,1"
]

select

Oftentimes select is (maybe better) named choice; we have therefore defined an alias.

iex(45)> vowel_parser = select(~W[a e i o u y] |> Enum.map(&char_parser/1), "vowel_parser")
...(45)> ~W[a u y x] |> Enum.map(&parse_string(vowel_parser, &1))
[
ok: "a",
ok: "u",
ok: "y",
error: "no alternative could be parsed in vowel_parser (in vowel_parser) in <binary>:1,1"
]

And the aliased option

iex(46)> parser = option([char_parser("a"), char_parser("b")], "option_parser")
...(46)> ~W[b x] |> Enum.map(&parse_string(parser, &1))
[
ok: "b",
error: "no alternative could be parsed in option_parser (in option_parser) in <binary>:1,1"
]


many_sel

Just a shortcut for many(select(...))

iex(47)> parser = many_sel([char_parser("a"), char_parser("b")])
...(47)> parse_string(parser, "abba")
{:ok, ~W[a b b a]}

many_seq

Just a shortcut for many(sequence(...))

iex(48)> parser = many_seq([char_parser("a"), char_parser("b")])
...(48)> parse_string(parser, "abab")
{:ok, [~W[a b], ~W[a b]]}

The result shows that these are indeed two nested parsers, which is important to take into consideration. One will therefore often do something like the following:

iex(49)> parser = many_seq([char_parser("a"), char_parser("b")]) |> map(&IO.chardata_to_string/1)
...(49)> parse_string(parser, "abab")
{:ok, "abab"}

Convenience Combinators

What about whitespace

Oftentimes whitespace shall be ignored in the resulting ast, and sometimes in the input too. To be more precise: when whitespace stops the parsing of, for example, a keyword, the subsequent parser is often not interested in the leftover whitespace preceding its new input.

Enter ignore_ws

Here is the form that does not ignore newlines, which is the default:

iex(50)> next_char_parser = ignore_ws(char_parser())
...(50)> parse_string(next_char_parser, " \ta")
{:ok, "a"}

iex(51)> next_a_parser = ignore_ws(char_parser("a"))
...(51)> parse_string(next_a_parser, " \na")
{:error, "\n not member of \"a\" in char_parser(\"a\") (in char_parser(\"a\")) in <binary>:1,2"}

But we can also use the newline allowing version

iex(52)> next_a_parser = ignore_ws(char_parser("a"), "skip newlines", true)
...(52)> parse_string(next_a_parser, " \na")
{:ok, "a"}

upto_parser_parser

Oftentimes parsing algorithms become more readable and maintainable when we reparse a part of the input stream with a different parser. In order to be able to do this, we can just parse up to a part of the input stream defined by a parser and return the input stream up to that point as a String.

N.B. This convenience comes at a price: the parser will try to match at every position in the input stream until it succeeds or fails on an empty input. Hence use with care.

iex(53)> upto_end_parser = upto_parser_parser(keywords_parser(~W[end]))
...(53)> parse_string(upto_end_parser, "up to end")
{:ok, "up to "}

If, however, the parser never succeeds, the upto_parser_parser fails.

iex(54)> upto_end_parser = upto_parser_parser(keywords_parser(~W[end]))
...(54)> parse_string(upto_end_parser, "up to en")
{:error, "encountered end of input (in upto_parser_parser(keywords_parser([\"end\"]), keep)) in <binary>:1,9"}
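The cost mentioned in the N.B. above can be made visible with a toy scan in plain Elixir. This is only a sketch of the general strategy, not Minipeg's implementation: the module UptoSketch and its function upto are hypothetical, and an anchored regex stands in for the inner parser.

```elixir
defmodule UptoSketch do
  # Hypothetical helper: tries the anchored regex at every position, left to right.
  def upto(input, anchored_rgx), do: scan(input, anchored_rgx, "")

  # Input exhausted without a single match: the whole scan fails.
  defp scan("", _rgx, _seen), do: {:error, "encountered end of input"}

  defp scan(input, rgx, seen) do
    if Regex.match?(rgx, input) do
      # Return what was skipped so far and leave the match in the rest.
      {:ok, seen, input}
    else
      # One more failed attempt; advance by one code point and retry.
      {cp, rest} = String.next_codepoint(input)
      scan(rest, rgx, seen <> cp)
    end
  end
end

IO.inspect(UptoSketch.upto("up to end", ~r/\Aend/))   # {:ok, "up to ", "end"}
IO.inspect(UptoSketch.upto("up to en", ~r/\Aend/))    # {:error, "encountered end of input"}
```

In the failing case the inner match is attempted once for every position of the input, which is why long inputs with a late or missing terminator are expensive.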

We can also ask to include the ast from the parser into the result

iex(55)> upto_end_parser = upto_parser_parser(keywords_parser(~W[end]), "my parser", :include)
...(55)> parse_string(upto_end_parser, "up to end")
{:ok, {"up to ", "end"}}

Or ask to discard it, which is not the default (the default is :keep):

iex(56)> keep_parser = upto_parser_parser(keywords_parser(~W[end]), "my parser", :keep)
...(56)> discard_parser = upto_parser_parser(keywords_parser(~W[end]), "my other parser", :discard)
...(56)> [ parse(keep_parser, "up to end"), parse(discard_parser, "up to end")]
[
%Minipeg.Success{ast: "up to ", cache: %Minipeg.Cache{cache: %{}}, parsed_at: {7, 1}, parsed_by: "my parser", rest: %Minipeg.Input{input: "end", col: 7, lnb: 1}},
%Minipeg.Success{ast: "up to ", cache: %Minipeg.Cache{cache: %{}}, parsed_at: {7, 1}, parsed_by: "my other parser", rest: %Minipeg.Input{input: "", col: 10, lnb: 1}}
]

Error Handling...

has been enhanced a little bit in version 0.6.0: we have two new combinators that allow for better error messages. Much can still be done, I guess. However, as we will demonstrate now, with map_error one can draw on the data in %Failure{}, although the Failure struct might perhaps provide a richer interface.

map_error

...

Definitions

Parser

A Parser is a struct that parses an Input struct (with the parse function) and either returns a Success or Failure struct

In order to abstract the internal representations of input and results, the parse_string function is provided, as shown in the examples above.

The Success struct contains the resulting Abstract Syntax Tree and the rest of the input as an Input struct.

The Failure struct contains the original Input struct and an error message

Internally a Cache is already returned (and passed into subsequent parse calls of the Parser module) but unless you are extending Minipeg itself by defining parsers by hand instead of using Combinators you can ignore this.

Combinator

A Combinator is a function that takes a Parser and, optionally, some arguments, and returns a new Parser.
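The two definitions can be illustrated with a deliberately tiny model in plain Elixir. This is a sketch under simplifying assumptions, not Minipeg's internals: the module ToyPeg is hypothetical, a parser is modelled as a bare function instead of a struct, and plain tuples stand in for the Success and Failure structs.

```elixir
defmodule ToyPeg do
  # A toy parser is a function: input -> {:ok, ast, rest} | {:error, message}.

  # Parser constructor: succeeds on the first code point if it is a member of `set`.
  def char(set) do
    fn input ->
      case String.next_codepoint(input) do
        nil ->
          {:error, "encountered end of input"}

        {c, rest} ->
          if String.contains?(set, c),
            do: {:ok, c, rest},
            else: {:error, "#{c} not member of #{inspect(set)}"}
      end
    end
  end

  # Combinator: takes a parser and a function, returns a new parser.
  # Failures of the inner parser are passed through unchanged.
  def map(parser, fun) do
    fn input ->
      with {:ok, ast, rest} <- parser.(input), do: {:ok, fun.(ast), rest}
    end
  end
end

upcase_a = ToyPeg.char("a") |> ToyPeg.map(&String.upcase/1)

IO.inspect(upcase_a.("abc"))   # {:ok, "A", "bc"}
IO.inspect(upcase_a.("xyz"))   # {:error, "x not member of \"a\""}
```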