Parser and serializer for the GNU gettext PO/POT format.
Reads a .po/.pot catalog (text) and returns a structured parsed_catalog();
dump/1 is the inverse path. All the logic is hand-rolled recursive descent,
dependency-free, honoring the nine PO-semantics decisions (PSD-001..009).
What it does and what problem it solves
Turns the raw bytes of a .po into data the rest of the library consumes
(erli18n_server calls this module at the start of the load pipeline). The nine
decisions in one sentence each:
- PSD-001: #, fuzzy entries are dropped by default (parity with msgfmt).
- PSD-002: the Content-Type charset is normalized to utf8 | latin1 | us_ascii.
- PSD-003: an empty translation (<<>>) is preserved; the fallback is the responsibility of whoever does the lookup, not the parser.
- PSD-004: Plural-Forms is preserved raw; only nplurals is extracted here.
- PSD-005: a UTF-8 BOM is stripped silently before any processing.
- PSD-006: msgctxt is a separate field, never byte-glued to the msgid.
- PSD-007: obsolete entries (#~) are dropped.
- PSD-008: a degenerate plural (nplurals=1) is accepted; validate_plural_indices/3 treats nplurals=1 as a valid index set ([0]), parity with the Asian rules (ja/zh/ko/vi/th).
- PSD-009: the msgstr[N] index set is validated against nplurals.
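To make PSD-008 and PSD-009 concrete, here is a minimal sketch of an index-set check. It is not the module's validate_plural_indices/3: the module name, the function name check_indices/3 and the choice to report the indices sorted are assumptions made for illustration. Only the error tuple shape follows parse_error().

```erlang
-module(plural_index_sketch).
-export([check_indices/3]).

%% Hypothetical helper: accepts the forms only when their indices are
%% exactly 0..NPlurals-1. With NPlurals = 1 the valid set is [0].
-spec check_indices(binary(), pos_integer(), [{non_neg_integer(), binary()}]) ->
    ok | {error, {plural_count_mismatch, binary(), non_neg_integer(), [non_neg_integer()]}}.
check_indices(Msgid, NPlurals, Forms) ->
    %% ASSUMPTION: the reported indices are sorted; the real module may not sort.
    Got = lists:sort([I || {I, _Translation} <- Forms]),
    Expected = lists:seq(0, NPlurals - 1),
    case Got =:= Expected of
        true -> ok;
        false -> {error, {plural_count_mismatch, Msgid, NPlurals, Got}}
    end.
```

A catalog with nplurals=2 that carries only msgstr[0] and msgstr[2] would be rejected by this check, while a single msgstr[0] under nplurals=1 passes.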
Mental model
This module is PURE and STATELESS: no ETS, no process dictionary,
no application:env. Each parse/2 call carries only the binary you
passed; parse_file/2 just prepends a file:read_file/1. Errors
become data ({error, parse_error()}), not dead processes.
The input is UNTRUSTED (the multi-tenant threat model in SECURITY.md): a
tenant may upload an adversarial .po. Hence the contract is "parsing
errors become structured errors, never silent crashes nor unbounded
memory growth". Two concrete defenses live here:
- A cap by digit COUNT before any binary_to_integer over attacker input (?MAX_INT_DIGITS), at the two sites that read integers from the .po: the nplurals= of the header (collect_digits/2) and the msgstr[N] index (parse_msgstr_index/2). Without it, a run of thousands of digits would build an O(d^2) bignum or hit system_limit.
- bins_to_binary/1 materializes large strings in LINEAR time (left-side accumulator + iolist_to_binary/1); the naive form with the right-side accumulator was Θ(n²) and stalled the loader for seconds on a single large msgid.
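The two defenses can be sketched as follows. This is an illustration under stated assumptions, not the module's collect_digits/2 or bins_to_binary/1: the module name, both function names, the error atoms and the cap value of 9 are hypothetical (the real ?MAX_INT_DIGITS is not stated on this page).

```erlang
-module(untrusted_int_sketch).
-export([parse_capped_int/1, join_chunks/1]).

%% ASSUMPTION: hypothetical cap; the real ?MAX_INT_DIGITS value is not given here.
-define(MAX_DIGITS, 9).

-spec parse_capped_int(binary()) ->
    {ok, non_neg_integer(), binary()} | {error, no_digits | too_many_digits}.
parse_capped_int(Bin) ->
    take_digits(Bin, 0).

%% N is the number of leading digits seen so far. Bail out as soon as the
%% run exceeds the cap, before any digit is converted.
take_digits(_Bin, N) when N > ?MAX_DIGITS ->
    {error, too_many_digits};
take_digits(Bin, N) ->
    case Bin of
        <<_:N/binary, D, _/binary>> when D >= $0, D =< $9 ->
            take_digits(Bin, N + 1);
        _ when N =:= 0 ->
            {error, no_digits};
        _ ->
            <<Digits:N/binary, Rest/binary>> = Bin,
            {ok, binary_to_integer(Digits), Rest}
    end.

%% Chunks arrive newest-first (each one was consed onto a list in O(1)).
%% Reversing once and flattening once keeps the total work linear.
-spec join_chunks([binary()]) -> binary().
join_chunks(NewestFirst) ->
    iolist_to_binary(lists:reverse(NewestFirst)).
```

The point of counting before converting is that the work done on a hostile digit run is bounded by the cap rather than by the length of the run.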
The parse/2 pipeline is TWO-PASS, because the body charset is only
known after reading the header:
- A prepass (extract_header_charset/1) reads the raw bytes (the header is always ASCII-safe per the GNU spec) and discovers the charset.
- normalize_input/2 transcodes the entire body from that charset to UTF-8.
- The line-by-line parse runs over UTF-8, with the charset still threaded so \xHH/\OOO escapes are interpreted in the declared code space BEFORE the UTF-8 gate (two-phase decode, decode_quoted_string/2 + reassemble_field/2). Prepass and builder use the SAME field reconciler (field_charset/1), so they never diverge (that divergence was a badmatch that took down the gen_server on a Content-Type with a space before the :).
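The transcoding step could look roughly like the sketch below. It is not normalize_input/2: the module name, the function to_utf8/2, the label binaries and the detail terms inside the error tuple are assumptions. Only the {charset_conversion, Label, Detail} shape and the three charset atoms come from this page.

```erlang
-module(charset_sketch).
-export([to_utf8/2]).

-spec to_utf8(binary(), utf8 | latin1 | us_ascii) ->
    {ok, binary()} | {error, {charset_conversion, binary(), term()}}.
to_utf8(Bin, utf8) ->
    %% Converting UTF-8 to UTF-8 acts as a validity check.
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Out when is_binary(Out) -> {ok, Out};
        %% Failure is {error, _, _} or {incomplete, _, _}; keep only the tag.
        Failure -> {error, {charset_conversion, <<"utf8">>, element(1, Failure)}}
    end;
to_utf8(Bin, latin1) ->
    %% Every byte is a valid Latin-1 code point, so this cannot fail.
    {ok, unicode:characters_to_binary(Bin, latin1, utf8)};
to_utf8(Bin, us_ascii) ->
    case first_non_ascii(Bin) of
        none -> {ok, Bin};
        {found, Byte} ->
            %% ASSUMPTION: the detail term is invented for this sketch.
            {error, {charset_conversion, <<"us_ascii">>, {non_ascii_byte, Byte}}}
    end.

first_non_ascii(<<B, _/binary>>) when B > 127 -> {found, B};
first_non_ascii(<<_, Rest/binary>>) -> first_non_ascii(Rest);
first_non_ascii(<<>>) -> none.
```

Failures come back as data in the same style as the rest of the public API, so a bad upload does not crash the caller.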
LF, CRLF and lone-CR (classic Mac) line endings are all accepted.
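As a rough illustration of the BOM strip (PSD-005) and the tolerant line splitting, here is a small sketch. It is not the module's own line reader; the module and function names are hypothetical. It relies on binary:split/3 preferring the longest pattern at a given position, so a CRLF pair counts as one line break rather than two.

```erlang
-module(po_lines_sketch).
-export([strip_bom/1, split_lines/1]).

%% Drops a leading UTF-8 byte order mark (EF BB BF) if present.
-spec strip_bom(binary()) -> binary().
strip_bom(<<16#EF, 16#BB, 16#BF, Rest/binary>>) -> Rest;
strip_bom(Bin) when is_binary(Bin) -> Bin.

%% Splits on CRLF, LF or lone CR. When several patterns match at the same
%% position, the longest one wins, so "\r\n" is never split twice.
-spec split_lines(binary()) -> [binary()].
split_lines(Bin) ->
    binary:split(Bin, [<<"\r\n">>, <<"\n">>, <<"\r">>], [global]).
```

Note that an input ending in a line break yields a trailing empty binary, which a real parser would treat as a blank line.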
When you touch this module
- Loading a catalog: erli18n_server reads the file on its own (file:read_file/1) and calls parse/2 underneath; parse_file/1,2 is a convenience/test helper, NOT the production path. You rarely call it directly.
- Validating/inspecting a .po in a tool or test: parse/1 or parse/2.
- Roundtrip / programmatic rewrite: parse/1 -> edit -> dump/1.
Quickstart
1> Po = <<"msgid \"\"\n"
.. "msgstr \"Content-Type: text/plain; charset=UTF-8\\n\"\n"
.. "\n"
.. "msgid \"Hello\"\n"
.. "msgstr \"Ola\"\n">>.
2> {ok, Catalog} = erli18n_po:parse(Po).
3> maps:get(entries, Catalog).
[{singular,undefined,<<"Hello">>,<<"Ola">>}]
4> maps:get(charset, maps:get(header, Catalog)).
utf8
5> erli18n_po:parse(erli18n_po:dump(Catalog)) =:= {ok, Catalog}.
true
Key functions
Input: parse/1, parse/2, parse_file/1, parse_file/2. Output: dump/1.
Result type: parsed_catalog/0; an entry is an entry/0; errors are a
parse_error/0.
Types
-type context() :: undefined | binary().
-type entry() :: {singular, context(), msgid(), translation()} | {plural, context(), msgid(), msgid_plural(), [{plural_index(), translation()}]}.
A catalog entry, in one of two shapes.
{singular, Context, Msgid, Translation} — a 1:1 translation. Context is
undefined (no msgctxt) or a binary. Translation may be <<>> (PSD-003:
the empty value is preserved, it does not become a fallback here).
{plural, Context, Msgid, MsgidPlural, Forms} — a translation with plurals.
MsgidPlural is the plural form from the source or undefined (degenerate
case: only msgstr[N] without an explicit msgid_plural). Forms is a list
[{plural_index(), translation()}] ORDERED by index, validated against
nplurals (PSD-009).
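A consumer of entry() typically pattern-matches on the two tuple shapes. The sketch below builds a lookup map keyed by {Context, Msgid}; the module name, the function index/1 and the shape of the stored values are hypothetical and are not part of this library's API.

```erlang
-module(entry_index_sketch).
-export([index/1]).

%% Keys are {Context, Msgid}, so the same msgid under two different
%% contexts never collides (in the spirit of PSD-006).
-spec index([tuple()]) -> map().
index(Entries) ->
    lists:foldl(fun add/2, #{}, Entries).

add({singular, Ctx, Msgid, Translation}, Acc) ->
    %% An empty Translation is stored as-is; fallback is the caller's job.
    Acc#{{Ctx, Msgid} => {singular, Translation}};
add({plural, Ctx, Msgid, _MsgidPlural, Forms}, Acc) ->
    %% Forms is a list of {Index, Translation} pairs.
    Acc#{{Ctx, Msgid} => {plural, maps:from_list(Forms)}}.
```

An entry of any other shape would raise function_clause here, which is a deliberate choice for a sketch that only consumes what the parser produces.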
-type file_read_error() :: file:posix() | badarg | terminated | system_limit.
-type header_map() :: #{plural_forms => binary(), content_type => binary(), charset => utf8 | latin1 | us_ascii, raw => binary()}.
Catalog header, already reconciled.
charset is the normalized atom (PSD-002). plural_forms is the RAW string of
the Plural-Forms field (PSD-004): this module does NOT evaluate it — only
erli18n_plural does; here it is preserved for downstream. content_type is
the raw value of the field of the same name. raw is the entire msgstr text
of the header, used by dump/1 to re-emit the header faithfully. A catalog
without a header of its own gets a synthetic header with charset => utf8 and
the other fields empty.
-type msgid() :: binary().
-type msgid_plural() :: undefined | binary().
-type parse_error() :: {unsupported_charset, binary()} | {charset_conversion, binary(), term()} | {plural_count_mismatch, msgid(), Expected :: non_neg_integer(), Got :: [non_neg_integer()]} | {syntax_error, Line :: pos_integer(), Reason :: term()} | {file_error, file_read_error()}.
Structured parse error — the only "normal" failure mode of the public API.
- {unsupported_charset, Declared}: the Content-Type declared a charset that does not map to utf8 | latin1 | us_ascii.
- {charset_conversion, Label, Detail}: the bytes do not match the declared charset (e.g. invalid UTF-8, a byte outside US-ASCII).
- {plural_count_mismatch, Msgid, Expected, Got}: the msgstr[N] indices do not form exactly [0..Expected-1] (PSD-009).
- {syntax_error, Line, Reason}: malformed line; Reason is term() and also carries the escape-decode errors (e.g. escape_invalid_utf8, octal_escape_out_of_range) without widening the exported tuple.
- {file_error, Posix}: only parse_file/1,2: the disk read failed.
-type parse_opts() :: #{include_fuzzy => boolean()}.
Parse options. Today there is only one key: include_fuzzy.
With include_fuzzy => false (default), entries marked #, fuzzy are
dropped on flush (parity with msgfmt). With true, they are kept. An
empty map #{} inherits all the defaults.
-type parsed_catalog() :: #{header := header_map(), entries := [entry()]}.
Result of a successful parse.
header is the header_map() (always present: synthesized empty if the .po
had no header of its own). entries is in FILE ORDER. It is exactly the
shape that dump/1 consumes.
The roundtrip law parse(dump(C)) =:= {ok, C} holds for catalogs whose header
was parsed from a .po WITH a header of its own (raw =/= <<>>). When the
catalog came from an input WITHOUT a header (a synthetic header with
raw => <<>> and content_type => <<>>), dump/1 materializes a minimal
Content-Type; on re-parse, that field becomes populated and the catalog
differs from the original at that point. See dump/1 for the detail.
-type plural_index() :: non_neg_integer().
-type translation() :: binary().
Functions
-spec dump(parsed_catalog()) -> binary().
Serializes a parsed_catalog() back to PO text (a UTF-8 binary).
Emits the header block first (msgid "" / msgstr "" plus the header raw,
or a minimal header Content-Type: text/plain; charset=UTF-8 when the raw is
empty or absent) and then each entry. singular entries produce
msgctxt/msgid/msgstr; plural entries re-emit the retained
msgid_plural (finding #14 — when it is undefined, the singular msgid
is used as a stand-in) and one msgstr[N] line per form. The strings are
re-escaped (\\, \", \n, \t, \r) so that parse(dump(C))
preserves the catalog. A total function: it always returns a binary().
Parameter: Catalog must be a valid parsed_catalog() (#{header := _, entries := _}) — typically the {ok, Catalog} from parse/2. A map without
the header/entries keys triggers function_clause (contract: it only
consumes what parse/2 produces). Each entry() must have the singular/plural
shape; a tuple of any other shape falls through dump_entry/1 and crashes.
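The re-escaping step can be pictured with the sketch below. It is not the module's encoder; the module name and escape/1 are hypothetical. It works byte by byte, which is safe for UTF-8 input because the five escaped bytes are all ASCII and never occur inside a multi-byte sequence.

```erlang
-module(po_escape_sketch).
-export([escape/1]).

%% Escapes the five characters that a quoted PO string cannot carry raw.
-spec escape(binary()) -> binary().
escape(Bin) ->
    iolist_to_binary([esc(B) || <<B>> <= Bin]).

esc($\\) -> <<"\\\\">>;   % backslash   -> \\
esc($")  -> <<"\\\"">>;   % quote       -> \"
esc($\n) -> <<"\\n">>;    % line feed   -> \n
esc($\t) -> <<"\\t">>;    % tab         -> \t
esc($\r) -> <<"\\r">>;    % carriage return -> \r
esc(B)   -> B.            % everything else passes through unchanged
```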
The minimal synthetic header is NOT emitted with the Content-Type glued onto
the msgstr line: dump_header_text/1 always emits msgstr "" and dumps the
header body as quoted CONTINUATION LINES (encode_header_line/1). So the actual
output for a catalog WITHOUT a header of its own is:
1> {ok, C} = erli18n_po:parse(<<"msgid \"Hi\"\nmsgstr \"Oi\"\n">>).
2> erli18n_po:dump(C).
<<"msgid \"\"\nmsgstr \"\"\n"
"\"Content-Type: text/plain; charset=UTF-8\\n\"\n\n"
"msgid \"Hi\"\nmsgstr \"Oi\"\n\n">>
WATCH OUT for the roundtrip: the .po above has no header, so C carries a
synthetic header (raw => <<>>, content_type => <<>>). dump/1 injects a
minimal Content-Type; on re-parse, that field is no longer empty, so the
catalog differs from the original and equality is FALSE:
3> erli18n_po:parse(erli18n_po:dump(C)) =:= {ok, C}.
false
The law parse(dump(C)) =:= {ok, C} only holds for catalogs that ALREADY had a
header of their own (raw =/= <<>>) — exactly the case of the quickstart in the
moduledoc, which does return true.
Inverse path of parse/1 / parse/2. See parsed_catalog/0 and entry/0.
-spec parse(binary()) -> {ok, parsed_catalog()} | {error, parse_error()}.
Parses a PO catalog from a binary, with default options
(include_fuzzy => false).
Equivalent to parse(Bin, #{}). Returns {ok, parsed_catalog()} with the
normalized header and the list of entries (in file order), or
{error, parse_error()} if the charset is invalid, the conversion fails, there
is a syntax error, or the plural indices diverge from nplurals.
1> erli18n_po:parse(<<"msgid \"Hello\"\nmsgstr \"Ola\"\n">>).
{ok,#{header => #{charset => utf8,content_type => <<>>,
plural_forms => <<>>,raw => <<>>},
entries => [{singular,undefined,<<"Hello">>,<<"Ola">>}]}}
See parse/2 for the full semantics of options and the pipeline, and dump/1
for the inverse path.
-spec parse(binary(), parse_opts()) -> {ok, parsed_catalog()} | {error, parse_error()}.
Parses a PO catalog from a binary, honoring Opts.
Bin is the raw content of the .po; Opts is a parse_opts() — today only
include_fuzzy => boolean() (default false: entries marked #, fuzzy are
dropped, parity with msgfmt). The flow: (1) silent strip of the UTF-8 BOM
(PSD-005); (2) a prepass that extracts the charset from the Content-Type
header via the same field reconciler as build_header/1, ensuring that prepass
and builder never diverge (finding #5 — closes the badmatch on a
Content-Type with a space before the :); (3) normalization of the entire body
from the discovered charset to UTF-8; (4) a line-by-line parse with the charset
threaded so \xHH/\OOO escapes are transcoded through the right code space.
Returns {ok, parsed_catalog()} (#{header => header_map(), entries => [entry()]}) or {error, parse_error()}. Without an explicit header, it
synthesizes an empty header with charset utf8. Accepts LF, CRLF and lone-CR
line endings (finding #15).
Parameters:
- Bin: raw content of the .po/.pot. Treated as UNTRUSTED: an nplurals= or msgstr[N] with an absurd run of digits is rejected in O(1) (cap by ?MAX_INT_DIGITS) and never builds a bignum.
- Opts: see parse_opts/0. include_fuzzy controls whether #, fuzzy entries enter the result.
Failure modes (all {error, parse_error()}, never a crash): an unsupported
declared charset, bytes that do not match the charset, plural indices that
diverge from nplurals, and malformed lines (with line number).
1> Fuzzy = <<"#, fuzzy\nmsgid \"a\"\nmsgstr \"b\"\n">>.
2> {ok, C0} = erli18n_po:parse(Fuzzy, #{}).
3> maps:get(entries, C0).
[]
4> {ok, C1} = erli18n_po:parse(Fuzzy, #{include_fuzzy => true}).
5> maps:get(entries, C1).
[{singular,undefined,<<"a">>,<<"b">>}]
6> erli18n_po:parse(<<"msgid \"a\"\nmsgstr \"b\"\n???\n">>).
{error,{syntax_error,3,{unrecognized_line,<<"???">>}}}
See parse/1 (defaults), parse_file/2 (from disk) and dump/1.
-spec parse_file(file:filename()) -> {ok, parsed_catalog()} | {error, parse_error()}.
Reads and parses a .po file from disk, with default options.
Equivalent to parse_file(Path, #{}). Reads Path with file:read_file/1 and
delegates to parse/2. Read errors become {error, {file_error, file_read_error()}}.
1> erli18n_po:parse_file(<<"priv/locale/fr/LC_MESSAGES/my_domain.po">>).
{ok,#{header => #{charset => utf8, ...}, entries => [...]}}
2> erli18n_po:parse_file(<<"/does/not/exist.po">>).
{error,{file_error,enoent}}
See parse_file/2 (with options) and parse/2 (the parse semantics themselves).
-spec parse_file(file:filename(), parse_opts()) -> {ok, parsed_catalog()} | {error, parse_error()}.
Reads and parses a .po file from disk, honoring Opts.
Reads Path with file:read_file/1; on success it delegates the binary to
parse/2 with Opts (see parse/2 for the semantics of the options and the
return). If the read fails, it returns {error, {file_error, Posix}}, where
Posix ranges over file:posix() | badarg | terminated | system_limit.
Parameters:
- Path: file path, any file:filename().
- Opts: passed untouched to parse/2; see parse_opts/0.
The only difference from parse/2 is the read phase: I/O errors become
{error, {file_error, Posix}}; everything already read follows exactly the
rules of parse/2.
1> erli18n_po:parse_file(<<"catalog.po">>, #{include_fuzzy => true}).
{ok,#{header => #{...}, entries => [...]}}
See parse_file/1 (defaults) and parse/2.
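The read-then-delegate behaviour described for parse_file/2 amounts to a thin wrapper, sketched here under assumptions: the module name and read_then_parse/2 are hypothetical, and the parser is passed in as a fun so the sketch stays self-contained instead of depending on erli18n_po.

```erlang
-module(read_then_parse_sketch).
-export([read_then_parse/2]).

%% Reads Path, then hands the bytes to ParseFun. Only the read phase is
%% handled here: I/O failures are wrapped as {file_error, Reason}, and
%% whatever ParseFun returns is passed through untouched.
-spec read_then_parse(file:name_all(), fun((binary()) -> Result)) ->
    Result | {error, {file_error, term()}}.
read_then_parse(Path, ParseFun) when is_function(ParseFun, 1) ->
    case file:read_file(Path) of
        {ok, Bin} -> ParseFun(Bin);
        {error, Reason} -> {error, {file_error, Reason}}
    end.
```

In a project that has this library available, ParseFun might be something like fun(Bin) -> erli18n_po:parse(Bin, Opts) end.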