erli18n_po (erli18n v0.1.0)


Parser and serializer for the GNU gettext PO/POT format.

Reads a .po/.pot catalog (text) and returns a structured parsed_catalog(); dump/1 is the inverse path. All the logic is hand-rolled recursive descent, dependency-free, honoring the nine PO-semantics decisions (PSD-001..009).

What it does and what problem it solves

Turns the raw bytes of a .po into data the rest of the library consumes (erli18n_server calls this module at the start of the load pipeline). The nine decisions in one sentence each:

  • PSD-001: #, fuzzy entries are dropped by default (parity with msgfmt).
  • PSD-002: the Content-Type charset is normalized to utf8 | latin1 | us_ascii.
  • PSD-003: an empty translation (<<>>) is preserved; the fallback is the responsibility of whoever does the lookup, not the parser.
  • PSD-004: Plural-Forms is preserved raw; only nplurals is extracted here.
  • PSD-005: a UTF-8 BOM is stripped silently before any processing.
  • PSD-006: msgctxt is a separate field, never byte-glued to the msgid.
  • PSD-007: obsolete entries (#~) are dropped.
  • PSD-008: a degenerate plural (nplurals=1) is accepted; validate_plural_indices/3 treats [0] as the valid index set for nplurals=1, matching the single-form plural rules of languages such as ja/zh/ko/vi/th.
  • PSD-009: the msgstr[N] index set is validated against nplurals.
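
The index check behind PSD-008 and PSD-009 can be sketched as follows. This is a minimal illustration, not the module's validate_plural_indices/3: the module name plural_index_sketch, the function validate/3, its argument order, and the choice to report the sorted indices are all assumptions.

```erlang
-module(plural_index_sketch).
-export([validate/3]).

%% Hypothetical helper: checks that the msgstr[N] indices of one entry
%% form exactly [0..Nplurals-1] (PSD-009). Nplurals = 1 is a valid,
%% degenerate case whose only index is 0 (PSD-008).
-spec validate(binary(), pos_integer(), [non_neg_integer()]) ->
          ok | {error, {plural_count_mismatch, binary(),
                        pos_integer(), [non_neg_integer()]}}.
validate(Msgid, Nplurals, Indices) when is_integer(Nplurals), Nplurals >= 1 ->
    Expected = lists:seq(0, Nplurals - 1),
    %% Sorting makes the check independent of the order in the file;
    %% duplicates survive the sort and therefore fail the comparison.
    case lists:sort(Indices) of
        Expected -> ok;
        Sorted   -> {error, {plural_count_mismatch, Msgid, Nplurals, Sorted}}
    end.
```

With nplurals=2, the index set [2,0] is rejected because index 1 is missing and index 2 is out of range.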

Mental model

This module is PURE and STATELESS: no ETS, no process dictionary, no application environment. Each parse/2 call works only on the binary you pass in; parse_file/2 merely puts a file:read_file/1 call in front of it. Errors become data ({error, parse_error()}), not dead processes.

The input is UNTRUSTED (the multi-tenant threat model in SECURITY.md): a tenant may upload an adversarial .po. Hence the contract is "parsing errors become structured errors, never silent crashes nor unbounded memory growth". Two concrete defenses live here:

  • A cap by digit COUNT before any binary_to_integer over attacker input (?MAX_INT_DIGITS), at the two sites that read integers from the .po: the nplurals= of the header (collect_digits/2) and the msgstr[N] index (parse_msgstr_index/2). Without it, a run of thousands of digits would build an O(d^2) bignum or hit system_limit.
  • bins_to_binary/1 materializes large strings in LINEAR time (left-side accumulator + iolist_to_binary/1); the naive form with the right-side accumulator was Θ(n²) and stalled the loader for seconds on a single large msgid.
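
Both defenses can be illustrated with a small standalone sketch. It is not the module's code: the module name untrusted_input_sketch, the functions bounded_int/1 and join_chunks/1, and the cap value 9 are assumptions (this page does not state the real value of ?MAX_INT_DIGITS), and the nested-iolist accumulator is only one plausible reading of "left-side accumulator".

```erlang
-module(untrusted_input_sketch).
-export([bounded_int/1, join_chunks/1]).

%% Assumed cap; the real ?MAX_INT_DIGITS may differ.
-define(MAX_DIGITS, 9).

%% Reads a leading run of ASCII digits. The run is measured first and
%% refused once it exceeds ?MAX_DIGITS, so binary_to_integer/1 only
%% ever sees a short input.
-spec bounded_int(binary()) ->
          {ok, non_neg_integer(), Rest :: binary()} |
          {error, no_digits | too_many_digits}.
bounded_int(Bin) ->
    case count_digits(Bin, 0) of
        0 -> {error, no_digits};
        too_many -> {error, too_many_digits};
        N ->
            <<Digits:N/binary, Rest/binary>> = Bin,
            {ok, binary_to_integer(Digits), Rest}
    end.

%% Looks at no more than ?MAX_DIGITS + 1 bytes, whatever the input size.
count_digits(_Bin, N) when N > ?MAX_DIGITS -> too_many;
count_digits(Bin, N) ->
    case Bin of
        <<_:N/binary, D, _/binary>> when D >= $0, D =< $9 ->
            count_digits(Bin, N + 1);
        _ ->
            N
    end.

%% Accumulates chunks as a nested iolist (accumulator on the left) and
%% flattens it exactly once, so the total copying is linear.
-spec join_chunks([binary()]) -> binary().
join_chunks(Chunks) ->
    IoList = lists:foldl(fun(Chunk, Acc) -> [Acc, Chunk] end, [], Chunks),
    iolist_to_binary(IoList).
```

A run of 5000 digits is refused after the first ten bytes are inspected, without ever constructing the integer.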

The parse/2 pipeline is TWO-PASS, because the body charset is only known after reading the header:

  1. A prepass (extract_header_charset/1) reads the raw bytes (the header is always ASCII-safe per the GNU spec) and discovers the charset.
  2. normalize_input/2 transcodes the entire body from that charset to UTF-8.
  3. The line-by-line parse runs over UTF-8, with the charset still threaded through so that \xHH/\OOO escapes are interpreted in the declared code space BEFORE the UTF-8 gate (two-phase decode, decode_quoted_string/2 + reassemble_field/2). Prepass and builder use the SAME field reconciler (field_charset/1), so they never diverge; that divergence used to be a badmatch that took down the gen_server on a Content-Type with a space before the colon.
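
The BOM strip and the transcoding step can be sketched with the standard unicode module. This is an illustration under assumptions, not normalize_input/2 itself: the module name two_pass_sketch, the functions strip_bom/1 and to_utf8/2, and the Label and Detail values placed in the error tuple are hypothetical.

```erlang
-module(two_pass_sketch).
-export([strip_bom/1, to_utf8/2]).

%% Drops a leading UTF-8 byte order mark, if present (PSD-005).
-spec strip_bom(binary()) -> binary().
strip_bom(<<16#EF, 16#BB, 16#BF, Rest/binary>>) -> Rest;
strip_bom(Bin) -> Bin.

%% Transcodes the body from the charset found by the prepass to UTF-8.
%% The label and detail terms in the error tuple are assumptions.
-spec to_utf8(binary(), utf8 | latin1 | us_ascii) ->
          {ok, binary()} | {error, {charset_conversion, binary(), term()}}.
to_utf8(Bin, utf8) ->
    %% Re-encoding UTF-8 as UTF-8 acts as a validity check.
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Out when is_binary(Out) -> {ok, Out};
        _ErrorOrIncomplete ->
            {error, {charset_conversion, <<"utf8">>, invalid_utf8}}
    end;
to_utf8(Bin, latin1) ->
    %% Every byte is a valid Latin-1 code point, so this cannot fail.
    {ok, unicode:characters_to_binary(Bin, latin1, utf8)};
to_utf8(Bin, us_ascii) ->
    case is_ascii(Bin) of
        true  -> {ok, Bin};   % pure ASCII is already valid UTF-8
        false -> {error, {charset_conversion, <<"us_ascii">>, non_ascii_byte}}
    end.

is_ascii(<<>>) -> true;
is_ascii(<<B, Rest/binary>>) when B < 128 -> is_ascii(Rest);
is_ascii(_) -> false.
```

The same byte 16#E9 is a valid "é" under latin1 but an error under utf8 and us_ascii, which is why the charset must be known before the body is decoded.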

LF, CRLF and lone-CR (classic Mac) line endings are all accepted.
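
One way to accept all three line endings in a single linear pass is sketched below. The module name line_ending_sketch and the function split_lines/1 are hypothetical; how the parser itself splits lines is not described on this page.

```erlang
-module(line_ending_sketch).
-export([split_lines/1]).

%% Splits on CRLF, LF and lone CR. CRLF is tried first so that it is
%% never counted as two separate line breaks. A trailing line break
%% yields a final empty line.
-spec split_lines(binary()) -> [binary()].
split_lines(Bin) ->
    split(Bin, 0, []).

split(Bin, Len, Acc) ->
    case Bin of
        <<Line:Len/binary, "\r\n", Rest/binary>> -> split(Rest, 0, [Line | Acc]);
        <<Line:Len/binary, "\n", Rest/binary>>   -> split(Rest, 0, [Line | Acc]);
        <<Line:Len/binary, "\r", Rest/binary>>   -> split(Rest, 0, [Line | Acc]);
        <<_:Len/binary>>                         -> lists:reverse([Bin | Acc]);
        _                                        -> split(Bin, Len + 1, Acc)
    end.
```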

When you touch this module

  • Loading a catalog: erli18n_server reads the file on its own (file:read_file/1) and calls parse/2 underneath — parse_file/1,2 is a convenience/test helper, NOT the production path. You rarely call it directly.
  • Validating/inspecting a .po in a tool or test: parse/1 or parse/2.
  • Roundtrip / programmatic rewrite: parse/1 -> edit -> dump/1.
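
The "edit" step of the roundtrip operates on plain maps and tuples, so it needs nothing from the library. A minimal sketch, assuming a hypothetical helper set_translation/4 in a hypothetical module catalog_edit_sketch:

```erlang
-module(catalog_edit_sketch).
-export([set_translation/4]).

%% Replaces the translation of every singular entry whose context and
%% msgid match. Plural entries and non-matching entries are untouched,
%% and file order is preserved.
-spec set_translation(map(), undefined | binary(), binary(), binary()) -> map().
set_translation(#{entries := Entries} = Catalog, Context, Msgid, New)
  when is_binary(Msgid), is_binary(New) ->
    Edited = [case Entry of
                  {singular, Context, Msgid, _Old} ->
                      {singular, Context, Msgid, New};
                  Other ->
                      Other
              end || Entry <- Entries],
    Catalog#{entries := Edited}.
```

In a tool this would sit between the two library calls: parse the binary with erli18n_po:parse/1, pass the resulting catalog through set_translation/4, and hand the result to erli18n_po:dump/1.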

Quickstart

1> Po = <<"msgid \"\"\n"
..          "msgstr \"Content-Type: text/plain; charset=UTF-8\\n\"\n"
..          "\n"
..          "msgid \"Hello\"\n"
..          "msgstr \"Ola\"\n">>.
2> {ok, Catalog} = erli18n_po:parse(Po).
3> maps:get(entries, Catalog).
[{singular,undefined,<<"Hello">>,<<"Ola">>}]
4> maps:get(charset, maps:get(header, Catalog)).
utf8
5> erli18n_po:parse(erli18n_po:dump(Catalog)) =:= {ok, Catalog}.
true

Key functions

Input: parse/1, parse/2, parse_file/1, parse_file/2. Output: dump/1. Result type: parsed_catalog/0; an entry is an entry/0; errors are a parse_error/0.

Summary

Types

entry()
A catalog entry, in one of two shapes.

header_map()
Catalog header, already reconciled.

parse_error()
Structured parse error: the only "normal" failure mode of the public API.

parse_opts()
Parse options. Today there is only one key: include_fuzzy.

parsed_catalog()
Result of a successful parse.

Functions

dump/1
Serializes a parsed_catalog() back to PO text (a UTF-8 binary).

parse/1
Parses a PO catalog from a binary, with default options (include_fuzzy => false).

parse/2
Parses a PO catalog from a binary, honoring Opts.

parse_file/1
Reads and parses a .po file from disk, with default options.

parse_file/2
Reads and parses a .po file from disk, honoring Opts.

Types

context()

-type context() :: undefined | binary().

entry()

-type entry() ::
          {singular, context(), msgid(), translation()} |
          {plural, context(), msgid(), msgid_plural(), [{plural_index(), translation()}]}.

A catalog entry, in one of two shapes.

{singular, Context, Msgid, Translation} — a 1:1 translation. Context is undefined (no msgctxt) or a binary. Translation may be <<>> (PSD-003: the empty value is preserved, it does not become a fallback here).

{plural, Context, Msgid, MsgidPlural, Forms} — a translation with plurals. MsgidPlural is the plural form from the source or undefined (degenerate case: only msgstr[N] without an explicit msgid_plural). Forms is a list [{plural_index(), translation()}] ORDERED by index, validated against nplurals (PSD-009).
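
Because the parser preserves an empty translation (PSD-003) and keeps the context as its own field (PSD-006), the fallback decision belongs to the caller. The following sketch shows one possible lookup over singular entries; the module lookup_sketch and the function lookup/3 are hypothetical, and the library's real lookup path is not described here.

```erlang
-module(lookup_sketch).
-export([lookup/3]).

%% Returns the translation for {Context, Msgid}, falling back to the
%% msgid itself when the entry is missing or its translation is <<>>.
-spec lookup([tuple()], undefined | binary(), binary()) -> binary().
lookup(Entries, Context, Msgid) ->
    Matches = [T || {singular, C, M, T} <- Entries,
                    C =:= Context, M =:= Msgid],
    case Matches of
        [Found | _] when Found =/= <<>> -> Found;
        _ -> Msgid
    end.
```

The same msgid under two contexts resolves to two different translations because the context is compared as a separate field.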

file_read_error()

-type file_read_error() :: file:posix() | badarg | terminated | system_limit.

header_map()

-type header_map() ::
          #{plural_forms => binary(),
            content_type => binary(),
            charset => utf8 | latin1 | us_ascii,
            raw => binary()}.

Catalog header, already reconciled.

  • charset is the normalized atom (PSD-002).
  • plural_forms is the RAW string of the Plural-Forms field (PSD-004). This module does NOT evaluate it; only erli18n_plural does. Here it is preserved for downstream use.
  • content_type is the raw value of the field of the same name.
  • raw is the entire msgstr text of the header, used by dump/1 to re-emit the header faithfully.

A catalog without a header of its own gets a synthetic header with charset => utf8 and the other fields empty.
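
The normalization of the declared charset label to an atom (PSD-002) might look like the sketch below. The alias set is an assumption: this page does not list which spellings the module accepts, and the names charset_label_sketch and normalize/1 are hypothetical.

```erlang
-module(charset_label_sketch).
-export([normalize/1]).

%% Maps a declared charset label to one of the three supported atoms.
%% The accepted aliases below are assumed, not taken from the module.
-spec normalize(binary()) ->
          {ok, utf8 | latin1 | us_ascii} |
          {error, {unsupported_charset, binary()}}.
normalize(Label) when is_binary(Label) ->
    case string:lowercase(string:trim(Label)) of
        <<"utf-8">>      -> {ok, utf8};
        <<"utf8">>       -> {ok, utf8};
        <<"iso-8859-1">> -> {ok, latin1};
        <<"latin1">>     -> {ok, latin1};
        <<"us-ascii">>   -> {ok, us_ascii};
        <<"ascii">>      -> {ok, us_ascii};
        _                -> {error, {unsupported_charset, Label}}
    end.
```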

msgid()

-type msgid() :: binary().

msgid_plural()

-type msgid_plural() :: undefined | binary().

parse_error()

-type parse_error() ::
          {unsupported_charset, binary()} |
          {charset_conversion, binary(), term()} |
          {plural_count_mismatch, msgid(), Expected :: non_neg_integer(), Got :: [non_neg_integer()]} |
          {syntax_error, Line :: pos_integer(), Reason :: term()} |
          {file_error, file_read_error()}.

Structured parse error — the only "normal" failure mode of the public API.

  • {unsupported_charset, Declared} — the Content-Type declared a charset that does not map to utf8 | latin1 | us_ascii.
  • {charset_conversion, Label, Detail} — the bytes do not match the declared charset (e.g. invalid UTF-8, a byte outside US-ASCII).
  • {plural_count_mismatch, Msgid, Expected, Got} — the msgstr[N] indices do not form exactly [0..Expected-1] (PSD-009).
  • {syntax_error, Line, Reason} — malformed line; Reason is term() and also carries the escape-decode errors (e.g. escape_invalid_utf8, octal_escape_out_of_range) without widening the exported tuple.
  • {file_error, Posix} — only parse_file/1,2: the disk read failed.
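
Since every failure is a tagged tuple, a caller can turn it into a message with one function clause per shape. A minimal sketch; the module parse_error_sketch, the function describe/1 and the message wording are illustrative assumptions.

```erlang
-module(parse_error_sketch).
-export([describe/1]).

%% One clause per parse_error() shape documented above.
-spec describe(tuple()) -> binary().
describe({unsupported_charset, Declared}) ->
    fmt("unsupported charset ~ts", [Declared]);
describe({charset_conversion, Label, Detail}) ->
    fmt("input is not valid ~ts: ~p", [Label, Detail]);
describe({plural_count_mismatch, Msgid, Expected, Got}) ->
    fmt("~ts: expected ~b plural forms, got indices ~w", [Msgid, Expected, Got]);
describe({syntax_error, Line, Reason}) ->
    fmt("line ~b: ~p", [Line, Reason]);
describe({file_error, Why}) ->
    fmt("cannot read file: ~p", [Why]).

fmt(Format, Args) ->
    unicode:characters_to_binary(io_lib:format(Format, Args)).
```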

parse_opts()

-type parse_opts() :: #{include_fuzzy => boolean()}.

Parse options. Today there is only one key: include_fuzzy.

With include_fuzzy => false (default), entries marked #, fuzzy are dropped when they are flushed into the result (parity with msgfmt). With true, they are kept. An empty map #{} inherits all the defaults.
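
The keep-or-drop decision can be sketched as a predicate over an entry's flag comment lines. The names fuzzy_flag_sketch, is_fuzzy/1 and keep/2 are hypothetical, and the real parser may track the flag differently.

```erlang
-module(fuzzy_flag_sketch).
-export([is_fuzzy/1, keep/2]).

%% True when a "#," flag line lists the fuzzy flag, possibly among
%% other comma-separated flags.
-spec is_fuzzy(binary()) -> boolean().
is_fuzzy(<<"#,", Flags/binary>>) ->
    Names = [string:trim(F) || F <- binary:split(Flags, <<",">>, [global])],
    lists:member(<<"fuzzy">>, Names);
is_fuzzy(_) ->
    false.

%% Decides whether an entry with the given flag lines enters the result.
%% include_fuzzy defaults to false, as in parse_opts().
-spec keep([binary()], map()) -> boolean().
keep(FlagLines, Opts) ->
    Fuzzy = lists:any(fun is_fuzzy/1, FlagLines),
    (not Fuzzy) orelse maps:get(include_fuzzy, Opts, false).
```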

parsed_catalog()

-type parsed_catalog() :: #{header := header_map(), entries := [entry()]}.

Result of a successful parse.

header is the header_map() (always present: synthesized empty if the .po had no header of its own). entries is in FILE ORDER. It is exactly the shape that dump/1 consumes.

The roundtrip law parse(dump(C)) =:= {ok, C} holds for catalogs whose header was parsed from a .po WITH a header of its own (raw =/= <<>>). When the catalog came from an input WITHOUT a header (a synthetic header with raw => <<>> and content_type => <<>>), dump/1 materializes a minimal Content-Type; on re-parse, that field becomes populated and the catalog differs from the original at that point. See dump/1 for the detail.

plural_index()

-type plural_index() :: non_neg_integer().

translation()

-type translation() :: binary().

Functions

dump(Catalog)

-spec dump(parsed_catalog()) -> binary().

Serializes a parsed_catalog() back to PO text (a UTF-8 binary).

Emits the header block first (msgid "" / msgstr "" plus the header raw, or a minimal header Content-Type: text/plain; charset=UTF-8 when the raw is empty or absent) and then each entry. singular entries produce msgctxt/msgid/msgstr; plural entries re-emit the retained msgid_plural (finding #14: when it is undefined, the singular msgid is used as a stand-in) and one msgstr[N] line per form. The strings are re-escaped (\\, \", \n, \t, \r) so that parse(dump(C)) preserves the catalog. For any valid parsed_catalog() the function is total: it always returns a binary().
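
The re-escaping of the five characters named above can be sketched with a binary comprehension. This is not the module's encoder; the module po_escape_sketch and the function escape/1 are hypothetical.

```erlang
-module(po_escape_sketch).
-export([escape/1]).

%% Escapes backslash, double quote, LF, TAB and CR for use inside a
%% quoted PO string. Working byte by byte is safe for UTF-8 input
%% because all five characters are ASCII and never occur inside a
%% multi-byte sequence.
-spec escape(binary()) -> binary().
escape(Bin) ->
    << <<(esc(B))/binary>> || <<B>> <= Bin >>.

esc($\\) -> <<"\\\\">>;
esc($")  -> <<"\\\"">>;
esc($\n) -> <<"\\n">>;
esc($\t) -> <<"\\t">>;
esc($\r) -> <<"\\r">>;
esc(B)   -> <<B>>.
```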

Parameter: Catalog must be a valid parsed_catalog() (#{header := _, entries := _}) — typically the {ok, Catalog} from parse/2. A map without the header/entries keys triggers function_clause (contract: it only consumes what parse/2 produces). Each entry() must have the singular/plural shape; a tuple of any other shape falls through dump_entry/1 and crashes.

The minimal synthetic header is NOT emitted with the Content-Type glued onto the msgstr line: dump_header_text/1 always emits msgstr "" and dumps the header body as quoted CONTINUATION LINES (encode_header_line/1). So the actual output for a catalog WITHOUT a header of its own is:

1> {ok, C} = erli18n_po:parse(<<"msgid \"Hi\"\nmsgstr \"Oi\"\n">>).
2> erli18n_po:dump(C).
<<"msgid \"\"\nmsgstr \"\"\n"
  "\"Content-Type: text/plain; charset=UTF-8\\n\"\n\n"
  "msgid \"Hi\"\nmsgstr \"Oi\"\n\n">>

WATCH OUT for the roundtrip: the .po above has no header, so C carries a synthetic header (raw => <<>>, content_type => <<>>). dump/1 injects a minimal Content-Type; on re-parse, that field is no longer empty, so the catalog differs from the original and equality is FALSE:

3> erli18n_po:parse(erli18n_po:dump(C)) =:= {ok, C}.
false

The law parse(dump(C)) =:= {ok, C} only holds for catalogs that ALREADY had a header of their own (raw =/= <<>>) — exactly the case of the quickstart in the moduledoc, which does return true.

Inverse path of parse/1 / parse/2. See parsed_catalog/0 and entry/0.

parse(Bin)

-spec parse(binary()) -> {ok, parsed_catalog()} | {error, parse_error()}.

Parses a PO catalog from a binary, with default options (include_fuzzy => false).

Equivalent to parse(Bin, #{}). Returns {ok, parsed_catalog()} with the normalized header and the list of entries (in file order), or {error, parse_error()} if the declared charset is unsupported, the conversion fails, there is a syntax error, or the plural indices diverge from nplurals.

1> erli18n_po:parse(<<"msgid \"Hello\"\nmsgstr \"Ola\"\n">>).
{ok,#{header => #{charset => utf8,content_type => <<>>,
                  plural_forms => <<>>,raw => <<>>},
      entries => [{singular,undefined,<<"Hello">>,<<"Ola">>}]}}

See parse/2 for the full semantics of options and the pipeline, and dump/1 for the inverse path.

parse(Bin, Opts)

-spec parse(binary(), parse_opts()) -> {ok, parsed_catalog()} | {error, parse_error()}.

Parses a PO catalog from a binary, honoring Opts.

Bin is the raw content of the .po; Opts is a parse_opts(), which today holds only include_fuzzy => boolean() (default false: entries marked #, fuzzy are dropped, parity with msgfmt). The flow:

  1. Silently strips the UTF-8 BOM (PSD-005).
  2. Runs a prepass that extracts the charset from the Content-Type header via the same field reconciler as build_header/1, so that prepass and builder never diverge (finding #5, which closes the badmatch on a Content-Type with a space before the colon).
  3. Transcodes the entire body from the discovered charset to UTF-8.
  4. Parses line by line with the charset threaded through, so that \xHH/\OOO escapes are transcoded through the right code space.

Returns {ok, parsed_catalog()} (#{header => header_map(), entries => [entry()]}) or {error, parse_error()}. Without an explicit header, it synthesizes an empty header with charset utf8. Accepts LF, CRLF and lone-CR line endings (finding #15).

Parameters:

  • Bin — raw content of the .po/.pot. Treated as UNTRUSTED: an nplurals= or msgstr[N] with an absurd run of digits is rejected in O(1) (cap by ?MAX_INT_DIGITS), never builds a bignum.
  • Opts — see parse_opts/0. include_fuzzy controls whether #, fuzzy entries enter the result.

Failure modes (all {error, parse_error()}, never a crash): an unsupported declared charset, bytes that do not match the charset, plural indices that diverge from nplurals, and malformed lines (with line number).

1> Fuzzy = <<"#, fuzzy\nmsgid \"a\"\nmsgstr \"b\"\n">>.
2> {ok, C0} = erli18n_po:parse(Fuzzy, #{}).
3> maps:get(entries, C0).
[]
4> {ok, C1} = erli18n_po:parse(Fuzzy, #{include_fuzzy => true}).
5> maps:get(entries, C1).
[{singular,undefined,<<"a">>,<<"b">>}]
6> erli18n_po:parse(<<"msgid \"a\"\nmsgstr \"b\"\n???\n">>).
{error,{syntax_error,3,{unrecognized_line,<<"???">>}}}

See parse/1 (defaults), parse_file/2 (from disk) and dump/1.

parse_file(Path)

-spec parse_file(file:filename()) -> {ok, parsed_catalog()} | {error, parse_error()}.

Reads and parses a .po file from disk, with default options.

Equivalent to parse_file(Path, #{}). Reads Path with file:read_file/1 and delegates to parse/2. Read errors become {error, {file_error, file_read_error()}}.

1> erli18n_po:parse_file("priv/locale/fr/LC_MESSAGES/my_domain.po").
{ok,#{header => #{charset => utf8, ...}, entries => [...]}}
2> erli18n_po:parse_file("/does/not/exist.po").
{error,{file_error,enoent}}

See parse_file/2 (with options) and parse/2 (the parse semantics themselves).

parse_file(Path, Opts)

-spec parse_file(file:filename(), parse_opts()) -> {ok, parsed_catalog()} | {error, parse_error()}.

Reads and parses a .po file from disk, honoring Opts.

Reads Path with file:read_file/1; on success it delegates the binary to parse/2 with Opts (see parse/2 for the semantics of the options and the return). If the read fails, it returns {error, {file_error, Posix}}, where Posix ranges over file:posix() | badarg | terminated | system_limit.

Parameters:

  • Path — file path, any file:filename().
  • Opts — passed untouched to parse/2; see parse_opts/0.

The only difference from parse/2 is the read phase: I/O errors become {error, {file_error, Posix}}; once the bytes are read, everything follows exactly the rules of parse/2.

1> erli18n_po:parse_file("catalog.po", #{include_fuzzy => true}).
{ok,#{header => #{...}, entries => [...]}}

See parse_file/1 (defaults) and parse/2.