erli18n_negotiate (erli18n v0.5.0)


Canonicalization-aware BCP-47 locale negotiation and fallback (Phase 2).

This module is the pure, total, dependency-free engine behind erli18n's opt-in locale-fallback chain and the Accept-Language negotiation helpers exposed on the erli18n facade (negotiate/2, parse_accept_language/1, canonicalize_locale/1). It holds no state: no gen_server, no ETS, no process dictionary, no application:get_env. Every function runs in the caller's process and is property-testable in isolation.

The problem it solves

erli18n catalogs are keyed by exact binary (<<"pt_BR">>). Two correctness gaps follow from that:

  1. A pt_BR user with only a pt catalog loaded gets the raw msgid (English) instead of Portuguese — there is no base-language fallback.
  2. HTTP delivers hyphenated, mixed-case tags (pt-BR, PT_br) and legacy subtags (iw for Hebrew), none of which match the underscored catalog key pt_BR.

This module closes both gaps: canonicalize/1 folds a tag to the catalog-key shape, fallback_chain/2 builds the ordered list of candidates to try, and parse_accept_language/1 together with negotiate/2,3 and best_match/3 picks the best supported locale from a client preference list.

It does not change erli18n's default behavior. The facade only consults this module after an exact-match miss and only when the application env erli18n.locale_fallback is enabled (default off). The lock-free exact-hit hot path is untouched.

Canonicalization (canonicalize/1)

The target shape is the erli18n catalog key: underscore-joined subtags with RFC 5646 §2.1.1 positional casing (language lowercase, script Titlecase, region UPPERCASE), for example pt_BR, zh_Hant, zh_Hant_TW. The transform is:

  • Strip a POSIX charset/modifier suffix (pt_BR.UTF-8, ca_ES@valencia).
  • Treat - and _ as equivalent separators.
  • Case each subtag: the first subtag (language) is lowercased; every later subtag is cased by byte length (2 bytes = region, UPPERCASE; 4 bytes = script, Titlecase; anything else lowercase).
  • Map a small, closed set of IANA-deprecated two-letter language codes to their preferred value, on the language subtag only.

It is idempotent (canonicalize(canonicalize(X)) =:= canonicalize(X)) and never raises on any binary content (an oversized or absurd tag is returned unchanged).
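
The four steps above can be sketched as follows. This is a minimal illustration, not the module's actual source: the module name canon_sketch and all helper names are hypothetical, and the two caps are copied from the limits listed under "Totality and anti-DoS". Case folding is ASCII-only so that arbitrary bytes cannot raise.

```erlang
-module(canon_sketch).
-export([canonicalize/1]).

%% Assumed limits, copied from the caps documented for this module.
-define(MAX_TAG_BYTES, 35).
-define(MAX_SUBTAGS, 8).

-spec canonicalize(binary()) -> binary().
canonicalize(<<>>) ->
    <<>>;
canonicalize(Tag) when byte_size(Tag) > ?MAX_TAG_BYTES ->
    Tag;                                    % oversized: returned unchanged
canonicalize(Tag) when is_binary(Tag) ->
    %% 1. Strip a POSIX charset/modifier suffix (text after "." or "@").
    [Head | _] = binary:split(Tag, [<<".">>, <<"@">>]),
    %% 2. Treat "-" and "_" as the same separator.
    Subtags = binary:split(Head, [<<"-">>, <<"_">>], [global]),
    case length(Subtags) =< ?MAX_SUBTAGS of
        false ->
            Tag;                            % too many subtags: unchanged
        true ->
            [Lang | Rest] = Subtags,
            %% 3. Case by position/length; 4. alias the language subtag only.
            Parts = [alias(lower(Lang)) | [recase(S) || S <- Rest]],
            iolist_to_binary(lists:join(<<"_">>, Parts))
    end.

%% 2 bytes = region (UPPERCASE), 4 bytes = script (Titlecase), else lowercase.
recase(S) when byte_size(S) =:= 2 -> upper(S);
recase(<<First, Tail:3/binary>>) -> <<(upper_char(First)), (lower(Tail))/binary>>;
recase(S) -> lower(S).

%% The closed legacy-alias set from the table in this module doc.
alias(<<"in">>) -> <<"id">>;
alias(<<"iw">>) -> <<"he">>;
alias(<<"ji">>) -> <<"yi">>;
alias(<<"jw">>) -> <<"jv">>;
alias(<<"mo">>) -> <<"ro">>;
alias(Lang) -> Lang.

%% ASCII-only case folding; non-ASCII bytes pass through untouched.
lower(Bin) -> << <<(lower_char(C))>> || <<C>> <= Bin >>.
upper(Bin) -> << <<(upper_char(C))>> || <<C>> <= Bin >>.

lower_char(C) when C >= $A, C =< $Z -> C + 32;
lower_char(C) -> C.

upper_char(C) when C >= $a, C =< $z -> C - 32;
upper_char(C) -> C.
```

Because the output contains only underscore separators and no suffix, feeding it back through the same function leaves it unchanged, which is the idempotence property stated above.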

Legacy-alias table (the complete, IN-scope set)

Deprecated   Preferred   Language
in           id          Indonesian
iw           he          Hebrew
ji           yi          Yiddish
jw           jv          Javanese
mo           ro          Moldovan → Romanian

Out of scope (documented non-goals): sh (macrolanguage, no preferred value), no/nb/nn (not deprecated), tl/fil, the script-vs-region inference zh_Hans ↔ zh_CN (needs the CLDR Add Likely Subtags algorithm + data), and grandfathered/irregular tags (i-klingon). Those pass through as ordinary (mis)canonicalized binaries that simply miss the catalog; they are never special-cased.

Fallback chain (fallback_chain/2)

RFC 4647 §3.4 Lookup: canonicalize the tag, then progressively drop the trailing subtag, appending the (canonicalized) default last. pt-BR with default en yields [<<"pt_BR">>, <<"pt">>, <<"en">>]. The chain is deduplicated with order preserved and bounded in length (?MAX_CHAIN). The facade walks it with one catalog read per candidate and short-circuits on the first hit, so the extra cost is O(chain length) reads on a miss only and zero on an exact hit.

Script subtags are kept during truncation (zh_Hant_TW → zh_Hant → zh), matching RFC 4647 Lookup rather than CLDR's script-aware stop.
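
The truncation loop can be illustrated with a small sketch. It assumes both arguments are already in catalog-key shape (the real function canonicalizes them first), and it applies the cap by simply truncating the finished list, which is an assumption about how the bound is enforced. The module name chain_sketch and its helpers are hypothetical.

```erlang
-module(chain_sketch).
-export([chain/2]).

%% Assumed cap, copied from the ?MAX_CHAIN limit documented for this module.
-define(MAX_CHAIN, 8).

%% Locale and Default are assumed to be ALREADY canonical (underscore-joined).
-spec chain(binary(), binary() | undefined) -> [binary(), ...].
chain(Locale, Default) ->
    Subtags = binary:split(Locale, <<"_">>, [global]),
    Floor = case Default of
                undefined -> [];
                _ -> [Default]
            end,
    %% Most specific first, default last, duplicates removed, length capped.
    lists:sublist(dedup(truncations(Subtags) ++ Floor), ?MAX_CHAIN).

%% [a,b,c] -> [<<"a_b_c">>, <<"a_b">>, <<"a">>]: drop the trailing subtag
%% until nothing is left.
truncations([]) ->
    [];
truncations(Subtags) ->
    Joined = iolist_to_binary(lists:join(<<"_">>, Subtags)),
    [Joined | truncations(lists:droplast(Subtags))].

%% Order-preserving deduplication: keep the first occurrence of each element.
dedup(List) -> dedup(List, []).

dedup([], _Seen) ->
    [];
dedup([H | T], Seen) ->
    case lists:member(H, Seen) of
        true -> dedup(T, Seen);
        false -> [H | dedup(T, [H | Seen])]
    end.
```

Note that the script subtag is dropped like any other trailing subtag, so zh_Hant_TW passes through zh_Hant on its way to zh.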

Accept-Language (parse_accept_language/1, best_match/3)

parse_accept_language/1 parses an HTTP Accept-Language header (RFC 9110 §12.5.4) into [{Range, Q}] with Q as an integer in milli-units (0..1000). Absent q is 1000; a well-formed q=0 entry is dropped ("not acceptable"); the list is sorted by descending Q with a stable header-order tiebreak. The output shape matches cowlib's cow_http_hd:parse_accept_language/1, but this parser is total/fail-soft (it never crashes on malformed input — cowlib does).
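
The drop-and-sort rule can be sketched in isolation. The example below takes pairs that are already parsed, in header order, and applies only the ordering step; it is an illustration under that assumption, not the parser itself. The module name order_sketch is hypothetical.

```erlang
-module(order_sketch).
-export([order/1]).

%% Input: [{Range, Q}] in header order, Q in milli-units (0..1000).
%% Output: q=0 entries dropped, sorted by descending Q, ties kept in
%% header order.
-spec order([{binary(), 0..1000}]) -> [{binary(), 1..1000}].
order(Pairs) ->
    Indexed = lists:zip(lists:seq(1, length(Pairs)), Pairs),
    %% Sort key {-Q, Position}: higher Q first, then earlier position first.
    %% Positions are unique, so the range itself never decides the order.
    Keyed = [{-Q, I, Range} || {I, {Range, Q}} <- Indexed, Q > 0],
    [{Range, -NegQ} || {NegQ, _I, Range} <- lists:sort(Keyed)].
```

Carrying the header position in the sort key makes the tiebreak explicit instead of relying on the stability of a particular sort function.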

best_match/3 / negotiate/2,3 run RFC 4647 Lookup of the (already priority-ordered) preference list against the available catalog locales, returning the first supported match (or a default / error).

Totality and anti-DoS

Consistent with erli18n_interp and erli18n_plural, the work is bounded fail-closed and never interns untrusted text into atoms:

  • ?MAX_TAG_BYTES (35) — a longer tag/range is returned unchanged / skipped.
  • ?MAX_SUBTAGS (8) — a tag with more subtags is returned unchanged.
  • ?MAX_CHAIN (8) — fallback chain length cap.
  • ?MAX_HEADER_BYTES (4096) — a longer Accept-Language header → [].
  • ?MAX_RAW_ELEMS (64) — comma-split element cap (RFC 9110 §5.6.1) → [].
  • ?MAX_RANGES (32) — accepted-range budget.

No binary_to_atom/list_to_atom is used anywhere; locales stay binaries, so a stream of distinct hostile tags cannot exhaust the atom table.
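
As an illustration of the fail-closed style, the sketch below gates a header on the two header-level caps before any per-element work happens. It is a guess at the shape of such a gate, not the module's code: the module name bounds_sketch and the helper names are hypothetical, and only the two cap values come from the list above.

```erlang
-module(bounds_sketch).
-export([split_header/1]).

%% Assumed caps, copied from the limits documented for this module.
-define(MAX_HEADER_BYTES, 4096).
-define(MAX_RAW_ELEMS, 64).

%% Returns the trimmed, non-empty comma-separated elements of a header,
%% or [] when either cap is exceeded (fail closed, no partial result).
-spec split_header(binary()) -> [binary()].
split_header(Header) when byte_size(Header) > ?MAX_HEADER_BYTES ->
    [];
split_header(Header) when is_binary(Header) ->
    Elems = binary:split(Header, <<",">>, [global]),
    case length(Elems) > ?MAX_RAW_ELEMS of
        true -> [];
        false -> [E || E <- [trim(X) || X <- Elems], E =/= <<>>]
    end.

%% Byte-level trim of spaces and tabs; safe on arbitrary (non-UTF-8) bytes.
trim(<<C, Rest/binary>>) when C =:= $\s; C =:= $\t ->
    trim(Rest);
trim(Bin) ->
    trim_right(Bin).

trim_right(<<>>) ->
    <<>>;
trim_right(Bin) ->
    case binary:last(Bin) of
        C when C =:= $\s; C =:= $\t ->
            trim_right(binary:part(Bin, 0, byte_size(Bin) - 1));
        _ ->
            Bin
    end.
```

The size check runs in a guard, so an oversized header is rejected before it is ever split, and everything stays a binary throughout.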

Quickstart

1> erli18n_negotiate:canonicalize(<<"pt-BR">>).
<<"pt_BR">>
2> erli18n_negotiate:canonicalize(<<"iw-IL">>).
<<"he_IL">>
3> erli18n_negotiate:fallback_chain(<<"pt-BR">>, <<"en">>).
[<<"pt_BR">>,<<"pt">>,<<"en">>]
4> erli18n_negotiate:parse_accept_language(<<"da, en-gb;q=0.8, en;q=0.7">>).
[{<<"da">>,1000},{<<"en-gb">>,800},{<<"en">>,700}]
5> erli18n_negotiate:negotiate([<<"pt-BR">>], [<<"pt">>, <<"en">>]).
{ok,<<"pt">>}

Types

language_range()

-type language_range() :: binary().

An RFC 4647 language range as it appears on the wire in an Accept-Language header (<<"en-gb">>); may be the wildcard <<"*">>. ASCII-lowercased by parse_accept_language/1, hyphen-separated (NOT yet canonicalized).

locale()

-type locale() :: binary().

A locale tag as a binary, in erli18n catalog-key shape after canonicalization (<<"pt_BR">>, <<"zh_Hant">>). Same semantics as erli18n_server:locale/0.

qvalue()

-type qvalue() :: 0..1000.

A quality value as an integer in milli-units, 0..1000 (q=1 → 1000, q=0.8 → 800). Integer arithmetic avoids float parsing of untrusted text.
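
One way to get from the wire text to milli-units without ever building a float is sketched below. It follows the RFC 9110 qvalue grammar (0 or 1, optionally followed by a dot and up to three digits, which must all be zero after 1). The module name qvalue_sketch and the {ok, Q} | error return shape are assumptions made for this illustration.

```erlang
-module(qvalue_sketch).
-export([parse_q/1]).

%% Parses the text after "q=" into milli-units using integer arithmetic only.
-spec parse_q(binary()) -> {ok, 0..1000} | error.
parse_q(<<"1">>) ->
    {ok, 1000};
parse_q(<<"1.", Zeros/binary>>) when byte_size(Zeros) =< 3 ->
    %% After "1." only zeros are allowed (1.0, 1.00, 1.000).
    case Zeros =:= binary:copy(<<"0">>, byte_size(Zeros)) of
        true -> {ok, 1000};
        false -> error
    end;
parse_q(<<"0">>) ->
    {ok, 0};
parse_q(<<"0.", Frac/binary>>) when byte_size(Frac) =< 3 ->
    frac(Frac, 100, 0);
parse_q(_) ->
    error.

%% Each digit is weighted 100, 10, 1, so "8" -> 800 and "75" -> 750.
frac(<<>>, _Weight, Acc) ->
    {ok, Acc};
frac(<<D, Rest/binary>>, Weight, Acc) when D >= $0, D =< $9 ->
    frac(Rest, Weight div 10, Acc + (D - $0) * Weight);
frac(_, _, _) ->
    error.
```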

Functions

best_match(Preferred, Available, Default)

-spec best_match([locale()] | [{locale(), qvalue()}], [locale()], locale()) -> locale().

The bare RFC 4647 Lookup primitive: like negotiate/3 but returns the matched (or Default) locale directly, never wrapped. Always succeeds (falls to Default). Total.

1> erli18n_negotiate:best_match([<<"en-US">>], [<<"en">>], <<"x">>).
<<"en">>

canonicalize(Tag)

-spec canonicalize(binary()) -> binary().

Canonicalizes ONE BCP-47 / POSIX locale tag to erli18n catalog-key shape.

Underscore-joined, RFC 5646 §2.1.1 positional casing (language lowercase, script Titlecase, region UPPERCASE), with a charset/modifier suffix stripped and a bounded legacy-language alias applied to the language subtag. Hyphen and underscore are equivalent on input.

Total and idempotent: any binary input returns a binary and re-running produces the same result. A binary over ?MAX_TAG_BYTES, an empty binary, or a tag with more than ?MAX_SUBTAGS subtags is returned UNCHANGED (fail-soft). A non-binary argument is a programmer error and raises function_clause (the contract is binary-in/binary-out).

1> erli18n_negotiate:canonicalize(<<"PT_br">>).
<<"pt_BR">>
2> erli18n_negotiate:canonicalize(<<"zh-hant-tw">>).
<<"zh_Hant_TW">>
3> erli18n_negotiate:canonicalize(<<"ca_ES@valencia">>).
<<"ca_ES">>
4> erli18n_negotiate:canonicalize(<<"iw">>).
<<"he">>

See fallback_chain/2 (uses this) and the module doc for the alias table and the documented non-goals (zh_Hans ↔ zh_CN Likely Subtags).

fallback_chain(Locale, Default)

-spec fallback_chain(locale(), locale() | undefined) -> [locale(), ...].

Builds the ordered, deduplicated RFC 4647 Lookup fallback chain for a locale, ending in Default (canonicalized) unless Default =:= undefined.

Locale is canonicalized first, then the trailing subtag is dropped repeatedly to a fixpoint (zh_Hant_TW → zh_Hant → zh); Default is appended last. The result is order-preserving deduplicated and capped at ?MAX_CHAIN. The head is the most specific candidate. Total; the returned list is always non-empty (at minimum [canonicalize(Locale)]).

1> erli18n_negotiate:fallback_chain(<<"pt-BR">>, <<"en">>).
[<<"pt_BR">>,<<"pt">>,<<"en">>]
2> erli18n_negotiate:fallback_chain(<<"zh_Hant_TW">>, <<"en">>).
[<<"zh_Hant_TW">>,<<"zh_Hant">>,<<"zh">>,<<"en">>]
3> erli18n_negotiate:fallback_chain(<<"en">>, undefined).
[<<"en">>]

The facade walks this list with one catalog read per candidate, returning on the first hit; this is what makes a pt_BR user fall back to a loaded pt catalog. See canonicalize/1.

negotiate(Preferred, Available)

-spec negotiate([locale()] | [{locale(), qvalue()}], [locale()]) -> {ok, locale()} | error.

Picks the best supported locale for a preference list, or error.

Preferred is an ordered preference list (priority = position): either [locale()] or the [{locale(), qvalue()}] output of parse_accept_language/1 (the Q is ignored — order already encodes priority, and q=0 ranges were already dropped). Available is the list of catalog locales (e.g. erli18n:loaded_catalogs/0 locales).

Each Preferred entry is canonicalized and resolved through its fallback_chain/2 (no default) against a canonical→original index of Available; the FIRST hit wins. * ranges are skipped. The returned locale is the ORIGINAL Available casing. Total.

1> erli18n_negotiate:negotiate([<<"pt-BR">>], [<<"pt">>, <<"en">>]).
{ok,<<"pt">>}
2> erli18n_negotiate:negotiate([<<"zh_Hant">>], [<<"en">>]).
error

See negotiate/3 (default instead of error) and best_match/3.
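
The first-hit-wins walk can be sketched as below. To stay short, it assumes every tag in both lists is already in catalog-key shape, so the canonical→original index reduces to a plain membership test; the real function canonicalizes both sides. The module name lookup_sketch and its helpers are hypothetical.

```erlang
-module(lookup_sketch).
-export([negotiate/2]).

%% Preferred and Available are assumed to be ALREADY canonical.
-spec negotiate([binary() | {binary(), 0..1000}], [binary()]) ->
    {ok, binary()} | error.
negotiate([], _Available) ->
    error;
negotiate([{Range, _Q} | Rest], Available) ->
    %% Q is ignored: list position already encodes priority.
    negotiate([Range | Rest], Available);
negotiate([<<"*">> | Rest], Available) ->
    %% Wildcard ranges are skipped.
    negotiate(Rest, Available);
negotiate([Tag | Rest], Available) when is_binary(Tag) ->
    case first_hit(truncations(Tag), Available) of
        {ok, _} = Hit -> Hit;
        error -> negotiate(Rest, Available)
    end.

%% <<"zh_Hant_TW">> -> [<<"zh_Hant_TW">>, <<"zh_Hant">>, <<"zh">>]
truncations(Tag) ->
    Subtags = binary:split(Tag, <<"_">>, [global]),
    [iolist_to_binary(lists:join(<<"_">>, lists:sublist(Subtags, N)))
     || N <- lists:seq(length(Subtags), 1, -1)].

first_hit([], _Available) ->
    error;
first_hit([Candidate | Rest], Available) ->
    case lists:member(Candidate, Available) of
        true -> {ok, Candidate};
        false -> first_hit(Rest, Available)
    end.
```

Each preference is exhausted down to its bare language before the next preference is considered, so an earlier entry that matches only at the language level still beats a later entry that matches exactly.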

negotiate(Preferred, Available, Default)

-spec negotiate([locale()] | [{locale(), qvalue()}], [locale()], locale()) -> {ok, locale()}.

Like negotiate/2, but returns {ok, Default} instead of error when nothing matches. Default is the caller's chosen floor (the RFC 4647 Lookup default) and is NOT validated against Available. Total.

1> erli18n_negotiate:negotiate([<<"zh_Hant">>], [<<"en">>], <<"en">>).
{ok,<<"en">>}

override_chain(Locale, Overrides, Default)

-spec override_chain(locale(), [locale()], locale() | undefined) -> [locale(), ...].

Builds an explicit-override fallback chain for {explicit, Map} mode: the canonicalized Overrides list prefixed with canonicalize(Locale) and floored with Default. Order-preserving deduplicated and bounded by the SAME ?MAX_CHAIN cap as fallback_chain/2. Total.

Exposed so the facade's explicit-map mode reuses one bounding/dedup implementation instead of re-deriving the cap.

1> erli18n_negotiate:override_chain(<<"de-AT">>, [<<"de">>], <<"en">>).
[<<"de_AT">>,<<"de">>,<<"en">>]

parse_accept_language(Bin)

-spec parse_accept_language(binary()) -> [{language_range(), qvalue()}].

Parses an HTTP Accept-Language header into [{Range, Q}].

Range is the ASCII-lowercased, hyphen-separated language range as on the wire (NOT canonicalized; may be <<"*">>); Q is an integer in milli-units (0..1000). An absent q parameter means 1000; a well-formed q=0 entry is DROPPED. The list is sorted by descending Q, ties broken by ascending header position (stable).

Total and fail-soft: any malformed element is skipped, never crashing. Returns [] on an empty header, a header over ?MAX_HEADER_BYTES, or one with more than ?MAX_RAW_ELEMS comma elements. A non-binary argument raises function_clause. At most ?MAX_RANGES ranges are returned.

The output shape matches cowlib's cow_http_hd:parse_accept_language/1, so a Cowboy app may feed either source into negotiate/2. Unlike cowlib, this parser never raises on hostile input.

1> erli18n_negotiate:parse_accept_language(<<"da, en-gb;q=0.8, en;q=0.7">>).
[{<<"da">>,1000},{<<"en-gb">>,800},{<<"en">>,700}]
2> erli18n_negotiate:parse_accept_language(<<"fr;q=0, de">>).
[{<<"de">>,1000}]

See best_match/3 and negotiate/2,3.