Canonicalization-aware BCP-47 locale negotiation and fallback (Phase 2).
This module is the pure, total, dependency-free engine behind erli18n's
opt-in locale-fallback chain and the Accept-Language negotiation helpers
exposed on the erli18n facade (negotiate/2, parse_accept_language/1,
canonicalize_locale/1). It holds no state: no gen_server, no ETS,
no process dictionary, no application:get_env. Every function runs in the
caller's process and is property-testable in isolation.
The problem it solves
erli18n catalogs are keyed by exact binary (<<"pt_BR">>). Two correctness
gaps follow from that:
- A pt_BR user with only a pt catalog loaded gets the raw msgid (English) instead of Portuguese: there is no base-language fallback.
- HTTP delivers hyphenated, mixed-case tags (pt-BR, PT_br) and legacy subtags (iw for Hebrew), none of which match the underscored catalog key pt_BR.
This module closes both: canonicalize/1 folds a tag to the catalog-key
shape, fallback_chain/2 builds the ordered candidate list to try, and
parse_accept_language/1 + negotiate/2,3 / best_match/3 pick the best
supported locale from a client preference list.
It does not change erli18n's default behavior. The facade only consults
this module after an exact-match miss and only when the application
env erli18n.locale_fallback is enabled (default off). The lock-free
exact-hit hot path is untouched.
Canonicalization (canonicalize/1)
Target shape = erli18n catalog key = underscore-joined, RFC 5646 §2.1.1
positional casing: language lowercase, script Titlecase, region UPPERCASE
(pt_BR, zh_Hant, zh_Hant_TW). The transform is:
- Strip a POSIX charset/modifier suffix (pt_BR.UTF-8, ca_ES@valencia).
- Treat - and _ as equivalent separators.
- Case each subtag by position (language) and byte length (2 = region, 4 = script, else lowercase).
- Map a small, closed set of IANA-deprecated two-letter language codes to their preferred value, on the language subtag only.
It is idempotent (canonicalize(canonicalize(X)) =:= canonicalize(X))
and never raises on any binary content (an oversized or absurd tag is
returned unchanged).
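The four steps above can be sketched as a small standalone module. This is an illustrative reimplementation, not the erli18n source: the module name canon_sketch and every helper in it are hypothetical, and only the limits (35 bytes, 8 subtags) and the alias pairs are taken from this document. Case folding is done byte-wise on ASCII so that arbitrary binary content cannot raise.

```erlang
-module(canon_sketch).
-export([canonicalize/1]).

%% Hypothetical sketch. Limits mirror the documented ?MAX_TAG_BYTES / ?MAX_SUBTAGS.
-define(MAX_TAG_BYTES, 35).
-define(MAX_SUBTAGS, 8).

-spec canonicalize(binary()) -> binary().
canonicalize(Tag) when is_binary(Tag), byte_size(Tag) > ?MAX_TAG_BYTES ->
    Tag;                                        % oversized: returned unchanged
canonicalize(Tag) when is_binary(Tag) ->
    %% Step 1: strip a POSIX charset/modifier suffix (".UTF-8", "@valencia").
    [Base | _] = binary:split(Tag, [<<".">>, <<"@">>]),
    %% Step 2: "-" and "_" are equivalent separators.
    case binary:split(Base, [<<"-">>, <<"_">>], [global]) of
        [<<>> | _] ->
            Tag;                                % empty language subtag: unchanged
        Subtags when length(Subtags) > ?MAX_SUBTAGS ->
            Tag;                                % too many subtags: unchanged
        [Lang | Rest] ->
            %% Steps 3 and 4: positional casing; alias the language subtag only.
            join([alias(lower(Lang)) | [case_subtag(S) || S <- Rest]])
    end.

%% 2 bytes = region (UPPERCASE), 4 bytes = script (Titlecase), else lowercase.
case_subtag(S) when byte_size(S) =:= 2 -> upper(S);
case_subtag(S) when byte_size(S) =:= 4 -> title(S);
case_subtag(S) -> lower(S).

%% The closed legacy-alias set from the table in this document.
alias(<<"in">>) -> <<"id">>;
alias(<<"iw">>) -> <<"he">>;
alias(<<"ji">>) -> <<"yi">>;
alias(<<"jw">>) -> <<"jv">>;
alias(<<"mo">>) -> <<"ro">>;
alias(Lang) -> Lang.

%% ASCII-only case folding; non-ASCII bytes pass through untouched.
lower(Bin) -> << <<(lower_byte(C))>> || <<C>> <= Bin >>.
upper(Bin) -> << <<(upper_byte(C))>> || <<C>> <= Bin >>.
title(<<H, T/binary>>) -> <<(upper_byte(H)), (lower(T))/binary>>.

lower_byte(C) when C >= $A, C =< $Z -> C + 32;
lower_byte(C) -> C.
upper_byte(C) when C >= $a, C =< $z -> C - 32;
upper_byte(C) -> C.

join(Parts) -> iolist_to_binary(lists:join(<<"_">>, Parts)).
```

Because the output contains no suffix, only underscores, and already-cased subtags, feeding it back in reproduces the same binary, which is the idempotence property stated above.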
Legacy-alias table (the complete, IN-scope set)
| Deprecated | Preferred | Language |
|---|---|---|
| in | id | Indonesian |
| iw | he | Hebrew |
| ji | yi | Yiddish |
| jw | jv | Javanese |
| mo | ro | Moldovan → Romanian |
Out of scope (documented non-goals): sh (macrolanguage, no preferred
value), no/nb/nn (not deprecated), tl/fil, the script-vs-region
inference zh_Hans ⇄ zh_CN (needs the CLDR Add Likely Subtags
algorithm + data), and grandfathered/irregular tags (i-klingon). Those
pass through as ordinary (mis)canonicalized binaries that simply miss the
catalog — never special-cased.
Fallback chain (fallback_chain/2)
RFC 4647 §3.4 Lookup: canonicalize, then progressively drop the trailing
subtag, appending the (canonicalized) default last. pt-BR with default
en yields [<<"pt_BR">>, <<"pt">>, <<"en">>]. The chain is
order-preserving deduplicated and bounded. The facade walks it doing one
catalog read per candidate, short-circuiting on the first hit — so the cost
is O(chain length) extra reads only on a miss, zero on a hit.
Script subtags are kept during truncation (zh_Hant_TW → zh_Hant → zh),
matching RFC 4647 Lookup rather than CLDR's script-aware stop.
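A minimal sketch of the truncation walk, assuming the inputs are already in catalog-key shape (the real function canonicalizes first). The module name chain_sketch and its helpers are hypothetical. The sketch applies the cap after the default is appended; whether the real implementation guarantees the default survives the cap is not stated here.

```erlang
-module(chain_sketch).
-export([fallback_chain/2]).

%% Hypothetical sketch. The cap mirrors the documented ?MAX_CHAIN (8).
-define(MAX_CHAIN, 8).

%% Assumes Locale and Default are ALREADY canonical (underscore-joined).
-spec fallback_chain(binary(), binary() | undefined) -> [binary()].
fallback_chain(Locale, Default) ->
    Floor = case Default of
                undefined -> [];
                _ -> [Default]
            end,
    lists:sublist(dedup(truncations(Locale) ++ Floor), ?MAX_CHAIN).

%% <<"zh_Hant_TW">> -> [<<"zh_Hant_TW">>, <<"zh_Hant">>, <<"zh">>]
truncations(Locale) ->
    case binary:matches(Locale, <<"_">>) of
        [] ->
            [Locale];
        Matches ->
            %% Cut at the LAST underscore, i.e. drop the trailing subtag.
            {Pos, _Len} = lists:last(Matches),
            [Locale | truncations(binary:part(Locale, 0, Pos))]
    end.

%% Order-preserving deduplication: the first occurrence of each element wins.
dedup(List) -> dedup(List, []).

dedup([], Acc) -> lists:reverse(Acc);
dedup([H | T], Acc) ->
    case lists:member(H, Acc) of
        true -> dedup(T, Acc);
        false -> dedup(T, [H | Acc])
    end.
```

Deduplication matters when the default is also a truncation of the locale: en_US with default en yields two candidates, not three.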
Accept-Language (parse_accept_language/1, best_match/3)
parse_accept_language/1 parses an HTTP Accept-Language header
(RFC 9110 §12.5.4) into [{Range, Q}] with Q as an integer in milli-units
(0..1000). Absent q is 1000; a well-formed q=0 entry is dropped
("not acceptable"); the list is sorted by descending Q with a stable
header-order tiebreak. The output shape matches cowlib's
cow_http_hd:parse_accept_language/1, but this parser is total/fail-soft
(it never crashes on malformed input — cowlib does).
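The milli-unit representation and the stable ordering can be illustrated with a short sketch. It follows the RFC 9110 qvalue grammar (at most three fractional digits; a leading 1 allows only zeros after the point) and never converts text to a float. The module qvalue_sketch and both function names are hypothetical, not part of erli18n.

```erlang
-module(qvalue_sketch).
-export([parse_q/1, sort_by_q/1]).

%% Hypothetical sketch: parse "0", "0.8", "0.75", "1", "1.000" into integer
%% milli-units. Anything outside the RFC 9110 qvalue grammar yields error.
-spec parse_q(term()) -> {ok, 0..1000} | error.
parse_q(<<"0">>) -> {ok, 0};
parse_q(<<"1">>) -> {ok, 1000};
parse_q(<<"0.", Frac/binary>>) when byte_size(Frac) =< 3 ->
    frac(Frac, 100, 0);
parse_q(<<"1.", Frac/binary>>) when byte_size(Frac) =< 3 ->
    case frac(Frac, 100, 0) of
        {ok, 0} -> {ok, 1000};          % only zeros may follow "1."
        _ -> error
    end;
parse_q(_) -> error.

%% Accumulate digits with weights 100, 10, 1 (tenths, hundredths, thousandths).
frac(<<>>, _Weight, Acc) -> {ok, Acc};
frac(<<D, Rest/binary>>, Weight, Acc) when D >= $0, D =< $9 ->
    frac(Rest, Weight div 10, Acc + (D - $0) * Weight);
frac(_, _, _) -> error.

%% Descending Q. lists:sort/2 is stable, and returning true on equal Q keeps
%% header order as the tiebreak.
-spec sort_by_q([{binary(), 0..1000}]) -> [{binary(), 0..1000}].
sort_by_q(Ranges) ->
    lists:sort(fun({_, Q1}, {_, Q2}) -> Q1 >= Q2 end, Ranges).
```

A caller following the rule above would drop any entry whose parsed value is 0 before sorting.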
best_match/3 / negotiate/2,3 run RFC 4647 Lookup of the (already
priority-ordered) preference list against the available catalog locales,
returning the first supported match (or a default / error).
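As a rough illustration of Lookup against an available set, the sketch below walks each preferred tag from most to least specific and returns the first one found in a canonical-to-original map. It assumes the preferred tags are already canonical and that the index is a plain map; the module lookup_sketch and its functions are hypothetical, and the real index representation may differ.

```erlang
-module(lookup_sketch).
-export([negotiate/2]).

%% Hypothetical sketch.
%% Preferred: canonical tags in priority order (head = most preferred).
%% Index:     canonical tag => original catalog spelling.
-spec negotiate([binary()], #{binary() => binary()}) -> {ok, binary()} | error.
negotiate([], _Index) ->
    error;
negotiate([<<"*">> | Rest], Index) ->
    negotiate(Rest, Index);                 % wildcard ranges are skipped
negotiate([Tag | Rest], Index) ->
    case lookup(Tag, Index) of
        {ok, _} = Hit -> Hit;               % first hit wins
        error -> negotiate(Rest, Index)
    end.

%% Try the tag itself, then progressively drop the trailing subtag.
lookup(Tag, Index) ->
    case maps:find(Tag, Index) of
        {ok, Original} ->
            {ok, Original};
        error ->
            case binary:matches(Tag, <<"_">>) of
                [] ->
                    error;
                Matches ->
                    {Pos, _Len} = lists:last(Matches),
                    lookup(binary:part(Tag, 0, Pos), Index)
            end
    end.
```

Each preferred tag is exhausted down to its base language before the next preference is considered, so an earlier preference's base language beats a later preference's exact match.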
Totality and anti-DoS
Consistent with erli18n_interp and erli18n_plural, the work is bounded
fail-closed and never interns untrusted text into atoms:
- ?MAX_TAG_BYTES (35): a longer tag/range is returned unchanged / skipped.
- ?MAX_SUBTAGS (8): a tag with more subtags is returned unchanged.
- ?MAX_CHAIN (8): fallback chain length cap.
- ?MAX_HEADER_BYTES (4096): a longer Accept-Language header → [].
- ?MAX_RAW_ELEMS (64): comma-split element cap (RFC 9110 §5.6.1) → [].
- ?MAX_RANGES (32): accepted-range budget in parse_accept_language/1; per-consumed-cell budget (32 cells inspected max) in to_locale_list/2.
No binary_to_atom/list_to_atom is used anywhere; locales stay binaries,
so a stream of distinct hostile tags cannot exhaust the atom table.
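The fail-closed header bounds might look like the following sketch: size is checked before any splitting, the element count is checked before any per-element work, and everything stays a binary. The module bounds_sketch and its helpers are hypothetical; only the two limits come from the list above.

```erlang
-module(bounds_sketch).
-export([split_header/1]).

%% Hypothetical sketch. Limits mirror ?MAX_HEADER_BYTES and ?MAX_RAW_ELEMS.
-define(MAX_HEADER_BYTES, 4096).
-define(MAX_RAW_ELEMS, 64).

%% Returns the trimmed, non-empty comma-separated elements, or [] when a
%% bound is exceeded. No atoms are created from header content.
-spec split_header(binary()) -> [binary()].
split_header(Header) when is_binary(Header),
                          byte_size(Header) > ?MAX_HEADER_BYTES ->
    [];                                         % oversized header: fail closed
split_header(Header) when is_binary(Header) ->
    Elems = binary:split(Header, <<",">>, [global]),
    case length(Elems) > ?MAX_RAW_ELEMS of
        true -> [];                             % too many elements: fail closed
        false -> [E || E <- [trim(X) || X <- Elems], E =/= <<>>]
    end.

%% Trim optional whitespace (space and horizontal tab) byte-wise, so invalid
%% UTF-8 in the header cannot raise.
trim(Bin) -> trim_trailing(trim_leading(Bin)).

trim_leading(<<C, Rest/binary>>) when C =:= $\s; C =:= $\t ->
    trim_leading(Rest);
trim_leading(Bin) ->
    Bin.

trim_trailing(Bin) ->
    Size = byte_size(Bin),
    case Size > 0 andalso lists:member(binary:last(Bin), [$\s, $\t]) of
        true -> trim_trailing(binary:part(Bin, 0, Size - 1));
        false -> Bin
    end.
```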
Quickstart
1> erli18n_negotiate:canonicalize(<<"pt-BR">>).
<<"pt_BR">>
2> erli18n_negotiate:canonicalize(<<"iw-IL">>).
<<"he_IL">>
3> erli18n_negotiate:fallback_chain(<<"pt-BR">>, <<"en">>).
[<<"pt_BR">>,<<"pt">>,<<"en">>]
4> erli18n_negotiate:parse_accept_language(<<"da, en-gb;q=0.8, en;q=0.7">>).
[{<<"da">>,1000},{<<"en-gb">>,800},{<<"en">>,700}]
5> erli18n_negotiate:negotiate([<<"pt-BR">>], [<<"pt">>, <<"en">>]).
{ok,<<"pt">>}
Types
A prebuilt canonical→original index of an available-locale set: maps
canonicalize(Original) to the original Available casing, first occurrence
winning. Produced by available_index/1 and consumed by
negotiate_with_index/2, so a caller negotiating many preference lists against
ONE available set builds the index once and reuses it.
-type language_range() :: binary().
An RFC 4647 language range as it appears on the wire in an Accept-Language
header (<<"en-gb">>); may be the wildcard <<"*">>. ASCII-lowercased by
parse_accept_language/1, hyphen-separated (NOT yet canonicalized).
-type locale() :: binary().
A locale tag as a binary, in erli18n catalog-key shape after
canonicalization (<<"pt_BR">>, <<"zh_Hant">>). Same semantics as
erli18n_server:locale/0.
-type qvalue() :: 0..1000.
A quality value as an integer in milli-units, 0..1000 (q=1 → 1000,
q=0.8 → 800). Integer arithmetic avoids float parsing of untrusted text.
Functions
-spec available_index([locale()]) -> available_index().
Builds the canonical→original index for an available-locale set, for reuse
across many negotiate_with_index/2 calls.
Maps canonicalize(A) to the original A for each A in Available, first
occurrence winning (so the earliest entry's original catalog casing is the one
returned by a later match). This is the per-Available work negotiate/2
otherwise repeats on every call; build it once when negotiating multiple
preference lists against the same set. Total.
1> Ix = erli18n_negotiate:available_index([<<"pt_BR">>, <<"fr">>]).
2> erli18n_negotiate:negotiate_with_index([<<"pt-BR">>], Ix).
{ok,<<"pt_BR">>}
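The first-occurrence-wins rule can be sketched as a fold that only inserts a key when it is not already present. The sketch assumes the index is a plain map and takes the key function as an argument so it stays self-contained; in erli18n that role is played by canonicalize/1. The module index_sketch and build/2 are hypothetical names.

```erlang
-module(index_sketch).
-export([build/2]).

%% Hypothetical sketch. KeyFun stands in for canonicalize/1.
-spec build([binary()], fun((binary()) -> binary())) -> #{binary() => binary()}.
build(Available, KeyFun) ->
    lists:foldl(
      fun(Original, Acc) ->
              Key = KeyFun(Original),
              case maps:is_key(Key, Acc) of
                  true -> Acc;                      % first occurrence wins
                  false -> Acc#{Key => Original}    % remember original spelling
              end
      end,
      #{},
      Available).
```

For example, with an ASCII lowercasing function as the key, build([<<"pt_BR">>, <<"PT_br">>], fun string:lowercase/1) keeps <<"pt_BR">> as the value, because it appears first.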
The bare RFC 4647 Lookup primitive: like negotiate/3 but returns the
matched (or Default) locale directly, never wrapped. Always succeeds
(falls to Default). Total.
1> erli18n_negotiate:best_match([<<"en-US">>], [<<"en">>], <<"x">>).
<<"en">>
Canonicalizes ONE BCP-47 / POSIX locale tag to erli18n catalog-key shape.
Underscore-joined, RFC 5646 §2.1.1 positional casing (language lowercase, script Titlecase, region UPPERCASE), with a charset/modifier suffix stripped and a bounded legacy-language alias applied to the language subtag. Hyphen and underscore are equivalent on input.
Total and idempotent: any binary input returns a binary and re-running
produces the same result. A binary over ?MAX_TAG_BYTES, an empty binary,
or a tag with more than ?MAX_SUBTAGS subtags is returned UNCHANGED
(fail-soft). A non-binary argument is a programmer error and raises
function_clause (the contract is binary-in/binary-out).
1> erli18n_negotiate:canonicalize(<<"PT_br">>).
<<"pt_BR">>
2> erli18n_negotiate:canonicalize(<<"zh-hant-tw">>).
<<"zh_Hant_TW">>
3> erli18n_negotiate:canonicalize(<<"ca_ES@valencia">>).
<<"ca_ES">>
4> erli18n_negotiate:canonicalize(<<"iw">>).
<<"he">>
See fallback_chain/2 (uses this) and the module doc for the alias table
and the documented non-goals (zh_Hans ⇄ zh_CN Likely Subtags).
Builds the ordered, deduplicated RFC 4647 Lookup fallback chain for a
locale, ending in Default (canonicalized) unless Default =:= undefined.
Locale is canonicalized first, then the trailing subtag is dropped
repeatedly to a fixpoint (zh_Hant_TW → zh_Hant → zh); Default is appended
last. The result is order-preserving deduplicated and capped at ?MAX_CHAIN.
The head is the most specific candidate. Total; the returned list is always
non-empty (at minimum [canonicalize(Locale)]).
1> erli18n_negotiate:fallback_chain(<<"pt-BR">>, <<"en">>).
[<<"pt_BR">>,<<"pt">>,<<"en">>]
2> erli18n_negotiate:fallback_chain(<<"zh_Hant_TW">>, <<"en">>).
[<<"zh_Hant_TW">>,<<"zh_Hant">>,<<"zh">>,<<"en">>]
3> erli18n_negotiate:fallback_chain(<<"en">>, undefined).
[<<"en">>]
The facade walks this list with one catalog read per candidate, returning on
the first hit; this is what makes a pt_BR user fall back to a loaded pt
catalog. See canonicalize/1.
Picks the best supported locale for a preference list, or error.
Preferred is an ordered preference list (priority = position): either
[locale()] or the [{locale(), qvalue()}] output of
parse_accept_language/1 (the Q is ignored — order already encodes
priority, and q=0 ranges were already dropped). Available is the list of
catalog locales (e.g. erli18n:loaded_catalogs/0 locales).
Each Preferred entry is canonicalized and resolved through its
fallback_chain/2 (no default) against a canonical→original index of
Available; the FIRST hit wins. * ranges are skipped. The returned locale
is the ORIGINAL Available casing. Total.
1> erli18n_negotiate:negotiate([<<"pt-BR">>], [<<"pt">>, <<"en">>]).
{ok,<<"pt">>}
2> erli18n_negotiate:negotiate([<<"zh_Hant">>], [<<"en">>]).
error
See negotiate/3 (default instead of error) and best_match/3.
Like negotiate/2, but returns {ok, Default} instead of error when
nothing matches. Default is the caller's chosen floor (the RFC 4647
Lookup default) and is NOT validated against Available. Total.
1> erli18n_negotiate:negotiate([<<"zh_Hant">>], [<<"en">>], <<"en">>).
{ok,<<"en">>}
-spec negotiate_with_index([locale()] | [{locale(), qvalue()}], available_index()) -> {ok, locale()} | error.
Like negotiate/2, but against a PREBUILT available_index/1 instead of a raw
Available list — so the canonical index is built once and reused across many
preference lists (e.g. one per request source).
negotiate(Preferred, Available) is exactly
negotiate_with_index(Preferred, available_index(Available)); this variant lets a
caller hoist the available_index/1 call out of a per-candidate loop. Semantics are
otherwise identical: each Preferred entry is canonicalized and resolved through
its fallback_chain/2 against the index, first hit winning, returning the
original Available casing. Total.
1> Ix = erli18n_negotiate:available_index([<<"pt">>, <<"en">>]).
2> erli18n_negotiate:negotiate_with_index([<<"pt-BR">>], Ix).
{ok,<<"pt">>}
3> erli18n_negotiate:negotiate_with_index([<<"zh_Hant">>], Ix).
error
See negotiate/2 (raw-list form) and available_index/1.
Builds an explicit-override fallback chain for {explicit, Map} mode: the
canonicalized Overrides list prefixed with canonicalize(Locale) and floored
with Default. Order-preserving deduplicated and bounded by the SAME
?MAX_CHAIN cap as fallback_chain/2. Total.
Exposed so the facade's explicit-map mode reuses one bounding/dedup implementation instead of re-deriving the cap.
1> erli18n_negotiate:override_chain(<<"de-AT">>, [<<"de">>], <<"en">>).
[<<"de_AT">>,<<"de">>,<<"en">>]
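A sketch of a single dedup-and-cap pass that such a chain could share with fallback_chain/2. It assumes all three arguments are already canonical and that Default is a binary; the module override_sketch and the helper dedup_capped/3 are hypothetical, and only the cap value comes from this document.

```erlang
-module(override_sketch).
-export([override_chain/3]).

%% Hypothetical sketch. The cap mirrors the documented ?MAX_CHAIN (8).
-define(MAX_CHAIN, 8).

%% Assumes Locale, Overrides and Default are ALREADY canonical binaries.
-spec override_chain(binary(), [binary()], binary()) -> [binary()].
override_chain(Locale, Overrides, Default) ->
    dedup_capped([Locale | Overrides] ++ [Default], #{}, ?MAX_CHAIN).

%% One pass: skip anything already seen, stop once the budget is spent.
dedup_capped(_List, _Seen, 0) ->
    [];
dedup_capped([], _Seen, _Left) ->
    [];
dedup_capped([H | T], Seen, Left) ->
    case maps:is_key(H, Seen) of
        true -> dedup_capped(T, Seen, Left);
        false -> [H | dedup_capped(T, Seen#{H => true}, Left - 1)]
    end.
```

Counting the budget only on emitted elements means duplicates in a long override list do not consume chain slots.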
-spec parse_accept_language(binary()) -> [{language_range(), qvalue()}].
Parses an HTTP Accept-Language header into [{Range, Q}].
Range is the ASCII-lowercased, hyphen-separated language range as on the
wire (NOT canonicalized; may be <<"*">>); Q is an integer in milli-units
(0..1000). An absent q parameter means 1000; a well-formed q=0 entry
is DROPPED. The list is sorted by descending Q, ties broken by ascending
header position (stable).
Total and fail-soft: any malformed element is skipped, never crashing.
Returns [] on an empty header, a header over ?MAX_HEADER_BYTES, or one
with more than ?MAX_RAW_ELEMS comma elements. A non-binary argument raises
function_clause. At most ?MAX_RANGES ranges are returned.
The output shape matches cowlib's cow_http_hd:parse_accept_language/1, so a
Cowboy app may feed either source into negotiate/2. Unlike cowlib, this
parser never raises on hostile input.
1> erli18n_negotiate:parse_accept_language(<<"da, en-gb;q=0.8, en;q=0.7">>).
[{<<"da">>,1000},{<<"en-gb">>,800},{<<"en">>,700}]
2> erli18n_negotiate:parse_accept_language(<<"fr;q=0, de">>).
[{<<"de">>,1000}]
See best_match/3 and negotiate/2,3.