erli18n observability surface: a thin wrapper over the :telemetry library
that centralizes the event names and shields call sites from the absence of
the optional dependency.
What it is and what problem it solves
telemetry is an optional dependency of erli18n (declared via
optional_applications, OTP 24+): the lib works with or without it. This
module is the only layer that knows this. It solves three problems for the
rest of the code base:
- Safe indirection. Call sites (erli18n_server, the lookup hot path) call emit/3 / span/3 without ever testing whether telemetry is present. When the lib is not loaded, emit/3 is a no-op and span/3 only runs its fun without emitting (zero crash, zero noise), instead of scattering case code:ensure_loaded(...) everywhere.
- Name contract. All erli18n event names live here, exposed as pre-typed event_*/0 functions. A rename or audit is a one-file change. The names are the public contract of observability (convention [<lib>, <operation>, <phase>], in the style of Phoenix.Logger).
- Overhead and security policy. The high-frequency lookup events (miss / fuzzy_skip) are opt-in (flag emit_lookup_telemetry, default false), which minimizes both the cost and the risk of leaking msgid content in a multi-tenant scenario. The memory_warning is rate-limited: at most one emission per configured window.
Mental model
Think of two layers, both lock-free from any process:
- Telemetry detection (sticky-positive cache). The first call performs code:ensure_loaded(telemetry), which walks the code server. If it loads, the true result is stored in persistent_term and stays sticky for the rest of the VM's lifetime (telemetry does not unload at runtime). If it does not load, the false result is not cached: that way, if the consumer brings telemetry up mid-flight (application:start(telemetry)), the next emission already sees it. The price of this choice is, at most, one code:ensure_loaded/1 per emission while telemetry is absent (microseconds), and zero per emission once present.
- Configuration via application:get_env/3. The flags (emit_lookup_telemetry, memory_warning_threshold, memory_warning_rate_limit_seconds) are read on every call, a direct read in the application controller's ETS (~100 ns). There is no per-process state and no caching of these flags.
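The sticky-positive detection described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the module name sticky_detect_sketch, the key shape and the reset/1 helper are assumptions made for the example; only code:ensure_loaded/1 and the persistent_term calls are real OTP functions.

```erlang
-module(sticky_detect_sketch).
-export([loaded/1, reset/1]).

%% Hypothetical private key; the real module keeps its own key.
-define(KEY(Mod), {?MODULE, loaded, Mod}).

%% Returns true when Mod can be loaded from the code path.
%% Only a positive answer is cached, so a module that becomes
%% available later is picked up by the next call.
-spec loaded(module()) -> boolean().
loaded(Mod) when is_atom(Mod) ->
    case persistent_term:get(?KEY(Mod), false) of
        true ->
            %% Sticky fast path: no trip to the code server.
            true;
        false ->
            case code:ensure_loaded(Mod) of
                {module, Mod} ->
                    persistent_term:put(?KEY(Mod), true),
                    true;
                {error, _Reason} ->
                    %% Deliberately not cached.
                    false
            end
    end.

%% Test helper: forget the cached positive answer.
-spec reset(module()) -> ok.
reset(Mod) when is_atom(Mod) ->
    _ = persistent_term:erase(?KEY(Mod)),
    ok.
```

Because the negative answer is never written, the slow path repeats while the dependency is absent, which is the trade-off the text above accepts.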
Trusted vs untrusted: the rate-limit persistent_term key is private to
this module. The functions narrow its value at the boundary; if something
outside reuses the key and writes a non-integer, the code crashes explicitly
instead of operating on garbage. Invalid configuration values (non-boolean,
negative integer) also crash with {invalid_config, ...} — a loud, visible
failure, never silent.
When a dev touches this module
- Observability consumer (attaches handlers): use the event_*/0 names in telemetry:attach/4. Do not call emit/3 / span/3 directly; erli18n is what emits.
- Core maintainer (erli18n_server, hot path): call span/3 to instrument operations with start/stop (load/reload), emit/3 for pointwise events, and lookup_telemetry_enabled/0 to gate the lookup events before building expensive payloads. The loader calls memory_warning_check/1.
Quickstart (consumer)
%% Attach a handler to the catalog-load events:
1> telemetry:attach_many(
.. <<"erli18n-log">>,
.. [erli18n_telemetry:event_catalog_load() ++ [start],
.. erli18n_telemetry:event_catalog_load() ++ [stop]],
.. fun(Event, Measurements, Meta, _Cfg) ->
.. io:format("~p ~p ~p~n", [Event, Measurements, Meta])
.. end,
.. undefined).
ok
%% Lookup events are opt-in; enable them explicitly:
2> application:set_env(erli18n, emit_lookup_telemetry, true).
ok
3> erli18n_telemetry:lookup_telemetry_enabled().
true
Key functions
- Emission: emit/3 (pointwise), span/3 (start/stop/exception).
- Event names: event_catalog_load/0, event_catalog_reload/0, event_catalog_unload/0, event_lookup_miss/0, event_lookup_fuzzy_skip/0, event_plural_divergence/0, event_catalog_memory_warning/0.
- Configuration/gating: lookup_telemetry_enabled/0, memory_warning_threshold/0, memory_warning_rate_limit_seconds/0, memory_warning_check/1.
References
- Library: https://github.com/beam-telemetry/telemetry
- Hexdocs: https://hexdocs.pm/telemetry/
- span/3: https://hexdocs.pm/telemetry/telemetry.html#span-3
- Naming convention [<lib>, <operation>, <phase>]: https://hexdocs.pm/phoenix/Phoenix.Logger.html
- persistent_term (lock-free, copy-free across processes): https://www.erlang.org/doc/man/persistent_term.html
Summary
Types
event_name()
Name of a telemetry event: a list of atoms in the format
[<lib>, <operation>, <phase>] (e.g. [erli18n, catalog, load]). It is the
type returned by all event_*/0 functions and the one accepted by
emit/3 / span/3. The list contains the atoms of the erli18n vocabulary and
admits a free atom() in the tail for extensions (e.g. the start/stop
suffix that span/3 appends).
measurements()
Map of an event's numeric measurements (e.g. #{duration => N},
#{ets_bytes => N}). Structurally it is just a map(); the telemetry
convention is that measurements are aggregable values, distinct from
qualitative metadata.
metadata()
Map of an event's qualitative metadata (e.g. domain, locale,
domain_locales_sample sample). Structurally it is just a map(); it carries
context, not aggregable values.
span_fun()
Body of a span: a fun/0 that must return {Result, StopMetadata}, per the
contract of telemetry:span/3. Result is propagated back by span/3;
StopMetadata is merged into the stop event's metadata (or discarded on the
no-op path, when telemetry is absent).
span_result()
Return value of span/3: the Result produced by span_fun/0.
Functions
emit/3
Emits a pointwise telemetry event (no start/stop semantics; for that use
span/3).
event_catalog_load/0
Event prefix of a catalog's load span (ensure_loaded):
[erli18n, catalog, load]. Since it is a span prefix (via span/3), the
events actually emitted have the start/stop/exception suffix appended.
event_catalog_memory_warning/0
Name of the memory warning event: [erli18n, catalog, memory_warning].
Emitted by memory_warning_check/1 when the catalogs' ETS usage crosses
memory_warning_threshold/0, rate-limited to at most one emission per
memory_warning_rate_limit_seconds/0. Always on (does not go through the
lookup flag).
event_catalog_reload/0
Event prefix of a catalog's atomic reload span:
[erli18n, catalog, reload]. As a span prefix, it receives the
start/stop/exception suffix at runtime.
event_catalog_unload/0
Name of the pointwise catalog unload event:
[erli18n, catalog, unload]. Emitted via emit/3 (not a span).
event_lookup_fuzzy_skip/0
Name of the fuzzy entry skip event in lookup (an entry marked
#, fuzzy in the .po, which gettext ignores): [erli18n, lookup, fuzzy_skip].
A high-frequency event, opt-in under the same flag as the misses
(lookup_telemetry_enabled/0).
event_lookup_miss/0
Name of the lookup miss event (key not found in the catalog):
[erli18n, lookup, miss]. A high-frequency event and therefore opt-in:
only emitted when lookup_telemetry_enabled/0 returns true. Keeping the
default off also avoids exposing msgid content in a multi-tenant scenario.
event_plural_divergence/0
Name of the plural divergence warning event:
[erli18n, plural, divergence_warning]. Emitted at load time when the
Plural-Forms rule in the .po header diverges from the CLDR rule inlined for
the locale (an informative validation; the .po header remains the source of
truth at runtime). Always on (does not go through the lookup flag).
lookup_telemetry_enabled/0
Gate for the high-frequency lookup events (event_lookup_miss/0 and
event_lookup_fuzzy_skip/0). Call sites call this function before building
expensive payloads, so that the overhead only exists when the operator opts in.
memory_warning_check/1
Inspects the MemInfo memory snapshot and emits at most one
event_catalog_memory_warning/0, deciding among not-warning, suppressing by
rate-limit, or warning. Called by the loader (erli18n_server) at the end of a
successful load.
memory_warning_rate_limit_seconds/0
Window, in seconds, between successive emissions of
event_catalog_memory_warning/0. Even if the threshold is crossed on every
load, memory_warning_check/1 only re-emits after this window has elapsed
since the last emission (mitigation: "once per crossing event, not on every
tick").
memory_warning_threshold/0
Threshold, in bytes, of the catalogs' ETS usage above which
event_catalog_memory_warning/0 becomes eligible. Compared against ets_bytes
inside memory_warning_check/1 with a strict > (equaling the threshold does
not fire).
reset_caches/0
Test-only: erases the two persistent_term keys of this module, the sticky
"telemetry loaded" cache (?LOADED_KEY) and the memory_warning rate-limit
anchor (?MEM_WARN_LAST_KEY), simulating a fresh VM between test cases. It is
not part of the documented API surface (do not rely on it in production). It
always returns ok.
span/3
Runs Fun instrumented as a telemetry span, following the contract of
telemetry:span/3 (events with start, stop, and exception).
Types
-type event_name() :: [erli18n | catalog | lookup | plural | load | reload | unload | miss | fuzzy_skip | divergence_warning | memory_warning | atom()].
Name of a telemetry event: a list of atoms in the format
[<lib>, <operation>, <phase>] (e.g. [erli18n, catalog, load]). It is the
type returned by all event_*/0 functions and the one accepted by
emit/3/span/3. The list contains the atoms of the erli18n vocabulary and
admits a free atom() in the tail for extensions (e.g. the start/stop
suffix that span/3 appends).
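Since a span prefix never fires on its own, a consumer usually needs the three suffixed names. The helper below is a hypothetical sketch (the module span_names_sketch and the function span_events/1 are not part of erli18n); it only shows how the suffix is appended to a prefix of type event_name().

```erlang
-module(span_names_sketch).
-export([span_events/1]).

%% Expands a span prefix into the event names a span can emit.
%% The suffix atoms follow the telemetry:span/3 contract.
-spec span_events([atom()]) -> [[atom()]].
span_events(Prefix) when is_list(Prefix) ->
    [Prefix ++ [Suffix] || Suffix <- [start, stop, exception]].
```

The resulting list can be passed as the event list of telemetry:attach_many/4, for example with erli18n_telemetry:event_catalog_load() as the prefix.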
-type measurements() :: map().
Map of an event's numeric measurements (e.g. #{duration => N},
#{ets_bytes => N}). Structurally it is just a map(); the telemetry
convention is that measurements are aggregable values, distinct from
qualitative metadata.
-type metadata() :: map().
Map of an event's qualitative metadata (e.g. domain, locale,
domain_locales_sample sample). Structurally it is just a map(); it carries
context, not aggregable values.
span_fun(): body of a span, a fun/0 that must return {Result, StopMetadata}, per the
contract of telemetry:span/3. Result is propagated back by span/3;
StopMetadata is merged into the stop event's metadata (or discarded on the
no-op path, when telemetry is absent).
-type span_result() :: term().
Return value of span/3: the Result produced by span_fun/0.
Functions
-spec emit(event_name(), measurements(), metadata()) -> ok.
Emits a pointwise telemetry event (no start/stop semantics; for that use
span/3).
Parameters:
- EventName: the event name, typically one of the event_*/0 (e.g. event_catalog_unload/0). Must be a list.
- Measurements: map of numeric/aggregable measurements. Must be a map.
- Metadata: map of qualitative metadata. Must be a map.
Behavior and return: if telemetry is loaded (see the sticky detection in the
moduledoc), it delegates to telemetry:execute/3; otherwise it is a safe
no-op. On both paths it always returns ok — the result of
telemetry:execute/3 is discarded on purpose.
Failure modes: the clause is guarded (is_list/is_map/is_map); calling
with the wrong types results in function_clause (caller crash). The
erlang:apply(telemetry, execute, ...) indirection is intentional: it
makes dialyzer treat the call as an unknown remote function when telemetry
is genuinely absent from the PLT, mirroring the runtime story.
1> erli18n_telemetry:emit(
.. erli18n_telemetry:event_catalog_unload(),
.. #{count => 1},
.. #{domain => my_domain, locale => <<"fr">>}).
ok
The no-op path does not depend on telemetry being loaded in memory, but
on telemetry being absent from the code path — detection (see
telemetry_loaded/0 / moduledoc) uses code:ensure_loaded(telemetry), which
would load the module from the code path if it existed there. In other words:
code:is_loaded(telemetry) =:= false does not make emit/3 a no-op (the
module would still be loaded and the event emitted). The no-op only occurs when
the telemetry app is not in the release/code path; in that scenario the same
call returns ok without emitting anything.
Sibling: span/3 (events with start/stop).
-spec event_catalog_load() -> event_name().
Event prefix of a catalog's load span (ensure_loaded):
[erli18n, catalog, load]. Since it is a span prefix (via span/3), the
events actually emitted have the start/stop/exception suffix appended.
1> erli18n_telemetry:event_catalog_load().
[erli18n,catalog,load]
Siblings: event_catalog_reload/0, event_catalog_unload/0.
-spec event_catalog_memory_warning() -> event_name().
Name of the memory warning event: [erli18n, catalog, memory_warning].
Emitted by memory_warning_check/1 when the catalogs' ETS usage crosses
memory_warning_threshold/0, rate-limited to at most one emission per
memory_warning_rate_limit_seconds/0. Always on (does not go through the
lookup flag).
1> erli18n_telemetry:event_catalog_memory_warning().
[erli18n,catalog,memory_warning]
Emitter: memory_warning_check/1.
-spec event_catalog_reload() -> event_name().
Event prefix of a catalog's atomic reload span:
[erli18n, catalog, reload]. As a span prefix, it receives the
start/stop/exception suffix at runtime.
1> erli18n_telemetry:event_catalog_reload().
[erli18n,catalog,reload]
Siblings: event_catalog_load/0, event_catalog_unload/0.
-spec event_catalog_unload() -> event_name().
Name of the pointwise catalog unload event:
[erli18n, catalog, unload]. Emitted via emit/3 (not a span).
1> erli18n_telemetry:event_catalog_unload().
[erli18n,catalog,unload]
Siblings: event_catalog_load/0, event_catalog_reload/0.
-spec event_lookup_fuzzy_skip() -> event_name().
Name of the fuzzy entry skip event in lookup (an entry marked
#, fuzzy in the .po, which gettext ignores): [erli18n, lookup, fuzzy_skip].
A high-frequency event, opt-in under the same flag as the misses
(lookup_telemetry_enabled/0).
1> erli18n_telemetry:event_lookup_fuzzy_skip().
[erli18n,lookup,fuzzy_skip]
Sibling: event_lookup_miss/0. Gate: lookup_telemetry_enabled/0.
-spec event_lookup_miss() -> event_name().
Name of the lookup miss event (key not found in the catalog):
[erli18n, lookup, miss]. A high-frequency event and therefore opt-in
— only emitted when lookup_telemetry_enabled/0 returns true. Keeping the
default off also avoids exposing msgid content in a multi-tenant scenario.
1> erli18n_telemetry:event_lookup_miss().
[erli18n,lookup,miss]
Sibling: event_lookup_fuzzy_skip/0. Gate: lookup_telemetry_enabled/0.
-spec event_plural_divergence() -> event_name().
Name of the plural divergence warning event:
[erli18n, plural, divergence_warning]. Emitted at load time when the
Plural-Forms rule in the .po header diverges from the CLDR rule inlined for
the locale (an informative validation — the .po header remains the source of
truth at runtime). Always on (does not go through the lookup flag).
1> erli18n_telemetry:event_plural_divergence().
[erli18n,plural,divergence_warning]
-spec lookup_telemetry_enabled() -> boolean().
Gate for the high-frequency lookup events (event_lookup_miss/0 and
event_lookup_fuzzy_skip/0). Call sites call this function before building
expensive payloads, so that the overhead only exists when the operator opts in.
Reads the app env emit_lookup_telemetry (default false — opt-in, also for
multi-tenant security reasons). The read is a direct access to the application
controller's ETS (~100 ns); this function does not eliminate the overhead
of looking up the flag itself, only that of having handlers attached — it is
the theoretical limit of the design.
Return and failure modes: true for true, false for false. Any other
configured value is a configuration error and triggers an explicit crash with
error({invalid_config, {erli18n, emit_lookup_telemetry, Other, expected, boolean}}) — a loud, visible failure, never a silent "treat as false".
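The "narrow at the boundary, crash loudly" rule described above can be sketched like this. The module name config_gate_sketch and the helper narrow_boolean/1 are assumptions for illustration; the env key, the default and the error shape are the ones stated in the text, and application:get_env/3 is the real OTP call.

```erlang
-module(config_gate_sketch).
-export([lookup_enabled/0]).

%% Reads the flag on every call; there is no caching in this sketch.
-spec lookup_enabled() -> boolean().
lookup_enabled() ->
    narrow_boolean(application:get_env(erli18n, emit_lookup_telemetry, false)).

%% Accepts only real booleans. Anything else is treated as a
%% configuration error and raised, never coerced to false.
narrow_boolean(true) ->
    true;
narrow_boolean(false) ->
    false;
narrow_boolean(Other) ->
    error({invalid_config,
           {erli18n, emit_lookup_telemetry, Other, expected, boolean}}).
```

Raising on a value such as "yes" keeps a typo in sys.config from silently disabling the lookup events.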
1> erli18n_telemetry:lookup_telemetry_enabled().
false
2> application:set_env(erli18n, emit_lookup_telemetry, true).
ok
3> erli18n_telemetry:lookup_telemetry_enabled().
true
4> application:set_env(erli18n, emit_lookup_telemetry, "yes").
ok
5> erli18n_telemetry:lookup_telemetry_enabled().
** exception error: {invalid_config,{erli18n,emit_lookup_telemetry,"yes",expected,boolean}}
Siblings (config): memory_warning_threshold/0,
memory_warning_rate_limit_seconds/0.
-spec memory_warning_check(map()) -> not_warned | rate_limited | warned.
Inspects the MemInfo memory snapshot and emits at most one
event_catalog_memory_warning/0, deciding among not-warning, suppressing by
rate-limit, or warning. Called by the loader (erli18n_server) at the end of a
successful load.
Parameter:
- MemInfo: a snapshot map. The keys read are ets_bytes (ETS usage, the trigger; default 0 if absent), num_catalogs and num_keys (only used in the measurement when warning; default 0). Must be a map, otherwise function_clause.
Decision logic:
- If ets_bytes is not > memory_warning_threshold/0, returns not_warned (strict > comparison).
- Otherwise, if the memory_warning_rate_limit_seconds/0 window has not yet elapsed since the last emission, returns rate_limited without emitting.
- Otherwise, writes the current instant to the anchor, builds the sample and emits via emit/3, returning warned.
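The three-way decision above can be isolated as a pure function. This is a sketch under assumptions, not the module's code: the module mem_warn_decision_sketch and decide/4 are hypothetical, the elapsed time is passed in as an argument (undefined meaning "never emitted"), and the window test is assumed to be a strict Elapsed < Window, which is consistent with a window of 0 re-emitting on every crossing.

```erlang
-module(mem_warn_decision_sketch).
-export([decide/4]).

%% decide(EtsBytes, Threshold, ElapsedSeconds, WindowSeconds)
%% ElapsedSeconds is undefined when no warning was emitted before.
-spec decide(non_neg_integer(), non_neg_integer(),
             undefined | non_neg_integer(), non_neg_integer()) ->
          not_warned | rate_limited | warned.
decide(EtsBytes, Threshold, _Elapsed, _Window) when EtsBytes =< Threshold ->
    %% Strict comparison: equaling the threshold does not fire.
    not_warned;
decide(_EtsBytes, _Threshold, undefined, _Window) ->
    %% First crossing ever: nothing to rate-limit against.
    warned;
decide(_EtsBytes, _Threshold, Elapsed, Window) when Elapsed < Window ->
    rate_limited;
decide(_EtsBytes, _Threshold, _Elapsed, _Window) ->
    warned.
```

Keeping the decision free of side effects makes it easy to test; in the real function the warned branch additionally writes the anchor and emits the event.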
Side effects: the rate-limit anchor is a private key in persistent_term
(lock-free from any process), updated only on an actual emission.
Rewriting the key via persistent_term:put/2 may trigger GC work proportional
to the processes that still hold references to the previous value of this
key — not an unconditional global full GC of the VM. Here that is cheap (the
previous value is a single timestamp integer, with no long-lived holders) and,
moreover, it only happens on the warned path (rare, by design), so the cost
is acceptable. The payload of the warned event has:
- measurements #{ets_bytes, threshold_bytes, num_catalogs, num_keys};
- metadata #{domain_locales_sample => [...]}, a sample of up to 10 {Domain, Locale} pairs (payload bound in a multi-tenant deployment), collected by collect_domain_locales_sample/0.
Failure modes: if ets_bytes or the counters are non-numeric, the > or the
construction of the measurements crash. If the persistent_term anchor holds a
non-integer (someone reusing the private key — a contract violation), the
boundary crashes with {invalid_persistent_term, ...} instead of operating on
garbage.
%% Below the default threshold (100 MiB): nothing happens.
1> erli18n_telemetry:memory_warning_check(#{ets_bytes => 1024}).
not_warned
%% Above the threshold: the first call warns...
2> erli18n_telemetry:memory_warning_check(
.. #{ets_bytes => 209715200, num_catalogs => 3, num_keys => 4096}).
warned
%% ...and the next one, within the rate-limit window, is suppressed.
3> erli18n_telemetry:memory_warning_check(#{ets_bytes => 209715200}).
rate_limited
Config: memory_warning_threshold/0, memory_warning_rate_limit_seconds/0.
Event: event_catalog_memory_warning/0. In tests, reset_caches/0 zeroes the
anchor.
-spec memory_warning_rate_limit_seconds() -> non_neg_integer().
Window, in seconds, between successive emissions of
event_catalog_memory_warning/0. Even if the threshold is crossed on every
load, memory_warning_check/1 only re-emits after this window has elapsed
since the last emission (mitigation: "once per crossing event, not on every
tick").
Reads the app env memory_warning_rate_limit_seconds (default 60).
Return and failure modes: a valid non_neg_integer(). A value that is not an
integer >= 0 triggers a crash with error({invalid_config, {erli18n, memory_warning_rate_limit_seconds, Other, expected, non_neg_integer}}). A
value of 0 makes every crossing re-emit (a degenerate window, with no
effective rate limit).
1> erli18n_telemetry:memory_warning_rate_limit_seconds().
60
2> application:set_env(erli18n, memory_warning_rate_limit_seconds, 300).
ok
3> erli18n_telemetry:memory_warning_rate_limit_seconds().
300
Consumer: memory_warning_check/1. Sibling: memory_warning_threshold/0.
-spec memory_warning_threshold() -> non_neg_integer().
Threshold, in bytes, of the catalogs' ETS usage above which
event_catalog_memory_warning/0 becomes eligible. Compared against ets_bytes
inside memory_warning_check/1 with a strict > (equaling the threshold does
not fire).
Reads the app env memory_warning_threshold (default 104857600, 100 MiB).
Return and failure modes: a valid non_neg_integer(). Any value that is not an
integer >= 0 (negative, non-integer) triggers a crash with
error({invalid_config, {erli18n, memory_warning_threshold, Other, expected, non_neg_integer}}).
1> erli18n_telemetry:memory_warning_threshold().
104857600
2> application:set_env(erli18n, memory_warning_threshold, 52428800).
ok
3> erli18n_telemetry:memory_warning_threshold().
52428800
4> application:set_env(erli18n, memory_warning_threshold, -1).
ok
5> erli18n_telemetry:memory_warning_threshold().
** exception error: {invalid_config,{erli18n,memory_warning_threshold,-1,expected,non_neg_integer}}
Consumer: memory_warning_check/1. Sibling: memory_warning_rate_limit_seconds/0.
-spec reset_caches() -> ok.
Test-only: erases the two persistent_term keys of this module — the sticky
"telemetry loaded" cache (?LOADED_KEY) and the memory_warning rate-limit
anchor (?MEM_WARN_LAST_KEY) — simulating a fresh VM between test cases. It is
not part of the documented API surface (do not rely on it in production). It
always returns ok.
Useful for making deterministic the tests of memory_warning_check/1 (which
switches from warned to rate_limited depending on the anchor) and those of
telemetry detection.
1> erli18n_telemetry:reset_caches().
ok
-spec span(event_name(), metadata(), span_fun()) -> span_result().
Runs Fun instrumented as a telemetry span, following the contract of
telemetry:span/3 (events with start, stop, and exception).
Parameters:
- EventPrefix: the event prefix (e.g. event_catalog_load/0). Telemetry appends start/stop/exception to this prefix. Must be a list.
- StartMetadata: metadata already available in the start event (and merged into stop). Must be a map.
- Fun: the span body, a fun/0 that MUST return {Result, StopMetadata} (see span_fun/0).
Contract semantics (path with telemetry loaded): emits
EventPrefix ++ [start] with measurements #{monotonic_time, system_time};
runs Fun; emits EventPrefix ++ [stop] with #{monotonic_time, duration}
and StartMetadata merged with StopMetadata. If Fun raises an exception,
it emits EventPrefix ++ [exception] (with #{kind, reason, stacktrace} in
the metadata) instead of stop, and the exception re-propagates. It delegates
to telemetry:span/3 to keep the measurements byte-equal to what :telemetry
users expect.
No-op path semantics (telemetry absent): it still runs Fun — otherwise
the lib would behave differently with vs without telemetry, which is
unacceptable — and discards StopMetadata (there is nowhere to emit it). No
event is emitted.
Return: on both paths, the Result produced by Fun (see span_result/0).
Failure modes: guarded clause (is_list/is_map/is_function(Fun, 0)); wrong
types => function_clause. If Fun does not return a {Result, StopMetadata}
tuple, both paths crash, but asymmetrically with respect to the events
already emitted:
- No-op path (telemetry absent): crashes with badmatch at {Result, _StopMetadata} = Fun() before any emission; no event goes out (consistent with the no-op never emitting anything).
- Path with telemetry: telemetry:span/3 has already emitted the EventPrefix ++ [start] event before inspecting Fun's return, so the consumer sees an orphan start (without a matching stop or exception) followed by the crash inside the telemetry lib itself when matching the invalid shape. This is exactly the symptom to look for when debugging start events without a stop.
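The two paths can be sketched as a pair of clauses. This is an illustration under assumptions: the module span_noop_sketch is hypothetical, and the detection result is passed in as an explicit boolean instead of being read from the sticky cache. The erlang:apply/3 indirection mirrors the one the documentation describes for emit/3.

```erlang
-module(span_noop_sketch).
-export([span/4]).

%% span(TelemetryLoaded, EventPrefix, StartMetadata, Fun)
-spec span(boolean(), [atom()], map(), fun(() -> {term(), map()})) -> term().
span(true, Prefix, StartMeta, Fun)
  when is_list(Prefix), is_map(StartMeta), is_function(Fun, 0) ->
    %% Delegates to the real library; requires telemetry on the code path.
    erlang:apply(telemetry, span, [Prefix, StartMeta, Fun]);
span(false, Prefix, StartMeta, Fun)
  when is_list(Prefix), is_map(StartMeta), is_function(Fun, 0) ->
    %% No-op path: still run the body so behavior is the same with or
    %% without telemetry. A wrong return shape fails here with badmatch,
    %% before anything could have been emitted.
    {Result, _StopMetadata} = Fun(),
    Result.
```

Calling the false clause with a fun that returns a bare term instead of a 2-tuple reproduces the badmatch described for the no-op path.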
1> erli18n_telemetry:span(
.. erli18n_telemetry:event_catalog_load(),
.. #{domain => my_domain, locale => <<"fr">>},
.. fun() ->
.. Result = do_load(), %% instrumented work
.. {Result, #{entries => 128}} %% {Result, StopMetadata}
.. end).
Result
Sibling: emit/3 (pointwise events).