erli18n_telemetry (erli18n v0.1.0)


erli18n observability surface: a thin wrapper over the :telemetry library that centralizes the event names and shields call sites from the absence of the optional dependency.

What it is and what problem it solves

telemetry is an optional dependency of erli18n (declared via optional_applications, OTP 24+): the lib works with or without it. This module is the only layer that knows this. It solves three problems for the rest of the code base:

  • Safe indirection. Call sites (erli18n_server, the lookup hot path) call emit/3 and span/3 without ever testing whether telemetry is present. When the lib is not available, both become no-ops (no crash, no noise), so there is no need to scatter case code:ensure_loaded(...) checks everywhere.
  • Name contract. All erli18n event names live here, exposed as pre-typed event_*/0 functions. A rename or audit is a one-file change. The names are the public contract of observability (convention [<lib>, <operation>, <phase>], in the style of Phoenix.Logger).
  • Overhead and security policy. The high-frequency lookup events (miss/fuzzy_skip) are opt-in (flag emit_lookup_telemetry, default false) — this minimizes both the cost and the risk of leaking msgid content in a multi-tenant scenario. The memory_warning is rate-limited: at most one emission per configured window.

Mental model

Think of two layers, both lock-free from any process:

  • Telemetry detection (sticky-positive cache). The first call performs code:ensure_loaded(telemetry), which walks the code server. If it loads, the true result is stored in persistent_term and stays sticky for the rest of the VM's lifetime (telemetry does not unload at runtime). If it does not load, the false result is not cached: that way, if the consumer brings telemetry up mid-flight (application:start(telemetry)), the next emission already sees it. The price of this choice is, at most, one code:ensure_loaded/1 per emission while telemetry is absent (microseconds), and zero per emission once present.
  • Configuration via application:get_env/3. The flags (emit_lookup_telemetry, memory_warning_threshold, memory_warning_rate_limit_seconds) are read on every call — a direct read in the application controller's ETS (~100 ns). There is no per-process state and no caching of these flags.

Trusted vs untrusted: the rate-limit persistent_term key is private to this module. The functions narrow its value at the boundary; if something outside reuses the key and writes a non-integer, the code crashes explicitly instead of operating on garbage. Invalid configuration values (non-boolean, negative integer) also crash with {invalid_config, ...} — a loud, visible failure, never silent.
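
The sticky-positive detection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual source: the module name sticky_detect_sketch and the persistent_term key are hypothetical, and only the caching policy (cache true, never cache false) is taken from the text.

```erlang
-module(sticky_detect_sketch).
-export([loaded/0]).

%% Hypothetical key for this sketch; the real module keeps its own private key.
-define(LOADED_KEY, {?MODULE, telemetry_loaded}).

-spec loaded() -> boolean().
loaded() ->
    case persistent_term:get(?LOADED_KEY, false) of
        true ->
            %% Sticky positive: once telemetry has been seen, skip the code server.
            true;
        false ->
            case code:ensure_loaded(telemetry) of
                {module, telemetry} ->
                    persistent_term:put(?LOADED_KEY, true),
                    true;
                {error, _Reason} ->
                    %% Deliberately not cached, so telemetry becoming
                    %% available later is noticed on the next call.
                    false
            end
    end.
```

Because only the positive result is stored, the persistent_term write happens at most once per VM lifetime in this sketch.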

When a dev touches this module

  • Observability consumer (attaches handlers): use the event_*/0 names in telemetry:attach/4. Do not call emit/3 or span/3 directly; emitting is erli18n's job.
  • Core maintainer (erli18n_server, hot path): call span/3 to instrument operations with start/stop (load/reload), emit/3 for discrete events, and lookup_telemetry_enabled/0 to gate the lookup events before building expensive payloads. The loader calls memory_warning_check/1.
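
For the core-maintainer case, the gating pattern might look like the sketch below. The module name lookup_gate_sketch, the function report_miss/3, and the payload keys (count, domain, locale, msgid) are assumptions made for illustration; only lookup_telemetry_enabled/0, emit/3 and event_lookup_miss/0 come from this module's documented API.

```erlang
-module(lookup_gate_sketch).
-export([report_miss/3]).

%% Hypothetical call-site helper: check the gate first, then build the payload.
-spec report_miss(atom(), binary(), binary()) -> ok.
report_miss(Domain, Locale, MsgId) ->
    case erli18n_telemetry:lookup_telemetry_enabled() of
        true ->
            %% The maps are only constructed when the operator has opted in.
            erli18n_telemetry:emit(
                erli18n_telemetry:event_lookup_miss(),
                #{count => 1},                      %% assumed measurement key
                #{domain => Domain, locale => Locale, msgid => MsgId});
        false ->
            ok
    end.
```

With the flag off, the only cost on the hot path is the flag read itself.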

Quickstart (consumer)

%% Attach a handler to the catalog-load events:
1> telemetry:attach_many(
..     <<"erli18n-log">>,
..     [erli18n_telemetry:event_catalog_load() ++ [start],
..      erli18n_telemetry:event_catalog_load() ++ [stop]],
..     fun(Event, Measurements, Meta, _Cfg) ->
..         io:format("~p ~p ~p~n", [Event, Measurements, Meta])
..     end,
..     undefined).
ok
%% Lookup events are opt-in; enable them explicitly:
2> application:set_env(erli18n, emit_lookup_telemetry, true).
ok
3> erli18n_telemetry:lookup_telemetry_enabled().
true

Types

event_name()

-type event_name() ::
          [erli18n | catalog | lookup | plural | load | reload | unload | miss | fuzzy_skip |
           divergence_warning | memory_warning |
           atom()].

Name of a telemetry event: a list of atoms in the format [<lib>, <operation>, <phase>] (e.g. [erli18n, catalog, load]). It is the type returned by all event_*/0 functions and the one accepted by emit/3 and span/3. The list contains the atoms of the erli18n vocabulary and admits a free atom() in the tail for extensions (e.g. the start/stop suffix that span/3 appends).

measurements()

-type measurements() :: map().

Map of an event's numeric measurements (e.g. #{duration => N}, #{ets_bytes => N}). Structurally it is just a map(); the telemetry convention is that measurements are aggregable values, distinct from qualitative metadata.

metadata()

-type metadata() :: map().

Map of an event's qualitative metadata (e.g. domain, locale, domain_locales_sample). Structurally it is just a map(); it carries context, not aggregable values.

span_fun()

-type span_fun() :: fun(() -> {term(), metadata()}).

Body of a span: a fun/0 that must return {Result, StopMetadata}, per the contract of telemetry:span/3. Result is propagated back by span/3; StopMetadata is merged into the stop event's metadata (or discarded on the no-op path, when telemetry is absent).

span_result()

-type span_result() :: term().

Return value of span/3: the Result produced by span_fun/0.

Functions

emit(EventName, Measurements, Metadata)

-spec emit(event_name(), measurements(), metadata()) -> ok.

Emits a discrete telemetry event (no start/stop semantics; for that use span/3).

Parameters:

  • EventName — the event name, typically one of the event_*/0 (e.g. event_catalog_unload/0). Must be a list.
  • Measurements — map of numeric/aggregable measurements. Must be a map.
  • Metadata — map of qualitative metadata. Must be a map.

Behavior and return: if telemetry is loaded (see the sticky detection in the moduledoc), it delegates to telemetry:execute/3; otherwise it is a safe no-op. On both paths it always returns ok — the result of telemetry:execute/3 is discarded on purpose.

Failure modes: the clause is guarded (is_list/is_map/is_map); calling with the wrong types results in function_clause (caller crash). The erlang:apply(telemetry, execute, ...) indirection is intentional: it makes dialyzer treat the call as an unknown remote function when telemetry is genuinely absent from the PLT, mirroring the runtime story.

1> erli18n_telemetry:emit(
..     erli18n_telemetry:event_catalog_unload(),
..     #{count => 1},
..     #{domain => my_domain, locale => <<"fr">>}).
ok

The no-op path depends on telemetry being absent from the code path, not on whether the module is currently loaded in memory. Detection (see telemetry_loaded/0 and the moduledoc) uses code:ensure_loaded(telemetry), which loads the module from the code path if it is there. In other words, code:is_loaded(telemetry) =:= false does not make emit/3 a no-op: the module would still be loaded and the event emitted. The no-op only occurs when the telemetry app is not in the release/code path; in that scenario the same call returns ok without emitting anything.

Sibling: span/3 (events with start/stop).
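
The guarded clause and the erlang:apply/3 indirection described above could be sketched like this. It is a simplified illustration, not the module's source: the module name emit_sketch is hypothetical, and detection is reduced to a plain code:ensure_loaded/1 call without the sticky cache.

```erlang
-module(emit_sketch).
-export([emit/3]).

-spec emit([atom()], map(), map()) -> ok.
emit(EventName, Measurements, Metadata)
  when is_list(EventName), is_map(Measurements), is_map(Metadata) ->
    case code:ensure_loaded(telemetry) of
        {module, telemetry} ->
            %% apply/3 keeps the call dynamic, so the build does not need
            %% telemetry to be present; the result is discarded on purpose.
            _ = erlang:apply(telemetry, execute, [EventName, Measurements, Metadata]),
            ok;
        {error, _Reason} ->
            %% telemetry is not on the code path: safe no-op.
            ok
    end.
```

Calling the sketch with a non-map measurement fails with function_clause, matching the failure mode described for emit/3.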

event_catalog_load()

-spec event_catalog_load() -> event_name().

Event prefix of a catalog's load span (ensure_loaded): [erli18n, catalog, load]. Since it is a span prefix (via span/3), the events actually emitted have the start/stop/exception suffix appended.

1> erli18n_telemetry:event_catalog_load().
[erli18n,catalog,load]

Siblings: event_catalog_reload/0, event_catalog_unload/0.

event_catalog_memory_warning()

-spec event_catalog_memory_warning() -> event_name().

Name of the memory warning event: [erli18n, catalog, memory_warning]. Emitted by memory_warning_check/1 when the catalogs' ETS usage crosses memory_warning_threshold/0, rate-limited to at most one emission per memory_warning_rate_limit_seconds/0. Always on (does not go through the lookup flag).

1> erli18n_telemetry:event_catalog_memory_warning().
[erli18n,catalog,memory_warning]

Emitter: memory_warning_check/1.
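
A consumer that wants to log this warning might attach a handler along these lines. This is a sketch under assumptions: the module name, handler id and log wording are hypothetical, while the payload keys (ets_bytes, threshold_bytes, domain_locales_sample) are the ones documented for memory_warning_check/1. A named module function is used as the handler instead of an anonymous fun.

```erlang
-module(mem_warning_handler_sketch).
-export([attach/0, handle_event/4, summary/2]).

%% Attaches the handler; requires telemetry and erli18n on the code path.
attach() ->
    telemetry:attach(
        <<"mem-warning-log">>,                               %% hypothetical handler id
        erli18n_telemetry:event_catalog_memory_warning(),
        fun ?MODULE:handle_event/4,
        undefined).

%% telemetry handler callback: (EventName, Measurements, Metadata, Config).
handle_event(_Event, Measurements, Metadata, _Config) ->
    logger:warning("~ts", [summary(Measurements, Metadata)]).

%% Pure formatting helper, kept separate so it can be tested in isolation.
summary(Measurements, Metadata) ->
    Used = maps:get(ets_bytes, Measurements, 0),
    Limit = maps:get(threshold_bytes, Measurements, 0),
    Sample = maps:get(domain_locales_sample, Metadata, []),
    lists:flatten(
        io_lib:format("catalog ETS usage ~b bytes exceeds ~b bytes; sample: ~p",
                      [Used, Limit, Sample])).
```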

event_catalog_reload()

-spec event_catalog_reload() -> event_name().

Event prefix of a catalog's atomic reload span: [erli18n, catalog, reload]. As a span prefix, it receives the start/stop/exception suffix at runtime.

1> erli18n_telemetry:event_catalog_reload().
[erli18n,catalog,reload]

Siblings: event_catalog_load/0, event_catalog_unload/0.

event_catalog_unload()

-spec event_catalog_unload() -> event_name().

Name of the discrete catalog unload event: [erli18n, catalog, unload]. Emitted via emit/3 (not a span).

1> erli18n_telemetry:event_catalog_unload().
[erli18n,catalog,unload]

Siblings: event_catalog_load/0, event_catalog_reload/0.

event_lookup_fuzzy_skip()

-spec event_lookup_fuzzy_skip() -> event_name().

Name of the lookup event for a skipped fuzzy entry (an entry flagged with #, fuzzy in the .po file, which gettext ignores): [erli18n, lookup, fuzzy_skip]. A high-frequency event, opt-in under the same flag as the misses (lookup_telemetry_enabled/0).

1> erli18n_telemetry:event_lookup_fuzzy_skip().
[erli18n,lookup,fuzzy_skip]

Sibling: event_lookup_miss/0. Gate: lookup_telemetry_enabled/0.

event_lookup_miss()

-spec event_lookup_miss() -> event_name().

Name of the lookup miss event (key not found in the catalog): [erli18n, lookup, miss]. A high-frequency event and therefore opt-in — only emitted when lookup_telemetry_enabled/0 returns true. Keeping the default off also avoids exposing msgid content in a multi-tenant scenario.

1> erli18n_telemetry:event_lookup_miss().
[erli18n,lookup,miss]

Sibling: event_lookup_fuzzy_skip/0. Gate: lookup_telemetry_enabled/0.

event_plural_divergence()

-spec event_plural_divergence() -> event_name().

Name of the plural divergence warning event: [erli18n, plural, divergence_warning]. Emitted at load time when the Plural-Forms rule in the .po header diverges from the CLDR rule inlined for the locale (an informative validation — the .po header remains the source of truth at runtime). Always on (does not go through the lookup flag).

1> erli18n_telemetry:event_plural_divergence().
[erli18n,plural,divergence_warning]

lookup_telemetry_enabled()

-spec lookup_telemetry_enabled() -> boolean().

Gate for the high-frequency lookup events (event_lookup_miss/0 and event_lookup_fuzzy_skip/0). Call sites call this function before building expensive payloads, so that the overhead only exists when the operator opts in.

Reads the app env emit_lookup_telemetry (default false: opt-in, also for multi-tenant security reasons). The read is a direct access to the application controller's ETS (~100 ns). The gate cannot remove the cost of reading the flag itself; it only removes what follows, namely building the payload and emitting to attached handlers. That single read is the floor for this design.

Return and failure modes: returns true when the flag is true and false when it is false. Any other configured value is a configuration error and triggers an explicit crash with error({invalid_config, {erli18n, emit_lookup_telemetry, Other, expected, boolean}}). The failure is loud and visible, never a silent "treat as false".

1> erli18n_telemetry:lookup_telemetry_enabled().
false
2> application:set_env(erli18n, emit_lookup_telemetry, true).
ok
3> erli18n_telemetry:lookup_telemetry_enabled().
true
4> application:set_env(erli18n, emit_lookup_telemetry, "yes").
ok
5> erli18n_telemetry:lookup_telemetry_enabled().
** exception error: {invalid_config,{erli18n,emit_lookup_telemetry,"yes",expected,boolean}}

Siblings (config): memory_warning_threshold/0, memory_warning_rate_limit_seconds/0.
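
The read-and-validate behavior described above can be approximated with the sketch below. The module name flag_sketch is hypothetical; the application name, flag name, default and error shape are taken from the text.

```erlang
-module(flag_sketch).
-export([enabled/0]).

-spec enabled() -> boolean().
enabled() ->
    %% Read on every call; nothing is cached in the calling process.
    case application:get_env(erli18n, emit_lookup_telemetry, false) of
        true ->
            true;
        false ->
            false;
        Other ->
            %% Anything that is not a boolean is a configuration error.
            error({invalid_config,
                   {erli18n, emit_lookup_telemetry, Other, expected, boolean}})
    end.
```

The same shape would apply to the integer-valued flags, with an is_integer/1 and >= 0 check in place of the boolean clauses.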

memory_warning_check(MemInfo)

-spec memory_warning_check(map()) -> not_warned | rate_limited | warned.

Inspects the MemInfo memory snapshot and emits at most one event_catalog_memory_warning/0, deciding among not-warning, suppressing by rate-limit, or warning. Called by the loader (erli18n_server) at the end of a successful load.

Parameter:

  • MemInfo — a snapshot map. The keys read are ets_bytes (ETS usage, the trigger; default 0 if absent), num_catalogs and num_keys (only used in the measurement when warning; default 0). Must be a map, otherwise function_clause.

Decision logic:

  1. If ets_bytes =< memory_warning_threshold/0, returns not_warned (the comparison is a strict >).
  2. Otherwise, if the memory_warning_rate_limit_seconds/0 window has not yet elapsed since the last emission, returns rate_limited without emitting.
  3. Otherwise, writes the current instant to the anchor, builds the sample and emits via emit/3, returning warned.

Side effects: the rate-limit anchor is a private key in persistent_term (lock-free from any process), updated only on an actual emission. Rewriting the key via persistent_term:put/2 may trigger GC work proportional to the processes that still hold references to the previous value of this key — not an unconditional global full GC of the VM. Here that is cheap (the previous value is a single timestamp integer, with no long-lived holders) and, moreover, it only happens on the warned path (rare, by design), so the cost is acceptable. The payload of the warned event has:

  • measurements #{ets_bytes, threshold_bytes, num_catalogs, num_keys};
  • metadata #{domain_locales_sample => [...]}, a sample of up to 10 {Domain, Locale} pairs (payload bound in a multi-tenant deployment), collected by collect_domain_locales_sample/0.

Failure modes: if ets_bytes or the counters are non-numeric, the > or the construction of the measurements crash. If the persistent_term anchor holds a non-integer (someone reusing the private key — a contract violation), the boundary crashes with {invalid_persistent_term, ...} instead of operating on garbage.

%% Below the default threshold (100 MiB): nothing happens.
1> erli18n_telemetry:memory_warning_check(#{ets_bytes => 1024}).
not_warned
%% Above the threshold: the first call warns...
2> erli18n_telemetry:memory_warning_check(
..     #{ets_bytes => 209715200, num_catalogs => 3, num_keys => 4096}).
warned
%% ...and the next one, within the rate-limit window, is suppressed.
3> erli18n_telemetry:memory_warning_check(#{ets_bytes => 209715200}).
rate_limited

Config: memory_warning_threshold/0, memory_warning_rate_limit_seconds/0. Event: event_catalog_memory_warning/0. In tests, reset_caches/0 zeroes the anchor.
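
The three-way decision can be sketched as a standalone function. Several details here are assumptions rather than facts about the real module: the module name mem_warn_sketch, the key name, passing the threshold and window as arguments instead of reading app env, the use of erlang:monotonic_time/1 as the clock, and the exact shape of the invalid_persistent_term error. The event emission is left as a comment.

```erlang
-module(mem_warn_sketch).
-export([check/3]).

%% Hypothetical private key holding the timestamp of the last warning.
-define(LAST_KEY, {?MODULE, last_warning}).

-spec check(map(), non_neg_integer(), non_neg_integer()) ->
          not_warned | rate_limited | warned.
check(MemInfo, ThresholdBytes, WindowSecs) when is_map(MemInfo) ->
    EtsBytes = maps:get(ets_bytes, MemInfo, 0),
    case EtsBytes > ThresholdBytes of
        false ->
            %% Strict comparison: equaling the threshold does not fire.
            not_warned;
        true ->
            Now = erlang:monotonic_time(second),   %% assumed clock source
            case persistent_term:get(?LAST_KEY, undefined) of
                Last when is_integer(Last), Now - Last < WindowSecs ->
                    rate_limited;
                Last when is_integer(Last); Last =:= undefined ->
                    %% The anchor is written only on an actual emission.
                    persistent_term:put(?LAST_KEY, Now),
                    %% The real function would emit the warning event here.
                    warned;
                Other ->
                    %% Someone else wrote to the private key: fail loudly.
                    error({invalid_persistent_term, Other})
            end
    end.
```

With WindowSecs set to 0 the rate-limit clause never matches, so every crossing warns, which mirrors the degenerate window described for memory_warning_rate_limit_seconds/0.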

memory_warning_rate_limit_seconds()

-spec memory_warning_rate_limit_seconds() -> non_neg_integer().

Window, in seconds, between successive emissions of event_catalog_memory_warning/0. Even if the threshold is crossed on every load, memory_warning_check/1 only re-emits after this window has elapsed since the last emission (mitigation: "once per crossing event, not on every tick").

Reads the app env memory_warning_rate_limit_seconds (default 60).

Return and failure modes: a valid non_neg_integer(). A value that is not an integer >= 0 triggers a crash with error({invalid_config, {erli18n, memory_warning_rate_limit_seconds, Other, expected, non_neg_integer}}). A value of 0 makes every crossing re-emit (a degenerate window, with no effective rate limit).

1> erli18n_telemetry:memory_warning_rate_limit_seconds().
60
2> application:set_env(erli18n, memory_warning_rate_limit_seconds, 300).
ok
3> erli18n_telemetry:memory_warning_rate_limit_seconds().
300

Consumer: memory_warning_check/1. Sibling: memory_warning_threshold/0.

memory_warning_threshold()

-spec memory_warning_threshold() -> non_neg_integer().

Threshold, in bytes, of the catalogs' ETS usage above which event_catalog_memory_warning/0 becomes eligible. Compared against ets_bytes inside memory_warning_check/1 with a strict > (equaling the threshold does not fire).

Reads the app env memory_warning_threshold (default 104857600, 100 MiB).

Return and failure modes: a valid non_neg_integer(). Any value that is not an integer >= 0 (negative, non-integer) triggers a crash with error({invalid_config, {erli18n, memory_warning_threshold, Other, expected, non_neg_integer}}).

1> erli18n_telemetry:memory_warning_threshold().
104857600
2> application:set_env(erli18n, memory_warning_threshold, 52428800).
ok
3> erli18n_telemetry:memory_warning_threshold().
52428800
4> application:set_env(erli18n, memory_warning_threshold, -1).
ok
5> erli18n_telemetry:memory_warning_threshold().
** exception error: {invalid_config,{erli18n,memory_warning_threshold,-1,expected,non_neg_integer}}

Consumer: memory_warning_check/1. Sibling: memory_warning_rate_limit_seconds/0.

reset_caches()

-spec reset_caches() -> ok.

Test-only: erases the two persistent_term keys of this module — the sticky "telemetry loaded" cache (?LOADED_KEY) and the memory_warning rate-limit anchor (?MEM_WARN_LAST_KEY) — simulating a fresh VM between test cases. It is not part of the documented API surface (do not rely on it in production). It always returns ok.

Useful for making tests deterministic, both those of memory_warning_check/1 (which switches from warned to rate_limited depending on the anchor) and those of telemetry detection.

1> erli18n_telemetry:reset_caches().
ok

span(EventPrefix, StartMetadata, Fun)

-spec span(event_name(), metadata(), span_fun()) -> span_result().

Runs Fun instrumented as a telemetry span, following the contract of telemetry:span/3 (events with start, stop, and exception).

Parameters:

  • EventPrefix — the event prefix (e.g. event_catalog_load/0). Telemetry appends start/stop/exception to this prefix. Must be a list.
  • StartMetadata — metadata already available in the start event (and merged into stop). Must be a map.
  • Fun — the span body, a fun/0 that MUST return {Result, StopMetadata} (see span_fun/0).

Contract semantics (path with telemetry loaded): emits EventPrefix ++ [start] with measurements #{monotonic_time, system_time}; runs Fun; emits EventPrefix ++ [stop] with #{monotonic_time, duration} and StartMetadata merged with StopMetadata. If Fun raises an exception, it emits EventPrefix ++ [exception] (with #{kind, reason, stacktrace} in the metadata) instead of stop, and the exception re-propagates. It delegates to telemetry:span/3 to keep the measurements byte-equal to what :telemetry users expect.

No-op path semantics (telemetry absent): it still runs Fun — otherwise the lib would behave differently with vs without telemetry, which is unacceptable — and discards StopMetadata (there is nowhere to emit it). No event is emitted.

Return: on both paths, the Result produced by Fun (see span_result/0).

Failure modes: guarded clause (is_list/is_map/is_function(Fun, 0)); wrong types => function_clause. If Fun does not return a {Result, StopMetadata} tuple, both paths crash, but asymmetrically with respect to the events already emitted:

  • No-op path (telemetry absent): crashes with badmatch at {Result, _StopMetadata} = Fun() before any emission — no event goes out (consistent with the no-op never emitting anything).
  • Path with telemetry: telemetry:span/3 has already emitted the EventPrefix ++ [start] event before inspecting Fun's return, so the consumer sees an orphan start (without a matching stop or exception) followed by the crash inside the telemetry lib itself when matching the invalid shape. This is exactly the symptom to look for when debugging start events without a stop.

1> erli18n_telemetry:span(
..     erli18n_telemetry:event_catalog_load(),
..     #{domain => my_domain, locale => <<"fr">>},
..     fun() ->
..         Result = do_load(),           %% instrumented work
..         {Result, #{entries => 128}}   %% {Result, StopMetadata}
..     end).
Result

Sibling: emit/3 (discrete events).
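
The two paths of span/3 can be sketched as follows. As with the other sketches, this is an illustration under assumptions: the module name span_sketch is hypothetical and detection is a plain code:ensure_loaded/1 call rather than the cached check.

```erlang
-module(span_sketch).
-export([span/3]).

-spec span([atom()], map(), fun(() -> {term(), map()})) -> term().
span(EventPrefix, StartMetadata, Fun)
  when is_list(EventPrefix), is_map(StartMetadata), is_function(Fun, 0) ->
    case code:ensure_loaded(telemetry) of
        {module, telemetry} ->
            %% Delegate so start/stop/exception events follow telemetry's contract.
            erlang:apply(telemetry, span, [EventPrefix, StartMetadata, Fun]);
        {error, _Reason} ->
            %% No-op path: still run Fun, drop StopMetadata, emit nothing.
            {Result, _StopMetadata} = Fun(),
            Result
    end.
```

On the no-op path a Fun that returns anything other than a two-element tuple fails with badmatch before any event could have been emitted.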