Cantrip emits structured :telemetry events at process, gate, and medium boundaries. This doc is the canonical reference for what gets emitted, how to subscribe, and what to alert on.

Audience: operators deploying Cantrip, instrumentation engineers, production support.

Standard: every documented event is asserted by a regression test. Events not on this list are not load-bearing.


Event registry

All events are emitted under the [:cantrip, ...] prefix.

| Event | Measurements | Metadata | Emitted from |
|---|---|---|---|
| [:cantrip, :entity, :start] | | entity_id, intent, trace_id | EntityServer.handle_call(:run, ...) when an episode begins |
| [:cantrip, :entity, :stop] | duration | entity_id, reason, trace_id | EntityServer.emit_entity_stop/2 when an episode terminates or is truncated |
| [:cantrip, :turn, :start] | | entity_id, turn_number, trace_id | EntityServer.run_loop/1 per turn |
| [:cantrip, :turn, :stop] | duration | entity_id, turn_number, trace_id | EntityServer.emit_turn_stop/3 per turn |
| [:cantrip, :gate, :start] | | entity_id, gate_name, trace_id | Gate.Executor.emit_gate_start/2 per gate invocation |
| [:cantrip, :gate, :stop] | duration | entity_id, gate_name, is_error, trace_id | Gate.Executor.emit_gate_stop/4 per gate invocation |
| [:cantrip, :code, :eval] | duration | entity_id, trace_id | Medium.Code per LLM-emitted Elixir evaluation |
| [:cantrip, :bash, :eval] | duration | entity_id, trace_id | Medium.Bash per shell command |
| [:cantrip, :usage] | prompt_tokens, completion_tokens, total_tokens | entity_id, turn_number, trace_id | EntityServer.run_loop/1 after provider response |
| [:cantrip, :redact, :hit] | count | entity_id, trace_id | Redact.scan/1 when boundary redaction removes a credential |
| [:cantrip, :fold, :trigger] | | entity_id, turn_number, trace_id | EntityServer.run_loop/1 when folding fires |
| [:cantrip, :ward, :truncate] | | entity_id, ward, trace_id | EntityServer.run_loop/1 when a ward stops execution |
| [:cantrip, :ward, :child_rejected] | count | entity_id, child_id, child_medium, reason, trace_id | child-cast coordinator when declaration-time child wards reject a spawn |
| [:cantrip, :child, :start] | | entity_id, child_depth, trace_id | child-cast coordinator before child cast |
| [:cantrip, :child, :stop] | | entity_id, child_depth, outcome, trace_id | child-cast coordinator after child cast |
| [:cantrip, :loom, :persist_error] | count | storage_module, event_type, reason, trace_id | Loom.append_event/2 when the storage backend rejects a write |
| [:cantrip, :compile_and_load] | duration | entity_id, module, outcome, trace_id | EntityServer.execute_compile_and_load/2 per hot-load attempt |

duration measurements are System.monotonic_time/0 deltas in native time units. Convert them with System.convert_time_unit/3 at the subscriber.
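As a small illustration of that conversion, the sketch below simulates a duration the way the registry describes it (a delta between two monotonic readings) and converts it to milliseconds. The module name DurationExample is hypothetical and not part of Cantrip.

```elixir
defmodule DurationExample do
  # Converts a native-unit duration, as carried in the :duration
  # measurement, to whole milliseconds.
  def to_ms(native_duration) when is_integer(native_duration) do
    System.convert_time_unit(native_duration, :native, :millisecond)
  end
end

# Simulate a measurement: the difference between two
# System.monotonic_time/0 readings, in native units.
started = System.monotonic_time()
Process.sleep(20)
duration = System.monotonic_time() - started

IO.puts("elapsed: #{DurationExample.to_ms(duration)} ms")
```

Converting at the subscriber keeps the emitted value lossless; each consumer can pick the unit its backend expects.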

Metadata invariants

  • entity_id is always a binary, present on every event.
  • trace_id is always a binary, present on every event. Propagates from parent cantrip context through child cantrips so a full trace forms a tree rooted at the originating episode.
  • User-supplied strings that are intentionally useful for operations, such as root intents, pass through the internal redaction boundary before emission so credential-shaped substrings are scrubbed. LLM responses, provider response bodies, bearer tokens, and raw credentials must not appear in event metadata.

Subscribing

Quick local logging

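# Logger.info/1 is a macro, so Logger must be required first.
require Logger

# Log one line per lifecycle event. An anonymous handler is fine for
# local debugging; :telemetry warns about it and prefers a
# &Module.function/4 capture for long-lived handlers.
:telemetry.attach_many(
  "cantrip-logger",
  [
    [:cantrip, :entity, :start],
    [:cantrip, :entity, :stop],
    [:cantrip, :turn, :stop],
    [:cantrip, :gate, :stop]
  ],
  fn event, measurements, metadata, _config ->
    Logger.info(
      "#{Enum.join(event, ".")} | #{inspect(measurements)} | #{inspect(metadata)}"
    )
  end,
  nil
)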
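For anything longer-lived than a debugging session, the same subscription could be written with a module-based handler, which :telemetry prefers over anonymous functions. This is a sketch only: CantripLogHandler and its function names are hypothetical, and it assumes the :telemetry package is available in the project.

```elixir
defmodule CantripLogHandler do
  require Logger

  @handler_id "cantrip-logger"

  @events [
    [:cantrip, :entity, :start],
    [:cantrip, :entity, :stop],
    [:cantrip, :turn, :stop],
    [:cantrip, :gate, :stop]
  ]

  # Attach once, for example at application start.
  # Returns :ok, or {:error, :already_exists} if attached twice.
  def attach do
    :telemetry.attach_many(@handler_id, @events, &__MODULE__.handle_event/4, nil)
  end

  # Removes the handler again, for example in test teardown.
  def detach, do: :telemetry.detach(@handler_id)

  # Called by :telemetry in the process that emitted the event.
  def handle_event(event, measurements, metadata, _config) do
    Logger.info(format(event, measurements, metadata))
  end

  # Pure formatting helper, kept separate so it can be unit-tested
  # without attaching anything.
  def format(event, measurements, metadata) do
    "#{Enum.join(event, ".")} | #{inspect(measurements)} | #{inspect(metadata)}"
  end
end
```

Because handlers run in the emitting process, keeping handle_event/4 cheap avoids adding latency to the turn or gate being measured.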

Production observability stack

The event prefix [:cantrip, ...] maps cleanly to most metric backends. Recommended subscriptions for production deployments:

  • [:cantrip, :turn, :stop] → histogram of duration per entity_id for turn-latency tracking.
  • [:cantrip, :gate, :stop] → histogram of duration per gate_name; counter of is_error: true per gate_name for gate-error rates.
  • [:cantrip, :entity, :stop] → counter per reason to track terminated vs truncated vs error termination.
  • [:cantrip, :usage] → counters for prompt/completion/total token volume per entity_id.
  • [:cantrip, :ward, :truncate] → counter per ward to see which guard is stopping work.
  • [:cantrip, :ward, :child_rejected] → counter per reason to catch child-spawn policy pressure or prompt drift.
  • [:cantrip, :redact, :hit] → counter of credential-shaped content removed from entity/model-visible boundaries.
  • [:cantrip, :child, :start] / [:cantrip, :child, :stop] → counters and outcome tags for delegation fanout.
  • [:cantrip, :code, :eval] and [:cantrip, :bash, :eval] → histogram of duration for medium-evaluation latency.
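The subscriptions listed above that the StatsD example does not cover could be declared along the following lines. This is a sketch, assuming the telemetry_metrics package is a dependency; the module name CantripMetrics and the metric names are illustrative choices, not names defined by Cantrip.

```elixir
defmodule CantripMetrics do
  import Telemetry.Metrics

  # Returns metric definitions to hand to a Telemetry.Metrics reporter.
  # Unless :event_name is given, the event is derived from the metric
  # name: all segments but the last form the event, the last segment
  # names the measurement.
  def metrics do
    [
      # Episode outcomes, split by termination reason.
      counter("cantrip.entity.stop.count", tags: [:reason]),

      # Token volume per entity. Tagging by entity_id can produce many
      # series; drop the tag if the backend struggles with cardinality.
      sum("cantrip.usage.prompt_tokens", tags: [:entity_id]),
      sum("cantrip.usage.completion_tokens", tags: [:entity_id]),
      sum("cantrip.usage.total_tokens", tags: [:entity_id]),

      # Which ward is stopping work, and why child spawns are rejected.
      counter("cantrip.ward.truncate.count", tags: [:ward]),
      counter("cantrip.ward.child_rejected.count", tags: [:reason]),

      # Sums the :count measurement of each redaction hit.
      sum("cantrip.redact.hit.count"),

      # Delegation fanout, tagged by child outcome.
      counter("cantrip.child.stop.count", tags: [:outcome]),

      # Medium-evaluation latency, converted from native units.
      distribution("cantrip.code.eval.duration", unit: {:native, :millisecond}),
      distribution("cantrip.bash.eval.duration", unit: {:native, :millisecond})
    ]
  end
end
```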

Example StatsD attachment (using telemetry_metrics_statsd):

metrics = [
  Telemetry.Metrics.distribution("cantrip.turn.stop.duration",
    event_name: [:cantrip, :turn, :stop],
    measurement: :duration,
    unit: {:native, :millisecond}
  ),
  Telemetry.Metrics.distribution("cantrip.gate.stop.duration",
    event_name: [:cantrip, :gate, :stop],
    measurement: :duration,
    unit: {:native, :millisecond},
    tags: [:gate_name]
  ),
  # Counts only gate stops whose metadata has is_error: true, per gate.
  Telemetry.Metrics.counter("cantrip.gate.error.count",
    event_name: [:cantrip, :gate, :stop],
    keep: &(&1.is_error),
    tags: [:gate_name]
  )
]

TelemetryMetricsStatsd.start_link(metrics: metrics)

Prometheus, Datadog, and other backends have equivalent Telemetry.Metrics-based adapters.
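Whichever reporter is chosen, it would normally run under the application's supervision tree rather than being started ad hoc. The sketch below shows one way to do that for the StatsD reporter, assuming telemetry_metrics and telemetry_metrics_statsd are dependencies. MyApp.Telemetry is a hypothetical module name.

```elixir
defmodule MyApp.Telemetry do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    children = [
      # The reporter attaches its own telemetry handlers on start and
      # is restarted by the supervisor if it crashes.
      {TelemetryMetricsStatsd, metrics: metrics()}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  defp metrics do
    [
      Telemetry.Metrics.distribution("cantrip.turn.stop.duration",
        event_name: [:cantrip, :turn, :stop],
        measurement: :duration,
        unit: {:native, :millisecond}
      )
    ]
  end
end
```

Adding MyApp.Telemetry to the application's child list is then enough to keep the metrics flowing across reporter restarts.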


| Signal | Threshold | Why |
|---|---|---|
| cantrip.gate.error.rate | > 5% over 5 min, per gate_name | High gate error rate = LLM misuse or provider drift |
| cantrip.turn.stop.duration p95 | > 60s | Long turns suggest provider slowness, runaway code-medium evaluation, or hung gate |
| cantrip.entity.stop.reason = :truncated | > 10% over 1 hour | High truncation rate = max_turns ward set too low for the workload |
| cantrip.ward.truncate.count | sudden increase by ward | A runtime guard is stopping work more often than expected |
| cantrip.redact.hit.count | any unexpected sustained rate | User data or files contain credential-shaped content reaching observation boundaries |
| cantrip.code.eval.duration p95 | > 30s | Long code-medium evaluations suggest sandbox starvation or hung port |
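The gate error-rate signal is a ratio of two counts per gate_name. In practice the metrics backend would compute it over a time window; the sketch below only illustrates the arithmetic on [:cantrip, :gate, :stop] metadata. GateErrorRate is a hypothetical module, the gate names in the sample are made up, and windowing is deliberately left out.

```elixir
defmodule GateErrorRate do
  # State is a plain map of gate_name => {total, errors}.

  # Folds the metadata of one [:cantrip, :gate, :stop] event into state.
  def record(state, %{gate_name: gate, is_error: is_error}) do
    {total, errors} = Map.get(state, gate, {0, 0})
    Map.put(state, gate, {total + 1, errors + if(is_error, do: 1, else: 0)})
  end

  # Error rate for one gate, or 0.0 when the gate was never seen.
  def rate(state, gate) do
    case Map.get(state, gate) do
      nil -> 0.0
      {total, errors} -> errors / total
    end
  end

  # Gates whose error rate exceeds the threshold (5% by default).
  def breaching(state, threshold \\ 0.05) do
    for {gate, _counts} <- state, rate(state, gate) > threshold, do: gate
  end
end

# Hand-written sample metadata standing in for real events.
state =
  [
    %{gate_name: "bash", is_error: true},
    %{gate_name: "bash", is_error: false},
    %{gate_name: "read_file", is_error: false}
  ]
  |> Enum.reduce(%{}, fn metadata, acc -> GateErrorRate.record(acc, metadata) end)

IO.inspect(GateErrorRate.breaching(state))
```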

Trace correlation

trace_id propagates through child cantrips via the parent context. A full trace for a parent episode that spawns N child cantrips is:

trace_id = "<root-uuid>"
   [:cantrip, :entity, :start] entity_id=parent_id
     [:cantrip, :turn, :start] turn_number=1
     [:cantrip, :gate, :start] gate_name=call_entity  (spawns child)
       [:cantrip, :entity, :start] entity_id=child_id  (same trace_id)
       [:cantrip, :turn, :start] turn_number=1
       [:cantrip, :turn, :stop] turn_number=1
       [:cantrip, :entity, :stop] entity_id=child_id
     [:cantrip, :gate, :stop] gate_name=call_entity
     [:cantrip, :turn, :stop] turn_number=1
   [:cantrip, :entity, :stop] entity_id=parent_id

All events in this tree carry the same trace_id. To correlate to external systems (HTTP request IDs, job queue IDs, etc.), pass the external ID as trace_id when running the top-level cantrip:

Cantrip.cast(cantrip, intent, trace_id: external_request_id)
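To check that correlation works end to end, a subscriber can bucket the events it receives by trace_id and confirm that parent and child entities land in the same bucket. The sketch below does this on hand-written sample data. TraceBuckets is a hypothetical helper, and a real subscriber would collect the pairs from a telemetry handler.

```elixir
defmodule TraceBuckets do
  # Groups {event, metadata} pairs by trace_id. Enum.group_by/3 keeps
  # arrival order inside each bucket.
  def group(events) do
    Enum.group_by(
      events,
      fn {_event, metadata} -> metadata.trace_id end,
      fn {event, metadata} -> {event, metadata.entity_id} end
    )
  end
end

# Sample events: a parent and its child share "req-1", while an
# unrelated episode carries "req-2".
events = [
  {[:cantrip, :entity, :start], %{entity_id: "parent", trace_id: "req-1"}},
  {[:cantrip, :entity, :start], %{entity_id: "child", trace_id: "req-1"}},
  {[:cantrip, :entity, :start], %{entity_id: "other", trace_id: "req-2"}}
]

IO.inspect(TraceBuckets.group(events))
```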

ACP requests can use the protocol metadata channel. Put a non-empty string in _meta.trace_id (or _meta.cantrip_trace_id) on session/new or session/prompt; the Familiar ACP runtime stores it on the session and passes it into Cantrip.summon/3 or Cantrip.send/3 so entity, turn, gate, usage, child, and code events carry the caller's external trace ID. Other _meta fields are ignored by Cantrip's ACP boundary; editor metadata cannot override the configured LLM, loom path, or turn budget.

{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "session/prompt",
  "params": {
    "sessionId": "sess_123",
    "_meta": {"trace_id": "http-request-abc"},
    "prompt": [{"type": "text", "text": "Inspect the failing test"}]
  }
}

When no external trace ID is supplied, Cantrip mints a fresh per-session entity trace ID.
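The lookup-then-fallback behaviour described above can be sketched as follows. This is not Cantrip's implementation: TraceIdFromMeta is hypothetical, checking trace_id before cantrip_trace_id is an assumption, and the format of the generated ID is an arbitrary choice for illustration.

```elixir
defmodule TraceIdFromMeta do
  # Assumed precedence: "trace_id" first, then "cantrip_trace_id".
  @keys ["trace_id", "cantrip_trace_id"]

  # Returns the first non-empty string found under "_meta", or nil.
  # All other _meta fields are ignored.
  def extract(%{"_meta" => meta}) when is_map(meta) do
    Enum.find_value(@keys, fn key ->
      case Map.get(meta, key) do
        value when is_binary(value) and value != "" -> value
        _ -> nil
      end
    end)
  end

  def extract(_params), do: nil

  # Uses the caller's ID when present, otherwise generates a fresh one.
  def resolve(params), do: extract(params) || generate()

  # 16 random bytes as lowercase hex; the real ID format may differ.
  defp generate do
    Base.encode16(:crypto.strong_rand_bytes(16), case: :lower)
  end
end

params = %{
  "sessionId" => "sess_123",
  "_meta" => %{"trace_id" => "http-request-abc"}
}

IO.puts(TraceIdFromMeta.resolve(params))
```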


What is not emitted (and why)

  • LLM provider request/response bodies. They are too large and they contain prompts. If you need partial visibility into provider traffic, attach your own handler with :telemetry.attach_many and apply your own redaction; do not log raw bodies.
  • Loom record contents. The loom is the durable trace; subscribe to the loom directly via the Cantrip.Loom API if you need turn-level data. Telemetry is for operational metrics, not the data plane.
  • Stack traces. Errors arrive as already-redacted observation strings. Unredacted stack traces stay internal.

Event registry in code

The runtime event registry is used by tests and documentation review. New telemetry surfaces should be added there first, then pinned by a regression test and documented in the table above.
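The shape of the runtime registry is not shown in this document, so the following is only a sketch of the kind of check a regression test could make: compare a captured event against the measurement and metadata keys listed in the table. EventPin is hypothetical, and its two entries are copied from the table rows for [:cantrip, :turn, :stop] and [:cantrip, :usage].

```elixir
defmodule EventPin do
  # event => {required measurement keys, required metadata keys}
  @registry %{
    [:cantrip, :turn, :stop] =>
      {[:duration], [:entity_id, :turn_number, :trace_id]},
    [:cantrip, :usage] =>
      {[:prompt_tokens, :completion_tokens, :total_tokens],
       [:entity_id, :turn_number, :trace_id]}
  }

  # Returns :ok when every documented key is present,
  # {:error, {:missing, keys}} when some are absent, and
  # {:error, :unregistered} for events not in the registry.
  def check(event, measurements, metadata) do
    case Map.fetch(@registry, event) do
      :error ->
        {:error, :unregistered}

      {:ok, {measurement_keys, metadata_keys}} ->
        missing =
          (measurement_keys -- Map.keys(measurements)) ++
            (metadata_keys -- Map.keys(metadata))

        if missing == [], do: :ok, else: {:error, {:missing, missing}}
    end
  end
end

IO.inspect(
  EventPin.check(
    [:cantrip, :turn, :stop],
    %{duration: 1},
    %{entity_id: "e1", turn_number: 1, trace_id: "t1"}
  )
)
```

A test could call a check like this from a telemetry handler attached for the duration of the test, so that an event losing a documented key fails loudly.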