Cantrip emits structured :telemetry events at process, gate, and medium
boundaries. This doc is the canonical reference for what gets emitted, how to
subscribe, and what to alert on.
Audience: operators deploying Cantrip, instrumentation engineers, production support.
Standard: every documented event is asserted by a regression test. Events not on this list are not load-bearing.
Event registry
All events are emitted under the [:cantrip, ...] prefix.
| Event | Measurements | Metadata | Emitted from |
|---|---|---|---|
| [:cantrip, :entity, :start] | — | entity_id, intent, trace_id | EntityServer.handle_call(:run, ...) when an episode begins |
| [:cantrip, :entity, :stop] | duration | entity_id, reason, trace_id | EntityServer.emit_entity_stop/2 when an episode terminates or is truncated |
| [:cantrip, :turn, :start] | — | entity_id, turn_number, trace_id | EntityServer.run_loop/1 per turn |
| [:cantrip, :turn, :stop] | duration | entity_id, turn_number, trace_id | EntityServer.emit_turn_stop/3 per turn |
| [:cantrip, :gate, :start] | — | entity_id, gate_name, trace_id | Gate.Executor.emit_gate_start/2 per gate invocation |
| [:cantrip, :gate, :stop] | duration | entity_id, gate_name, is_error, trace_id | Gate.Executor.emit_gate_stop/4 per gate invocation |
| [:cantrip, :code, :eval] | duration | entity_id, trace_id | Medium.Code per LLM-emitted Elixir evaluation |
| [:cantrip, :bash, :eval] | duration | entity_id, trace_id | Medium.Bash per shell command |
| [:cantrip, :usage] | prompt_tokens, completion_tokens, total_tokens | entity_id, turn_number, trace_id | EntityServer.run_loop/1 after provider response |
| [:cantrip, :redact, :hit] | count | entity_id, trace_id | Redact.scan/1 when boundary redaction removes a credential |
| [:cantrip, :fold, :trigger] | — | entity_id, turn_number, trace_id | EntityServer.run_loop/1 when folding fires |
| [:cantrip, :ward, :truncate] | — | entity_id, ward, trace_id | EntityServer.run_loop/1 when a ward stops execution |
| [:cantrip, :ward, :child_rejected] | count | entity_id, child_id, child_medium, reason, trace_id | child-cast coordinator when declaration-time child wards reject a spawn |
| [:cantrip, :child, :start] | — | entity_id, child_depth, trace_id | child-cast coordinator before child cast |
| [:cantrip, :child, :stop] | — | entity_id, child_depth, outcome, trace_id | child-cast coordinator after child cast |
| [:cantrip, :loom, :persist_error] | count | storage_module, event_type, reason, trace_id | Loom.append_event/2 when the storage backend rejects a write |
| [:cantrip, :compile_and_load] | duration | entity_id, module, outcome, trace_id | EntityServer.execute_compile_and_load/2 per hot-load attempt |
duration measurements are System.monotonic_time/0 deltas (native units —
convert with System.convert_time_unit/3 at the subscriber).
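As a minimal sketch of that conversion, the snippet below turns a native-unit delta into milliseconds and measures a function with two monotonic readings, as the events are described to do. `DurationSketch` and its function names are hypothetical and not part of Cantrip.

```elixir
defmodule DurationSketch do
  # Hypothetical helper for subscribers; not part of Cantrip.

  # Convert a native-unit duration (as carried in event measurements)
  # to whole milliseconds. The result is rounded down.
  def to_ms(native_duration) when is_integer(native_duration) do
    System.convert_time_unit(native_duration, :native, :millisecond)
  end

  # Measure a zero-arity function with two monotonic readings and a delta.
  def timed(fun) when is_function(fun, 0) do
    started = System.monotonic_time()
    result = fun.()
    {result, System.monotonic_time() - started}
  end
end

# Usage: the second tuple element is in native units until converted.
{:ok, duration} = DurationSketch.timed(fn -> :ok end)
IO.puts("took #{DurationSketch.to_ms(duration)} ms")
```

Converting at the subscriber keeps the emitted value lossless; each backend can pick its own unit.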
Metadata invariants
- entity_id is always a binary, present on every event.
- trace_id is always a binary, present on every event. Propagates from parent cantrip context through child cantrips so a full trace forms a tree rooted at the originating episode.
- User-supplied strings that are intentionally useful for operations, such as root intents, pass through the internal redaction boundary before emission so credential-shaped substrings are scrubbed. LLM responses, provider response bodies, bearer tokens, and raw credentials must not appear in event metadata.
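A subscriber can check the first two invariants defensively before forwarding an event. The following is a sketch under the assumption that metadata arrives as a map with atom keys; `MetadataInvariantSketch` is a hypothetical name, not a Cantrip module.

```elixir
defmodule MetadataInvariantSketch do
  # Hypothetical subscriber-side check; not part of Cantrip.
  # Returns :ok when entity_id and trace_id are both binaries,
  # otherwise {:error, keys} listing the keys that are missing or non-binary.
  def check(metadata) when is_map(metadata) do
    offending =
      for key <- [:entity_id, :trace_id],
          not is_binary(Map.get(metadata, key)),
          do: key

    if offending == [], do: :ok, else: {:error, offending}
  end
end

MetadataInvariantSketch.check(%{entity_id: "e1", trace_id: "t1"})
```

The registry row for [:cantrip, :loom, :persist_error] lists no entity_id, so a failure reported for that event is worth confirming against the runtime before treating it as a bug.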
Subscribing
Quick local logging
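# Logger.info/1 is a macro, so Logger must be required first.
require Logger

# Handlers run synchronously in the process that emits the event; keep them cheap.
:telemetry.attach_many(
  "cantrip-logger",
  [
    [:cantrip, :entity, :start],
    [:cantrip, :entity, :stop],
    [:cantrip, :turn, :stop],
    [:cantrip, :gate, :stop]
  ],
  # An anonymous handler is fine for local debugging; for long-lived handlers
  # :telemetry recommends a module function capture such as &Mod.fun/4.
  fn event, measurements, metadata, _config ->
    Logger.info(
      "#{Enum.join(event, ".")} | #{inspect(measurements)} | #{inspect(metadata)}"
    )
  end,
  nil
)

Production observability stack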
The event prefix [:cantrip, ...] maps cleanly to most metric backends.
Recommended subscriptions for production deployments:
- [:cantrip, :turn, :stop] → histogram of duration per entity_id for turn-latency tracking.
- [:cantrip, :gate, :stop] → histogram of duration per gate_name; counter of is_error: true per gate_name for gate-error rates.
- [:cantrip, :entity, :stop] → counter per reason to track terminated vs truncated vs error termination.
- [:cantrip, :usage] → counters for prompt/completion/total token volume per entity_id.
- [:cantrip, :ward, :truncate] → counter per ward to see which guard is stopping work.
- [:cantrip, :ward, :child_rejected] → counter per reason to catch child-spawn policy pressure or prompt drift.
- [:cantrip, :redact, :hit] → counter of credential-shaped content removed from entity/model-visible boundaries.
- [:cantrip, :child, :start] / [:cantrip, :child, :stop] → counters and outcome tags for delegation fanout.
- [:cantrip, :code, :eval] and [:cantrip, :bash, :eval] → histogram of duration for medium-evaluation latency.
Example StatsD attachment (using telemetry_metrics_statsd):
metrics = [
  # Turn latency; the reporter converts native units to milliseconds.
  Telemetry.Metrics.distribution("cantrip.turn.stop.duration",
    event_name: [:cantrip, :turn, :stop],
    measurement: :duration,
    unit: {:native, :millisecond}
  ),
  # Gate latency, broken down by gate.
  Telemetry.Metrics.distribution("cantrip.gate.stop.duration",
    event_name: [:cantrip, :gate, :stop],
    measurement: :duration,
    unit: {:native, :millisecond},
    tags: [:gate_name]
  ),
  # Counts only gate stops whose metadata has is_error: true, per gate.
  Telemetry.Metrics.counter("cantrip.gate.error.count",
    event_name: [:cantrip, :gate, :stop],
    keep: &(&1.is_error),
    tags: [:gate_name]
  )
]

TelemetryMetricsStatsd.start_link(metrics: metrics)

Prometheus, Datadog, and other backends have equivalent
Telemetry.Metrics-based adapters.
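The remaining recommended subscriptions can be written the same way. This is a sketch that assumes the telemetry_metrics dependency is available; the metric names are illustrative choices, not names Cantrip defines. When event_name is omitted, Telemetry.Metrics derives the event from the metric name by dropping its last segment.

```elixir
# Assumes the telemetry_metrics dependency; metric names are illustrative.
extra_metrics = [
  # Event [:cantrip, :usage], measurement :total_tokens, both derived from the name.
  Telemetry.Metrics.sum("cantrip.usage.total_tokens", tags: [:entity_id]),
  # Counters count events, so the trailing :count segment only names the metric.
  Telemetry.Metrics.counter("cantrip.entity.stop.count", tags: [:reason]),
  Telemetry.Metrics.counter("cantrip.ward.truncate.count", tags: [:ward])
]
```

Tagging by entity_id can produce high-cardinality series on some backends; dropping that tag is a reasonable trade-off if only aggregate token volume matters.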
Recommended alerts
| Signal | Threshold | Why |
|---|---|---|
| cantrip.gate.error.rate | > 5% over 5 min, per gate_name | High gate error rate = LLM misuse or provider drift |
| cantrip.turn.stop.duration p95 | > 60s | Long turns suggest provider slowness, runaway code-medium evaluation, or hung gate |
| cantrip.entity.stop.reason = :truncated | > 10% over 1 hour | High truncation rate = max_turns ward set too low for the workload |
| cantrip.ward.truncate.count | sudden increase by ward | A runtime guard is stopping work more often than expected |
| cantrip.redact.hit.count | any unexpected sustained rate | User data or files contain credential-shaped content reaching observation boundaries |
| cantrip.code.eval.duration p95 | > 30s | Long code-medium evaluations suggest sandbox starvation or hung port |
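To make the gate error-rate signal concrete, here is a rough sketch of the arithmetic over one alerting window. It assumes the [:cantrip, :gate, :stop] metadata maps have already been collected into a list; `GateErrorRateSketch` is hypothetical, and in practice the metrics backend would usually compute this.

```elixir
defmodule GateErrorRateSketch do
  # Hypothetical helper; not part of Cantrip.
  # `events` is a list of [:cantrip, :gate, :stop] metadata maps from one window.
  # Returns a map of gate_name => error rate (0.0 to 1.0).
  def rates(events) do
    events
    |> Enum.group_by(& &1.gate_name)
    |> Map.new(fn {gate, stops} ->
      errors = Enum.count(stops, & &1.is_error)
      {gate, errors / length(stops)}
    end)
  end

  # Gates whose error rate exceeds the threshold (0.05 mirrors the table's 5%).
  def breaching(events, threshold \\ 0.05) do
    for {gate, rate} <- rates(events), rate > threshold, do: gate
  end
end

# Sample window: gate names here are made-up sample data.
window =
  List.duplicate(%{gate_name: "done", is_error: false}, 10) ++
    List.duplicate(%{gate_name: "call_entity", is_error: false}, 3) ++
    [%{gate_name: "call_entity", is_error: true}]

GateErrorRateSketch.breaching(window)
```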
Trace correlation
trace_id propagates through child cantrips via the parent context. A full
trace for a parent episode that spawns N child cantrips is:
trace_id = "<root-uuid>"
├─ [:cantrip, :entity, :start] entity_id=parent_id
│ ├─ [:cantrip, :turn, :start] turn_number=1
│ ├─ [:cantrip, :gate, :start] gate_name=call_entity → spawns child
│ │ ├─ [:cantrip, :entity, :start] entity_id=child_id (same trace_id)
│ │ ├─ [:cantrip, :turn, :start] turn_number=1
│ │ └─ [:cantrip, :entity, :stop] entity_id=child_id
│ ├─ [:cantrip, :gate, :stop] gate_name=call_entity
│ └─ [:cantrip, :turn, :stop] turn_number=1
└─ [:cantrip, :entity, :stop] entity_id=parent_id

All events in this tree carry the same trace_id. To correlate to external
systems (HTTP request IDs, job queue IDs, etc.), pass the external ID as
trace_id when running the top-level cantrip:
Cantrip.cast(cantrip, intent, trace_id: external_request_id)

ACP requests can use the protocol metadata channel. Put a non-empty string in
_meta.trace_id (or _meta.cantrip_trace_id) on session/new or
session/prompt; the Familiar ACP runtime stores it on the session and passes
it into Cantrip.summon/3 or Cantrip.send/3 so entity, turn, gate, usage,
child, and code events carry the caller's external trace ID. Other _meta
fields are ignored by Cantrip's ACP boundary; editor metadata cannot override
the configured LLM, loom path, or turn budget.
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "session/prompt",
  "params": {
    "sessionId": "sess_123",
    "_meta": {"trace_id": "http-request-abc"},
    "prompt": [{"type": "text", "text": "Inspect the failing test"}]
  }
}

When no external trace ID is supplied, Cantrip mints a fresh per-session entity trace ID.
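The lookup rule for ACP requests can be sketched as a small function over the decoded params map. This is not the Familiar ACP runtime's code: `AcpTraceIdSketch` is hypothetical, and checking trace_id before cantrip_trace_id is an assumption, since the text does not say which key wins when both are present.

```elixir
defmodule AcpTraceIdSketch do
  # Hypothetical helper; not the Familiar ACP runtime's implementation.
  # Returns {:ok, trace_id} for the first non-empty string found, else :none.
  # Key precedence ("trace_id" first) is an assumption.
  def from_params(params) when is_map(params) do
    meta = Map.get(params, "_meta", %{})
    # Treat a non-map _meta (for example null) as absent.
    meta = if is_map(meta), do: meta, else: %{}

    ["trace_id", "cantrip_trace_id"]
    |> Enum.map(&Map.get(meta, &1))
    |> Enum.find(&(is_binary(&1) and &1 != ""))
    |> case do
      nil -> :none
      trace_id -> {:ok, trace_id}
    end
  end
end

AcpTraceIdSketch.from_params(%{"_meta" => %{"trace_id" => "http-request-abc"}})
```

A :none result corresponds to the case where Cantrip mints its own trace ID.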
What is not emitted (and why)
- LLM provider request/response bodies. Too large and contain prompts. Use :telemetry.attach_many with your own redaction if you need partial visibility into provider traffic; do not log raw bodies.
- Loom record contents. The loom is the durable trace; subscribe to the loom directly via Cantrip.Loom API if you need turn-level data. Telemetry is for operational metrics, not data plane.
- Stack traces. Errors arrive as already-redacted observation strings. Unredacted stack traces stay internal.
Event Registry In Code
The runtime event registry is used by tests and documentation review. New telemetry surfaces should be added there first, then pinned by a regression test and documented in the table above.