Observability


A barrel_p2p cluster has three categories of telemetry you will want in production: membership transitions, authentication outcomes, and dist-layer events (broadcast, GC, migration). All of them go through one module, barrel_p2p_metrics, which in turn emits to the instrument library.

The principle: every emit site is wrapped in a try/catch. A misconfigured exporter cannot crash protocol code. If instrument is not running, the emit becomes a no-op.
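The guard described above can be pictured as a small wrapper. This is a minimal sketch, not the code in barrel_p2p_metrics: the module name safe_emit_sketch and the function safe/1 are hypothetical, and the emit is passed in as a fun so that no particular instrument call is assumed.

```erlang
-module(safe_emit_sketch).
-export([safe/1]).

%% Run an emit fun and swallow any error, exit or throw, so the
%% calling protocol code never sees a failure. Always returns ok.
-spec safe(fun(() -> term())) -> ok.
safe(Emit) when is_function(Emit, 0) ->
    try Emit() of
        _ -> ok
    catch
        _Class:_Reason -> ok
    end.
```

The trade-off is deliberate: a broken exporter costs you data points, never a protocol process.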

This document is the catalogue, organised by subsystem, plus a short guide on wiring an exporter.

Conventions

  • Instrument names are dot-separated and follow the pattern barrel_p2p.<subsystem>.<event>.
  • Attribute keys are atoms: peer, outcome, reason, role, from, target.
  • Counter values are integers; histogram values are milliseconds.
  • The instrument application must be started; otherwise every emit is silently a no-op.

Membership (HyParView)

| Name | Kind | Attributes | Fires when |
|---|---|---|---|
| barrel_p2p.hyparview.peer_up | counter | peer | A node enters the local active view |
| barrel_p2p.hyparview.peer_down | counter | peer, reason | A node leaves the active view |
| barrel_p2p.hyparview.joined | counter | - | The local node joined the cluster |
| barrel_p2p.hyparview.left | counter | - | The local node left the cluster |
| barrel_p2p.hyparview.shuffle | counter | target | The local node initiated a shuffle |
| barrel_p2p.hyparview.pending_timeout | counter | peer | A pending JOIN/CONNECT/NEIGHBOR backstop fired |

reason is normalised: an atom stays as-is, a {tag, _} tuple is reduced to its tag, anything else becomes other.
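The normalisation rule reads naturally as three function clauses. The sketch below is illustrative only: the module and function names are hypothetical, and the is_atom/1 guard on the tuple tag is an assumption made here so that the result is always an atom.

```erlang
-module(reason_sketch).
-export([normalise/1]).

%% An atom stays as-is.
normalise(Reason) when is_atom(Reason) -> Reason;
%% A {Tag, _} tuple is reduced to its tag (assumed to be an atom).
normalise({Tag, _}) when is_atom(Tag) -> Tag;
%% Anything else becomes the atom other.
normalise(_) -> other.
```

Collapsing everything else to other keeps the set of attribute values small, which matters for exporters that create one time series per distinct value.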

The pending_timeout counter is a backstop. A non-zero rate means some of your peers are responding to JOIN but never producing a peer_connected or peer_failed callback; this usually points at a network drop in one direction. The cluster recovers, but it is worth investigating.

Authentication

| Name | Kind | Attributes | Records |
|---|---|---|---|
| barrel_p2p.dist.auth.attempts | counter | role, outcome | One per handshake attempt |
| barrel_p2p.dist.auth.duration_ms | histogram | role, outcome | Handshake wall time, milliseconds |

role is outgoing (we dialed) or incoming (we accepted). outcome is ok or fail. A handshake that crashes counts as fail; the metric is recorded before the exception is re-raised, so you never lose an attempt.
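The "record, then re-raise" ordering can be sketched as follows. This is not the barrel_p2p implementation: auth_timing_sketch, timed/3 and the Record callback are hypothetical stand-ins for the handshake and the metric emit, and treating ok or {ok, _} as the only successful results is an assumption.

```erlang
-module(auth_timing_sketch).
-export([timed/3]).

%% Handshake is a 0-arity fun. Record is a 3-arity fun standing in for
%% the metric emit: Record(Role, Outcome, DurationMs).
timed(Role, Handshake, Record) ->
    T0 = erlang:monotonic_time(millisecond),
    try Handshake() of
        Result ->
            Record(Role, outcome(Result), elapsed(T0)),
            Result
    catch
        Class:Reason:Stack ->
            %% Record the failed attempt first, then re-raise with the
            %% original class, reason and stacktrace.
            Record(Role, fail, elapsed(T0)),
            erlang:raise(Class, Reason, Stack)
    end.

%% Assumed classification of handshake results.
outcome(ok) -> ok;
outcome({ok, _}) -> ok;
outcome(_) -> fail.

elapsed(T0) -> erlang:monotonic_time(millisecond) - T0.
```

Using monotonic time rather than wall-clock time keeps the duration non-negative even if the system clock is adjusted mid-handshake.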

A non-trivial fail rate is the signal worth alerting on. A pinned peer reconnecting with a new key, a wrong cookie, a clock skew beyond the configured window: all of these surface as fail.

Plumtree gossip

| Name | Kind | Attributes | Fires when |
|---|---|---|---|
| barrel_p2p.plumtree.gossip.sent | counter | - | Each GOSSIP frame placed on the wire |
| barrel_p2p.plumtree.gossip.received | counter | from | A GOSSIP frame arrives |
| barrel_p2p.plumtree.ihave.sent | counter | - | Each IHAVE frame placed on the wire |
| barrel_p2p.plumtree.graft.sent | counter | peer | A GRAFT request is sent |
| barrel_p2p.plumtree.prune.sent | counter | peer | A PRUNE notification is sent |

The sent counters are incremented by length(Peers) on each fanout, so the totals match the number of frames placed on the wire, not the number of broadcasts.

A reasonable health check: the ratio of graft.sent to gossip.received should be small. A high graft rate means lots of self-healing, which is usually a symptom of churn in the active view.
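As a rough illustration of that check, the sketch below computes the ratio from two counter deltas taken over the same time window. The module, the function names and the idea of a fixed MaxRatio threshold are all assumptions; a suitable threshold depends on your cluster and is not specified here.

```erlang
-module(plumtree_health_sketch).
-export([graft_ratio/2, healthy/3]).

%% Ratio of GRAFT requests sent to GOSSIP frames received.
%% Returns undefined when no gossip was received in the window.
graft_ratio(_GraftSent, 0) ->
    undefined;
graft_ratio(GraftSent, GossipReceived)
  when is_integer(GraftSent), is_integer(GossipReceived), GossipReceived > 0 ->
    GraftSent / GossipReceived.

%% True when the ratio is at or below MaxRatio. A window with no
%% received gossip carries no signal and is treated as healthy.
healthy(GraftSent, GossipReceived, MaxRatio) ->
    case graft_ratio(GraftSent, GossipReceived) of
        undefined -> true;
        Ratio -> Ratio =< MaxRatio
    end.
```

In practice the same ratio is usually expressed in the query language of your metrics backend; the function form is only meant to pin down the arithmetic.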

Idle dist-channel GC

| Name | Kind | Attributes | Fires when |
|---|---|---|---|
| barrel_p2p.dist_gc.reap | counter | peer | The reaper closes an idle dist channel |

A non-zero rate is normal. It means Pid ! Msg opened ad-hoc dist channels that no one used afterwards. A sustained burst suggests the sweep period or min_age tuning is too aggressive for your workload, or that your application opens ad-hoc dist channels too often.

Connection migration

| Name | Kind | Attributes | Fires when |
|---|---|---|---|
| barrel_p2p.dist.migrate | counter | peer, outcome | A call to barrel_p2p:migrate_peer/1,2 |

outcome is ok when path validation succeeded, otherwise fail. If you wrote a custom trigger (see migration.md), the peer attribute tells you which peer the trigger acted on.

Router and service proxy

| Name | Kind | Attributes | Fires when |
|---|---|---|---|
| barrel_p2p.router.request_dropped | counter | - | A route request was refused (cap reached) |
| barrel_p2p.service_proxy.cast_dropped | counter | - | An overlay cast was refused (cap reached) |

These are operator signals. A non-zero rate means the router or a proxy is hitting its in-flight cap. If sustained, raise router_max_in_flight or proxy_cast_max_in_flight in sys.config.
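A sketch of what raising the caps might look like in sys.config. The two key names come from the paragraph above; placing them under the barrel_p2p application and the values shown are assumptions, so check the defaults in your release before copying.

```erlang
[
  {barrel_p2p, [
      %% Illustrative values only; the real defaults are not stated here.
      {router_max_in_flight, 2048},
      {proxy_cast_max_in_flight, 2048}
  ]}
].
```

Raise the caps in steps and watch the drop counters: a cap that has to keep growing usually means the downstream consumer is the real bottleneck.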

Streams demultiplexer

| Name | Kind | Attributes | Fires when |
|---|---|---|---|
| barrel_p2p.streams.preamble_dropped | counter | - | An inbound stream was reset for not completing the tag preamble |

A non-zero rate suggests a buggy peer is opening streams without sending the tag preamble. In production this should be zero.

Wiring an exporter

instrument does not ship a standalone HTTP server. It gives you the building blocks; you wire them into whatever HTTP layer your release already uses.

Prometheus

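instrument_prometheus is a formatter, not a server. Two functions matter:

  • instrument_prometheus:format/0 returns the current metrics as the response body.
  • instrument_prometheus:content_type/0 returns the value for the content-type header.

A minimal cowboy handler: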

-module(my_metrics_handler).
-export([init/2]).

%% Cowboy handler: answer every request with the current metrics,
%% formatted for Prometheus.
init(Req0, State) ->
    Body = instrument_prometheus:format(),
    Headers = #{<<"content-type">> => instrument_prometheus:content_type()},
    Req = cowboy_req:reply(200, Headers, Body, Req0),
    {ok, Req, State}.

Wire it in your router and point your Prometheus scraper at the resulting endpoint.
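If your release does not yet have a cowboy listener, the wiring could look like the sketch below. It assumes Cowboy 2.x; the module name my_metrics_http, the listener name metrics_http, the /metrics path and port 9568 are arbitrary choices made for illustration.

```erlang
-module(my_metrics_http).
-export([routes/0, start/0]).

%% Route table: any host, /metrics is served by my_metrics_handler.
routes() ->
    [{'_', [{"/metrics", my_metrics_handler, []}]}].

%% Start a plain-HTTP listener with the compiled route table.
%% Requires the cowboy application to be started.
start() ->
    Dispatch = cowboy_router:compile(routes()),
    cowboy:start_clear(metrics_http,
                       [{port, 9568}],
                       #{env => #{dispatch => Dispatch}}).
```

If you already run a cowboy listener, adding the {"/metrics", my_metrics_handler, []} tuple to its existing route list is enough; a second listener is only useful when you want metrics on a separate port.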

Barrel P2P emits as soon as its supervision tree is up. Make sure instrument is in your release applications list (it is pulled in as a transitive dependency of barrel_p2p, so you usually do not have to add it explicitly).

OTLP

OTLP export is configured through the instrument application env or through the standard OTEL_* environment variables. The canonical setup lives in the upstream instrument README; barrel_p2p does not add or replace any of it.

A typical sys.config with the instrument entry:

[
  {instrument, [
      {service_name, <<"my_barrel_p2p_node">>}
  ]}
].

Combined with OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318 in the node's environment, this is enough for the metrics to flow.

What to alert on

A short list of metrics that tend to matter in production:

  • barrel_p2p.dist.auth.attempts{outcome=fail} rate. Sustained failures point to a misconfiguration (wrong cookie, wrong proto_dist), a clock issue, a rotation in progress, or an active attack. In every case they warrant a human's attention.
  • barrel_p2p.hyparview.peer_down{reason=nodedown} spikes. A burst of node-downs usually means a network event. The cluster recovers, but the spike is the trigger for investigation.
  • barrel_p2p.dist_gc.reap rate vs steady-state baseline. A sudden change either way is worth looking at: a high rate suggests an application opening too many ad-hoc dist channels; a low rate after a baseline of activity may mean channels are not being released.
  • barrel_p2p.dist.auth.duration_ms p95. A creeping p95 is an early signal that the cluster is loaded or that the authentication code path is contending on file I/O (the keypair is read from disk per attempt).