Internals
View SourceThis document describes how barrel_p2p works underneath the API. It is meant for two readers: Erlang developers who have followed the getting started guide and the practice handbook, and contributors planning a change to the protocols.
We move from the layers closest to the application code down to the dist carrier and the credentials, then describe the few in-tree services that keep the whole picture together.
Layered view
+----------------------------------------------------------+
| Application |
| barrel_p2p:register_service/2 whereis_service Pid ! Msg|
+------------------------+---------------------------------+
|
+------------------------+---------------------------------+
| barrel_p2p.erl |
| (public API: thin wrappers over the layers below) |
+----+--------+--------------+---------------+-------------+
| | | |
v v v v
+--------+ +--------+ +---------------+ +-----------+
|HyParView| |Registry| | Plumtree | | Streams |
|Membership|(OR-Map)| |Gossip broadcast| | multiplex|
+--------+ +--------+ +---------------+ +-----------+
|
v
+------------------+
| barrel_p2p_dist |
| (proto_dist |
| shim + cert |
| + defaults) |
+--------+---------+
|
+--------+---------+
| quic_dist | <-- upstream
| (auth callback,|
| discovery, port|
| pinning, etc.) |
+--------+---------+
|
+--------+---------+
| erlang_quic | <-- transport
+------------------+The application only ever calls into barrel_p2p.erl. Every other
module in the diagram is internal; you are free to read its code,
but you should not depend on its names from outside.
Supervision tree
A running barrel_p2p node has the following supervision shape:
barrel_p2p_sup (one_for_one)
|
+- barrel_p2p_hlc hybrid logical clock for CRDT timestamps
|
+- barrel_p2p_dist_keys trust store (ETS) for Ed25519 pins
|
+- barrel_p2p_hyparview_sup (rest_for_one)
| +- barrel_p2p_hyparview HyParView state machine
| +- barrel_p2p_hyparview_events subscriber bus for peer_up/peer_down
| +- barrel_p2p_hyparview_shuffle periodic view exchange timer
| +- barrel_p2p_hyparview_cleanup passive view aging timer
|
+- barrel_p2p_plumtree_sup
| +- barrel_p2p_plumtree epidemic broadcast tree
|
+- barrel_p2p_registry_sup (rest_for_one)
| +- barrel_p2p_registry local registry + CRDT
| +- barrel_p2p_registry_replica replication driver (barrel_p2p_replica)
|
+- barrel_p2p_leader cluster-wide singleton election
+- barrel_p2p_leader_replica election replication (barrel_p2p_replica)
|
+- barrel_p2p_shard sharded placement (HRW ring)
+- barrel_p2p_members_replica lease-based live-node set (barrel_p2p_replica)
+- barrel_p2p_reminder durable reminders (owner-fires-once)
+- barrel_p2p_reminder_replica reminder store (barrel_p2p_replica)
|
+- barrel_p2p_proxy_sup (simple_one_for_one)
| +- barrel_p2p_service_proxy * N per-name remote proxies
|
+- barrel_p2p_router service overlay routing cache
+- barrel_p2p_streams tagged user-stream demultiplexer
+- barrel_p2p_bridge dist connection bookkeeping
+- barrel_p2p_dist_gc idle dist channel reaperThe vertical structure is not random. barrel_p2p_hlc and
barrel_p2p_dist_keys start first because every other subsystem
depends on monotonic timestamps and on the trust store being
available. The HyParView subtree is restart-as-a-block
(rest_for_one): if the state machine crashes, the event bus,
shuffle timer, and cleanup timer all restart together, which keeps
the cluster's view of the node coherent.
HyParView: how membership stays bounded
HyParView is a partial-view membership protocol. Each node holds two sets:
- The active view, a small bounded set (default 5) of peers this node currently exchanges gossip with. Active links are symmetric: if A has B in its active view, B has A in its active view.
- The passive view, a larger bounded cache (default 30) of known peers that are not currently active. Members of the passive view are warm spares used when an active link drops.
The key property is that the active view does not need to grow with the cluster. A node with thousands of peers in its passive view still keeps only five active links; messages reach the rest of the cluster by being forwarded along the active links of other nodes.
Read this graph from node A. The active view is the maintenance topology. The passive view is a reserve of known peers. Neither is the full set of nodes your Erlang code may eventually talk to.

Joining the cluster
When a new node joins, it sends a JOIN message to one contact
node it knows about (either explicitly through barrel_p2p:join/1
or implicitly through the contact_nodes config key):
NewNode ----JOIN----> ContactNode
|
+--- adds NewNode to its active view
|
+--- sends FORWARD_JOIN(TTL=ARWL) to a
random active peer (and so on, until
TTL=0)Forwarding has two purposes. First, it spreads knowledge of the new node across the cluster without flooding. Second, when TTL reaches zero, the receiving node adds the new node to its own passive view, which becomes a warm spare for future use.
The TTL (arwl, default 6) controls how widely the new node
propagates. A smaller value means a more local join; a larger
value spreads farther but costs more messages.
Failure handling
A node failure is detected through the dist channel:
net_kernel reports a nodedown event, which barrel_p2p translates
into a HyParView failure. The protocol then:
- Moves the failed peer from active to a transient "failed" state with an exponential backoff timer.
- After
max_fail_countconsecutive failures (default 5), moves the peer to the passive view. - Picks a replacement from the passive view and sends a
NEIGHBORrequest. If the candidate accepts, the active view is back to its target size.
Backoff prevents reconnection storms during network partitions: if half the cluster becomes unreachable at once, each surviving node retries its neighbours on staggered timers rather than all at once.
Shuffle: keeping passive views fresh
Periodically (every shuffle_period ms, default 10s), each node
picks a random active peer and exchanges a small sample of its
known peers. This is how new members reach corners of the cluster
that did not see the original FORWARD_JOIN, and how the passive
view stays large enough that there is always a spare to replace a
failed active peer.
The exchange is bounded: a fixed-size random sample, never the full view.
Plumtree: broadcasting changes efficiently
Plumtree (Push-Lazy-Push Multicast Tree) is the protocol that moves registry updates and other gossip across the cluster. The input is "broadcast this message"; the output is that every peer eventually sees the message exactly once, even under churn.
The idea: each node classifies its active-view peers into two sets.
- Eager peers receive the full message body.
- Lazy peers receive only the message identifier (an
IHAVEannouncement); they fetch the body only if they have not seen it.
The first time a message goes out, all peers are eager. When a
duplicate arrives (because some other path got there first), the
recipient sends a PRUNE, demoting the sender to lazy. The tree
self-organises into a spanning structure where each message
flows through one path; the lazy backups cover the case where a
node drops mid-flight.
If a node receives an IHAVE for a message it never sees, it
sends GRAFT to the lazy peer, promoting it back to eager so the
message is re-pushed. This is the self-healing path.
The two interesting consequences:
- Broadcast cost is O(n) messages, not O(n log n) or worse.
- A single peer failure costs at most one
GRAFT/PRUNEexchange, not a full re-flood.
The service registry: an OR-Map CRDT
The service registry is the part of barrel_p2p that requires the most thought to use correctly. We use a CRDT because we want registration to work without coordination: any node can register a service at any time, and all nodes converge to the same view without locking.
The data structure is an Observed-Remove Map (OR-Map). A few properties worth stating explicitly:
- Add and remove commute. Two concurrent adds of the same name produce two entries; if a third node later removes the name, only the additions visible to that third node are removed.
- Tombstones are bounded. Each remove carries the set of dots it observed; once every node has applied the remove, the dots can be discarded.
- Causal merging. Two replicas merge by union of dots plus the rule that a tombstoned dot stays tombstoned.
Each add gets a unique dot, which is a {node, hlc_timestamp}
pair. The hybrid logical clock guarantees that two adds from the
same node are ordered, and that two adds from different nodes can
be compared causally.
-type dot() :: {node(), barrel_p2p_hlc:timestamp()}.
-type or_map() :: #{
Key => {
dots :: sets:set(dot()),
values :: [{dot(), Value}]
}
}.Operationally: when you call register_service/2, the registry
adds an entry with a fresh dot, and its barrel_p2p_replica
instance broadcasts the delta over Plumtree. When the broadcast reaches
peer B, B merges the delta into its local OR-Map; from then on
B's whereis_service/1 can find the new registration.
Sharded placement and the live-node set
barrel_p2p_shard answers "which node should own this key" with
rendezvous (HRW) hashing over a replicated live-node set. The set is
NOT the bounded HyParView active view and is not driven by peer_down
(which is active-view churn, not cluster death). Instead each node
gossips a periodic heartbeat carrying its wall-clock time through its
own barrel_p2p_replica instance (barrel_p2p_members_replica); a node is
in the ring while its lease is fresh (Now - heartbeat =< member_ttl_ms),
and heartbeats too far in the future are rejected so a fast clock cannot
pin a dead node. The set converges without tombstones: a stale entry in
a full-sync is already expired by its timestamp.
The live member list is published to a read-concurrency ETS table, so
place/1, owners/2, is_owner/1, and partition/1 are lock-free
local reads. When the live set changes, the shard diffs the partitions
this node owns and emits {barrel_p2p_shard, {acquired | released, P}}
to subscribers, which is how consumers hand off partitioned state.
Durable reminders
barrel_p2p_reminder layers fire-at-most-once timers on placement. A
reminder Key => {FireAt, Payload, Version} lives in its own
barrel_p2p_replica instance (barrel_p2p_reminder_replica), so every node
holds it and it survives the node that armed it. The owner of a reminder
is barrel_p2p_shard:place(Key); only the owner arms a local
erlang:send_after and fires. The timer is a versioned hint: on timeout
the owner re-checks that the reminder still exists, still names the same
version, is still owned here, and is actually due. Firing is
tombstone-first (gossip the removal, then deliver
{barrel_p2p_reminder, Key, Payload, Fence} locally), and ownership
events plus a periodic reminder_scan_ms sweep re-arm a survivor's
inherited reminders. The result is exactly-once in steady state and
best-effort under churn or a crash at the fire instant; Fence (the
packed version) lets a handler dedup.
Hybrid logical clocks
Standard wall-clock timestamps can move backwards if NTP corrects a drifting clock, and they cannot order events from different nodes consistently. A pure logical clock (a Lamport clock) orders events but loses the connection to physical time. Barrel P2P uses hybrid logical clocks (HLC), which combine the two.
The shape is {wall_ms, logical}:
wall_msis the current wall time in milliseconds.logicalis a counter that breaks ties when two events share a wall-time.
The clock has two operations:
barrel_p2p_hlc:now/0produces the next local timestamp, ensuring monotonic progress relative to the previous local timestamp.barrel_p2p_hlc:update/1accepts a timestamp from a peer and advances the local clock to be greater than both the local reading and the peer's timestamp.
HLC timestamps serve two roles in barrel_p2p:
- They are the
dotcomponent in OR-Map adds. This is where causality across nodes lives. - They are exposed to applications that need cluster-wide
ordering without coordinating; the
barrel_p2p_hlcmodule is public.
Authentication: Ed25519 over a QUIC stream pair
Authentication runs between the QUIC TLS handshake and the Erlang dist handshake. The flow is a mutual challenge-response on a pair of unidirectional QUIC streams:
Node A (client) Node B (server)
| |
| open uni stream 2 |
|-- HELLO(node_A, pubkey_A) ----------------->|
| |
| open uni stream 3
|<------------------------ HELLO(node_B, pubkey_B)
| |
|-- CHALLENGE(nonce_A, wall_ts_A) ----------->|
|<------------ CHALLENGE(nonce_B, wall_ts_B) |
| |
|-- RESPONSE(sign(nonce_B, ts_B, pubkey_A)) ->|
|<----------- RESPONSE(sign(nonce_A, ts_A, pubkey_B))
| |
|-- OK -------------------------------------->|
|<----------------------------------------- OK
v v
Erlang dist handshake (cookie)The signed message includes the responder's own public key. That binds the signature to the identity the responder claims, so a peer cannot relay a signature to impersonate someone else.
Trust modes operate on what happens when the presented public key is not the one already pinned for that node atom:
| Mode | No pin yet | Pin matches | Pin differs |
|---|---|---|---|
tofu | accept and pin | accept | reject |
strict | reject | accept | reject |
The trust store lives on disk under data/keys/trusted/,
one file per peer (<node-atom>.pub). Writes are atomic
(write-then-rename) and use 0600 permissions; see
authentication.md for the full lifecycle.
The dist carrier: barrel_p2p_dist + quic_dist
barrel_p2p_dist is the proto_dist module Erlang loads when you
boot with -proto_dist barrel_p2p. It is intentionally a thin
shim: the bulk of the QUIC transport is upstream
quic_dist.
What barrel_p2p_dist:listen/1 does, in order:
- Ensures TLS material. If
data/quic/node.crtanddata/quic/node.keyexist, it uses them; otherwise it generates a self-signed pair viabarrel_p2p_quic_cert. - Projects defaults into the
quic.distapp env. Setsauth_callback => {barrel_p2p_dist_auth_callback, authenticate},discovery_module => barrel_p2p_discovery, and the cert/key paths. User-supplied values under{quic, [{dist, [...]}]}insys.configalways win; this step only fills gaps. - Validates the projected config. If
auth_enabledistruebut the projectedauth_callbackisundefined(because the user explicitly nulled it), boot fails loudly rather than silently shipping an unauthenticated cluster. - Delegates to
quic_dist:listen/1.
For outgoing connections, the setup/5 callback runs the
auth-callback in the same process that initiated the QUIC connect,
so we can pass the dialed node atom to the callback through the
process dictionary. This is needed so the client side of the
handshake can check cookie_only_nodes for the target (the node
we asked to connect to) rather than the peer's self-reported
identity.
Discovery
The default discovery_module is barrel_p2p_discovery, a
composing dispatcher: it asks each backend in a chain until one
returns a hit, then caches the result. The default backend chain
is:
- Static. Reads
{quic, [{dist, [{nodes, [...]}]}]}from sys.config. The shape is{NodeAtom, {Host, Port}}. Useful in docker-compose, in tests, and any environment with a fixed topology. - File. Reads
data/discovery/<node>.endpointfiles. Useful on a single host where every node writes its own endpoint to a shared directory. - DNS. Resolves
<host>portions of node atoms via DNS. Useful in environments with proper DNS plumbing.
You can replace the chain entirely by setting
barrel_p2p.discovery_backends in sys.config.
The idle dist-channel GC
Pid ! Msg to any cluster node works through OTP's demand-driven
auto-connect. That can open a dist channel that the application
then never uses again, which would accumulate over time. The dist
GC reaps such channels.
The flow below is the reason the GC exists. A service lookup may return a pid on a node outside the local active view. Sending to that pid opens an authenticated QUIC dist channel. If the application does not keep using it, Barrel P2P closes it later.

The predicate is conservative. A channel is eligible for reaping when all of the following hold:
- The peer is not in the local HyParView active view.
quic_dist:list_streams/1returns the empty list (no live user streams are riding the dist channel).- The channel is older than
dist_gc_min_age_ms(default 5 minutes).
A reaped channel is closed cleanly. If the application sends to the same peer later, a new dist channel opens on demand.
This GC is unconditionally on. The decoupled-from-active-view design relies on its presence; see features.md for the stability tier.
Service overlay routing and proxies
whereis_service/1 resolves a service name in three steps:
- Look up locally.
- Look up in the local cache (populated by gossip).
- If neither, ask through the overlay.
The overlay step uses barrel_p2p_router: it sends a route request
to a random active peer with a TTL, which forwards in turn until
a peer finds the service or the TTL runs out. The path is cached
on success so subsequent lookups are direct.
When the resolved service is on a remote node, whereis_service/2
optionally hands you a service proxy: a local pid that
forwards gen_server calls and casts to the remote service over
the dist channel. The proxy is what makes
{via, barrel_p2p, Name} registrations transparent: a caller can
gen_server:call({via, barrel_p2p, my_service}, request) and the
proxy handles the remote dispatch.
Proxies are reference-counted and reaped when the remote service goes down.
Tagged-stream multiplex
The dist channel between two peers is multiplexed: the Erlang
dist control stream is one QUIC bidirectional stream, but the
application can open additional streams alongside it. Barrel P2P
exposes that as barrel_p2p_streams, a single demultiplexer per
node.
Wire format: every barrel_p2p-managed user stream starts with
<<TagLen:8, Tag:TagLen/binary, Payload/binary>>The demuxer reads the first 1 + TagLen bytes, looks up the
registered acceptor for the tag, and hands the stream to that
acceptor. From then on the acceptor owns the stream and uses
quic_dist:send/2 / close_stream/1 directly; the demuxer is
off the data path.
The cap on parked-but-not-yet-dispatched streams is small (currently 64); a peer that opens many streams and drips bytes without ever completing the tag preamble has its excess streams reset.
Connection migration
A single QUIC connection can rebind to a new local UDP 4-tuple
(NIC change, IP change, default-route change) without losing
keys, streams, or ordering. barrel_p2p:migrate_peer/1,2 exposes
that primitive as a synchronous call.
The decision of when to migrate is the application's. Barrel P2P
provides the trigger and the path statistics
(barrel_p2p_path_stats:srtt/1); a watchdog can poll, evaluate, and
call.
The motivating cases are: a mobile node moves between Wi-Fi and cellular; a server's outbound IP changes because of a CGNAT shuffle; a peer is being routed through a different relay. See migration.md for the recipe.
Observability
Every metric barrel_p2p emits goes through barrel_p2p_metrics. The
catalog is in observability.md; the design
note for this document is: emit sites are wrapped in a
try/catch, so a misconfigured exporter cannot crash protocol
code.
Reading the source
If you want to follow a code path end-to-end, the natural seams are:
- A
barrel_p2p:join/1call. Start inbarrel_p2p_hyparview'shandle_call({join, ...})and follow thebarrel_p2p_bridgerequest, the QUIC connect, the auth callback, the dist handshake, thepeer_upevent. - A
register_service/2call. Start inbarrel_p2p_registry'shandle_call({register, ...})and follow the OR-Map add, thebarrel_p2p_replicabroadcast, thebarrel_p2p_plumtreefanout, and the merge on a remote node. - A
whereis_service/1call. Start inbarrel_p2p.erl, see how the local lookup is tried first, then the cache, then the overlay route request.
The test suites that exercise each path are named after the
module under test (test/barrel_p2p_hyparview_SUITE.erl,
test/barrel_p2p_registry_SUITE.erl, test/barrel_p2p_router_SUITE.erl,
etc.). testing.md lists them and explains the
docker-only suite for the full transport behaviour.