Distributed Familiar

Cantrip's distributed mode builds on ordinary BEAM distribution. Cantrip does not discover clusters for you: start named nodes, share an Erlang cookie, and connect the nodes. Cantrip then uses those nodes for Mnesia loom replication and remote child cantrips.

Node Setup

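Run each host as a named node with the same cookie. The names host-a and host-b are placeholders: with --name, the host part must be a fully qualified domain name or an IP address, so use --sname instead if your hosts only have short names: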

iex --name analysis@host-a --cookie "$CANTRIP_COOKIE" -S mix
iex --name agents@host-b --cookie "$CANTRIP_COOKIE" -S mix

Connect nodes using your deployment's normal mechanism:

Node.connect(:"agents@host-b")

Cluster discovery is deliberately out of scope. libcluster, Kubernetes headless services, static config, or manual Node.connect/1 all work as long as the BEAM nodes can reach each other and authenticate with the same cookie.
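For the static-config case, a minimal sketch of connecting to a fixed peer list could look like the following. ClusterSketch and connect_all/1 are hypothetical names used for illustration, not part of Cantrip; only Node.connect/1 is a real API here.

```elixir
defmodule ClusterSketch do
  # Hypothetical helper, not part of Cantrip: connect to a static list of
  # peers and report which ones could not be reached.
  @spec connect_all([node()]) :: %{connected: [node()], failed: [node()]}
  def connect_all(peers) do
    {connected, failed} =
      Enum.split_with(peers, fn peer ->
        # Node.connect/1 returns true on success, false on failure, and
        # :ignored when the local node is not running in distributed mode.
        Node.connect(peer) == true
      end)

    %{connected: connected, failed: failed}
  end
end

# Usage, with the node name from the setup commands above:
# ClusterSketch.connect_all([:"agents@host-b"])
```

Treating :ignored as a failure makes it obvious when the local VM was started without --name or --sname.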

Replicated Mnesia Loom

Once nodes are connected, join Mnesia to the remote DB node and replicate the loom table:

table = :cantrip_familiar_loom
nodes = [:"agents@host-b"]

{:ok, _connected} = Cantrip.Cluster.connect_mnesia(nodes)
:ok = Cantrip.Cluster.replicate_table(table, nodes, copy_type: :disc_copies)

{:ok, familiar} =
  Cantrip.Familiar.new(
    llm: llm,
    root: File.cwd!(),
    loom_storage: {:mnesia, table: table}
  )

connect_mnesia/2 wraps :mnesia.change_config(:extra_db_nodes, nodes). replicate_table/3 converts the local table copy and adds remote table copies. Use copy_type: :ram_copies for ephemeral test clusters; use :disc_copies for durable deployment nodes.
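To make the description above concrete, here is a rough sketch of the same two steps written directly against Mnesia's public API. MnesiaSketch and its functions are hypothetical stand-ins; Cantrip's actual implementation may differ, for example in how it handles options or partial failures.

```elixir
defmodule MnesiaSketch do
  # Hypothetical stand-ins for the Cantrip.Cluster helpers, written directly
  # against Mnesia's public API. Cantrip's real implementation may differ.

  # Tell the local Mnesia about remote DB nodes.
  # Returns {:ok, reachable_nodes} or {:error, reason}.
  def connect(nodes), do: :mnesia.change_config(:extra_db_nodes, nodes)

  # Convert the local copy to `copy_type`, then add a copy on each remote node.
  def replicate(table, nodes, copy_type) do
    with :ok <- settle(:mnesia.change_table_copy_type(table, node(), copy_type)) do
      Enum.reduce_while(nodes, :ok, fn remote, :ok ->
        case settle(:mnesia.add_table_copy(table, remote, copy_type)) do
          :ok -> {:cont, :ok}
          {:error, _reason} = error -> {:halt, error}
        end
      end)
    end
  end

  # Treat "already in the requested state" as success so repeated calls
  # are harmless.
  defp settle({:atomic, :ok}), do: :ok
  defp settle({:aborted, {:already_exists, _table, _node, _type}}), do: :ok
  defp settle({:aborted, {:already_exists, _table, _node}}), do: :ok
  defp settle({:aborted, reason}), do: {:error, reason}
end
```

Note that Mnesia only allows :disc_copies tables on a node whose schema is itself stored on disc. This sketch does not handle that step.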

The launcher mix cantrip.familiar already promotes the current BEAM to a workspace-stable node when using the default Mnesia loom. In a cluster, start with explicit node names and cookies so all nodes agree on identity.

Remote Child Cantrips

Child cantrip configs may include :node. When the node is remote, Cantrip.new/1 builds the child on that node with a bounded RPC call, and Cantrip.cast/3 runs the episode on that node. Parent observations still receive the child result and loom turns, so the local Familiar's loom keeps the delegation trace.

{:ok, reader} =
  Cantrip.new(%{
    node: :"agents@host-b",
    identity: %{system_prompt: "Read files and return concise excerpts."},
    circle: %{type: :code, gates: ["read_file", "done"], wards: [%{max_turns: 2}]}
  })

{:ok, text, reader, child_loom, meta} =
  Cantrip.cast(reader, "Read README.md")

From the Familiar's code medium, the same shape works:

{:ok, reader} = Cantrip.new(%{
  node: :"agents@host-b",
  identity: %{system_prompt: "Read README.md and return the first paragraph."},
  circle: %{type: :code, gates: ["read_file", "done"], wards: [%{max_turns: 2}]}
})

{:ok, paragraph, _reader, _loom, _meta} = Cantrip.cast(reader, "Read README.md")
done.(paragraph)

Remote casts intentionally do not stream local process events across nodes in this first version. The request/response result and child loom are returned; fire-and-forget inter-entity messaging remains future work.

Remote RPC calls use the application environment key :rpc_timeout under the :cantrip application and default to 30 seconds:

Application.put_env(:cantrip, :rpc_timeout, 30_000)
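As an illustration of how such a timeout can bound a remote call, the sketch below reads the same key and passes it to :rpc.call/5. RpcSketch and bounded_call/4 are hypothetical names; this is an assumption about the general shape, not Cantrip's own code.

```elixir
defmodule RpcSketch do
  # Hypothetical wrapper, not Cantrip's code: run apply(mod, fun, args) on
  # `target` and give up after the configured timeout.
  def bounded_call(target, mod, fun, args) do
    # Same application key and default as documented above.
    timeout = Application.get_env(:cantrip, :rpc_timeout, 30_000)

    case :rpc.call(target, mod, fun, args, timeout) do
      # :rpc reports node-down, timeout, and remote crashes as
      # {:badrpc, reason} rather than raising in the caller.
      {:badrpc, reason} -> {:error, reason}
      result -> {:ok, result}
    end
  end
end

# Usage against the local node:
# RpcSketch.bounded_call(node(), Kernel, :+, [1, 2])
```

Raising the timeout trades faster failure detection for tolerance of slow remote episodes.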

Unknown string node names fail closed. A string node name is accepted only when it names this node, is already present in Node.list/0, or already exists as an atom in the VM. Connect the node before passing its string form across a serialized Familiar boundary.
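The rule above can be sketched with String.to_existing_atom/1, which never creates new atoms. NodeNameSketch and resolve/1 are hypothetical names, and the :unknown_node error term is an assumption; Cantrip's real error shape is not shown in this guide.

```elixir
defmodule NodeNameSketch do
  # Hypothetical helper illustrating the fail-closed rule; not Cantrip's code.
  @spec resolve(String.t()) :: {:ok, node()} | {:error, :unknown_node}
  def resolve(name) when is_binary(name) do
    # This node, or a peer already present in Node.list/0.
    known = Enum.find([node() | Node.list()], &(Atom.to_string(&1) == name))

    if known do
      {:ok, known}
    else
      # Otherwise accept only atoms that already exist in the VM.
      existing_atom(name)
    end
  end

  defp existing_atom(name) do
    # String.to_existing_atom/1 raises ArgumentError for unknown atoms
    # instead of creating them, so arbitrary input cannot grow the atom table.
    {:ok, String.to_existing_atom(name)}
  rescue
    ArgumentError -> {:error, :unknown_node}
  end
end
```

Atoms are never garbage collected, so converting untrusted strings with String.to_atom/1 could exhaust the atom table. Failing closed avoids that risk.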

Trust Boundary

Every node in a distributed Erlang cluster is fully trusted. A connected peer with the Erlang cookie can execute code on the node and can bypass Cantrip wards by operating below the Cantrip API. Treat the cookie and network reach as the trust boundary; do not cluster Cantrip nodes across tenants or trust domains.

Failure Modes

Cantrip bounds remote Cantrip.new/1 and Cantrip.cast/3 calls with :rpc.call/5, so a call to a wedged peer fails with an error instead of hanging the caller forever. Node-down, timeout, and remote exception failures are returned as ordinary {:error, reason, next_cantrip} or {:error, reason} shapes, depending on whether a reusable cantrip handle already exists.
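Callers that pattern-match only on the success tuple will crash on these failures. One way to handle all documented shapes in a single place is sketched below. CastResultSketch and normalize/1 are hypothetical names, and the concrete reason terms are not specified in this guide, so the sketch passes them through untouched.

```elixir
defmodule CastResultSketch do
  # Hypothetical helper: collapse the documented result shapes into one form,
  # keeping the reusable cantrip handle when the failure returned one.
  def normalize({:ok, text, cantrip, _child_loom, _meta}), do: {:ok, text, cantrip}
  def normalize({:error, reason, next_cantrip}), do: {:error, reason, next_cantrip}
  # No handle came back, so the caller has to build a new cantrip.
  def normalize({:error, reason}), do: {:error, reason, nil}
end

# Usage (requires Cantrip and a connected peer):
# case CastResultSketch.normalize(Cantrip.cast(reader, "Read README.md")) do
#   {:ok, text, _reader} -> text
#   {:error, _reason, nil} -> :rebuild_with_cantrip_new
#   {:error, _reason, _reader} -> :retry_with_returned_handle
# end
```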

Mnesia replication still follows Mnesia's operational model. Network partitions can produce divergent disc_copies; recovery policy is an operator concern, not automatic conflict resolution inside Cantrip. For audit-trail looms, prefer a topology that avoids multi-writer partitions, monitor Cantrip.Cluster.connect_mnesia/2 and replicate_table/3 failures, and verify table health after reconnects before relying on the replicated loom as a canonical record.
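A basic post-reconnect health check can be sketched with Mnesia's own introspection calls. LoomHealthSketch and check/2 are hypothetical names and the error terms are illustrative. The check only confirms that the table is loaded and that every replica holder is running; it does not detect divergent contents.

```elixir
defmodule LoomHealthSketch do
  # Hypothetical check, not part of Cantrip: verify that `table` is loaded
  # and that every node holding a copy is a running Mnesia node.
  def check(table, timeout \\ 5_000) do
    case :mnesia.wait_for_tables([table], timeout) do
      :ok ->
        # Collect replica holders across all storage types.
        holders =
          Enum.flat_map([:ram_copies, :disc_copies, :disc_only_copies], fn kind ->
            :mnesia.table_info(table, kind)
          end)

        case holders -- :mnesia.system_info(:running_db_nodes) do
          [] -> :ok
          missing -> {:error, {:replicas_down, missing}}
        end

      {:timeout, tables} ->
        {:error, {:not_loaded, tables}}

      {:error, reason} ->
        {:error, reason}
    end
  end
end

# Usage, with the table name from the replication example:
# LoomHealthSketch.check(:cantrip_familiar_loom)
```

To learn about partitions themselves, a process can call :mnesia.subscribe(:system) and watch for {:inconsistent_database, context, node} events, which Mnesia emits when it detects that nodes ran independently.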