DurableServer behaviour (durable_server v0.1.1)
DurableServer provides durable, distributed GenServer processes backed by pluggable storage.
DurableServer implements fault-tolerant, stateful processes that can survive node failures, restarts, and deployments by automatically persisting state to storage and coordinating across a distributed cluster.
Key Features
- Durable state: Automatically persists state to storage with configurable sync intervals
- Cluster coordination: Uses distributed registry for process discovery and health monitoring
- Capacity-aware placement: Monitors CPU, memory, and disk usage to route new processes to nodes with available capacity
- Sticky placement: Environment variable-based placement preferences (e.g., same machine, same region via FLY_REGION, etc.) with time-gated fallback to preferred nodes
- Automatic recovery: Failed processes are detected and restarted across the cluster
- Graceful shutdown: Ensures state is synchronized before termination via DurableServer.Terminator
Architecture
DurableServers must be started through DurableServer.Supervisor, which provides:
- Prefix-based isolation between different supervisor instances
- Graceful shutdown coordination via Terminator GenServer
- Automatic lifecycle management and restart capabilities with coordination across the cluster
See DurableServer.Supervisor for supervisor setup and configuration options.
Basic Usage
defmodule MyCounterServer do
  use DurableServer, vsn: 1

  def dump_state(state) do
    %{count: state.count}
  end

  def load_state(_old_vsn, %{"count" => count} = _dumped_state) do
    %{count: count}
  end

  def init(%{count: count} = state) do
    IO.puts("Starting with count #{count}")
    {:ok, Map.merge(state, %{started_at: DateTime.utc_now()}), permanent: true}
  end

  def handle_call(:increment, _from, state) do
    new_state = %{state | count: state.count + 1}
    {:reply, new_state.count, new_state}
  end

  def handle_call(:get_count, _from, state) do
    {:reply, state.count, state}
  end

  def handle_call(:reset, _from, state) do
    {:reply, :ok, %{state | count: 0}}
  end
end
# Start the supervisor first (typically in your application.ex supervision tree):
children = [
  ...,
  {DurableServer.Supervisor, name: MyDurableSup, prefix: "durable/"}
]

# or start directly if you simply want to demo:
{:ok, supervisor_pid} =
  DurableServer.Supervisor.start_link(
    name: MyDurableSup,
    prefix: "durable/"
  )

# Start individual servers through the supervisor
{:ok, {pid, _meta}} =
  DurableServer.Supervisor.start_child(
    MyDurableSup,
    {MyCounterServer, key: "user_123", initial_state: %{count: 0}}
  )

# Use the server
GenServer.call(pid, :increment) # => 1
GenServer.call(pid, :increment) # => 2
GenServer.call(pid, :get_count) # => 2
Note: for releases, :os_mon must be added to extra_applications in mix.exs:
def application do
  [
    mod: {My.Application, []},
    extra_applications: [:logger, :runtime_tools, :os_mon]
  ]
end
Advanced Example: Session Manager
defmodule UserSessionServer do
  use DurableServer, vsn: 2

  def dump_state(state), do: Map.take(state, [:user_id, :session, :last_activity_at])

  # migration logic for version 1 -> 2
  def load_state(vsn, dumped_state) do
    case vsn do
      1 ->
        # migrate to v2 logic
      _ ->
        %{
          user_id: Map.fetch!(dumped_state, "user_id"),
          session: Map.get(dumped_state, "session") || %{},
          last_activity_at: dumped_state["last_activity_at"]
        }
    end
  end

  def init(%{} = loaded_state) do
    init_state = %{loaded_state | last_activity_at: System.system_time(:millisecond)}
    {:ok, init_state, sync_every_ms: 30_000}
  end

  def handle_call({:update_session, func}, _from, state) do
    %{} = new_session = func.(state.session)
    new_state = %{state | session: new_session, last_activity_at: System.system_time(:millisecond)}
    {:reply, :ok, new_state}
  end

  def handle_call(:get_session, _from, state) do
    {:reply, state.session, %{state | last_activity_at: System.system_time(:millisecond)}}
  end

  def handle_call(:logout, _from, state) do
    {:stop, :normal, :ok, %{state | last_activity_at: System.system_time(:millisecond)}}
  end
end
Configuration Options
DurableServer supports these options in the init/1 or init/2 return tuple:
- :auto_sync - Enable automatic periodic syncing (default: false)
- :sync_every_ms - Sync interval in milliseconds (default: 30_000)
- :meta - Optional metadata to include for the globally registered server, which is returned alongside the pid with DurableServer.Supervisor.lookup/2.
- :permanent - Mark server for automatic restart by LifecycleManager (default: false)
Accessing Runtime Info
DurableServer provides runtime information through the optional init/2 callback.
The info map contains supervisor references and any user-defined data configured
via the supervisor's :init_info option.
Built-in Keys
The following keys are always present in the info map:
- :key - DurableServer key
- :supervisor - The DurableServer.Supervisor name
- :task_supervisor - Task supervisor for spawning async tasks
- :dynamic_supervisor - The DynamicSupervisor managing DurableServer processes
User-defined Keys
Pass custom data to all servers via the supervisor's :init_info option:
# In your supervision tree
{DurableServer.Supervisor,
  name: MyApp.DurableSup,
  prefix: "myapp/",
  init_info: %{api_client: MyApp.APIClient, config: %{timeout: 5000}}}
Then access it in your server's init/2:
def init(state, info) do
  api_client = info.api_client
  timeout = info.config.timeout
  # Map.merge/2 adds the keys even when they are not yet present in the loaded state
  {:ok, Map.merge(state, %{api_client: api_client, timeout: timeout})}
end
Choosing Between init/1 and init/2
- Use init/1 if you don't need access to supervisor references or custom init_info
- Use init/2 if you need the task supervisor, dynamic supervisor, or custom data
Both callbacks are optional. If you implement init/2, it takes precedence.
If neither is implemented, the default init/1 returns {:ok, state}.
State Synchronization
State is synchronized to storage in these scenarios:
- Manual sync: Return :sync from any callback, e.g. {:noreply, state, :sync}. You can also combine sync with other actions via callback options, e.g. {:noreply, state, {:continue, term}, sync: true}.
- Automatic sync: When :auto_sync is enabled, all changes are written immediately when any callback returns. Alternatively, the :sync_every_ms interval can be provided to sync changes periodically.
- Graceful shutdown: Automatically synced during normal termination, e.g. cold deploys
- Before stopping: When returning {:stop, reason, state} from callbacks
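The manual-sync return shapes listed above can be sketched as follows. This is a minimal illustration, not code from the library: the module name, message names, and state fields are hypothetical, and `use DurableServer` plus the required dump_state/1 and load_state/2 callbacks are left out so the callbacks can be read and called in isolation.

```elixir
defmodule SyncSketch do
  # Hypothetical callbacks. A real server would also `use DurableServer, vsn: 1`
  # and implement dump_state/1 and load_state/2.

  # Persist right after an important change by returning :sync.
  def handle_cast({:add_item, item}, state) do
    {:noreply, %{state | items: [item | state.items]}, :sync}
  end

  # Combine a sync with another action through callback options.
  def handle_cast({:checkout, order_id}, state) do
    {:noreply, %{state | pending: order_id}, {:continue, {:charge, order_id}}, sync: true}
  end

  # A cheap change can skip the explicit sync and be picked up by a later one
  # (for example the periodic interval or the sync on graceful shutdown).
  def handle_cast(:touch, state) do
    {:noreply, %{state | touched: state.touched + 1}}
  end
end
```

Reserving explicit syncs for changes that must not be lost keeps storage writes proportional to the amount of state that actually matters.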
Stopping Behavior
DurableServer supports different stop reasons with specific behaviors regarding exit signal propagation:
Shutdown-wrapped stops (exit signal propagates to linked processes)
- {:stop, {:shutdown, :delete}, state} - Stops and deletes from storage, exit signal propagates
- {:stop, {:shutdown, :permanent}, state} - Stops permanently, exit signal propagates. A :permanent stop makes the server no longer eligible for permanent restarts, and it remains stopped until explicitly started by DurableServer.Supervisor.start_child/2.
- {:stop, {:shutdown, :normal}, state} - Normal stop, exit signal propagates (syncs as stopped_graceful)
Shutdown-wrapped exits propagate to linked processes (allowing them to react) but don't kill them.
Non-shutdown stops (exit signal does NOT propagate to linked processes)
- {:stop, :delete, state} - Stops and deletes, silent termination (no exit signal)
- {:stop, :permanent, state} - Stops permanently, silent termination (no exit signal)
- {:stop, :normal, state} - Normal stop, silent termination (syncs as stopped_graceful)
Non-shutdown stops are transformed to :normal exits which don't propagate to linked processes.
Error stops
- {:stop, {:error, reason}, state} - Stops with error, marks as crashed, exit signal propagates
Use shutdown-wrapped stops when linked processes need to be notified of the shutdown. Use non-shutdown stops for silent termination without notifying linked processes.
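As a rough sketch of how these stop reasons might appear in callbacks: the module, messages, and state below are hypothetical, and `use DurableServer` with its required callbacks is omitted so the return tuples stand on their own.

```elixir
defmodule StopSketch do
  # Hypothetical callbacks. A real server would also `use DurableServer`.

  # Shutdown-wrapped: delete persisted state and let linked processes see the exit.
  def handle_call(:close_account, _from, state) do
    {:stop, {:shutdown, :delete}, :ok, state}
  end

  # Non-shutdown: stop silently and opt out of permanent restarts.
  def handle_call(:archive, _from, state) do
    {:stop, :permanent, :ok, state}
  end

  # Error stop: the server is marked as crashed and the exit signal propagates.
  def handle_info({:upstream_failed, reason}, state) do
    {:stop, {:error, reason}, state}
  end
end
```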
Error Handling and Recovery
DurableServers are designed to be resilient:
- Process crashes: LifecycleManager detects failures and restarts servers
- Node failures: Other nodes claim and restart orphaned processes
- Storage failures: Retries and graceful degradation where possible
- Region-aware network partitions: Consistent hashing ensures only one node manages each key and places servers in their initial region where possible
Best Practices
- Always use DurableServer.Supervisor: Never start DurableServers directly
- Design for restarts: Assume your process can be restarted on any node at any time
- Ensure load_state/2 handles migrations and avoids side effects: You must implement state migrations for schema changes across code changes. This is handled by bumping the :vsn option passed to use DurableServer and matching on old versions in your load_state/2.
Note: A lock is not acquired until init/1 is entered, so your load_state/2 callback should always be a pure function without side effects. That is, if you need process messaging, pubsub, or to perform work on process start, do so within init/1, after your state has been loaded.
- Consider appropriate sync intervals: Balance durability vs performance needs
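A minimal sketch of a versioned, side-effect-free load_state/2 follows. It assumes a backend that hands back JSON-shaped maps with string keys, and the module name, the "step" field, and its default are hypothetical. `use DurableServer, vsn: 2` is omitted so the functions can be exercised standalone.

```elixir
defmodule MigrationSketch do
  # Hypothetical module. A real server would declare `use DurableServer, vsn: 2`.

  # Version 2 persists both fields.
  def dump_state(state), do: %{count: state.count, step: state.step}

  # State written by version 1 had no "step", so supply the default.
  def load_state(1, %{"count" => count}), do: %{count: count, step: 1}

  # Current version, or a server that has never been persisted (old_vsn is nil).
  # Pure data transformation only: no messaging, pubsub, or other side effects.
  def load_state(_old_vsn, dumped) do
    %{count: Map.get(dumped, "count", 0), step: Map.get(dumped, "step", 1)}
  end
end
```

Keeping one clause per historical version makes it easy to see which shapes are still supported and to delete a clause once no stored state uses it.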
Distribution and Clustering
DurableServers work seamlessly in distributed environments:
- Processes register in a cluster-wide registry with their unique keys
- Permanent servers are started across the cluster and guarantee only a single key is started globally at a given time
- Servers can be configured with sticky placement preferences to restart on the same machine or in the same region where they were running
- Health monitoring detects failures across the cluster
- Automatic failover ensures high availability
See DurableServer.Supervisor documentation for cluster configuration options.
Capacity-Aware Placement
DurableServers support automatic capacity-aware placement with remote fallback.
Local Placement (Default)
When starting a child, the local node is tried first. If capacity limits are exceeded, remote placement is attempted automatically.
Remote Placement
If local capacity is exhausted, DurableServer automatically tries remote nodes:
- Same-region nodes first - Prioritizes nodes in the same region for lower latency
- Least busy nodes - Selects nodes with the lowest utilization across all limits
- Configurable retries - Default 3 remote nodes tried, configurable via max_placement_retries
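The ordering described above can be pictured with a small sketch. This is illustrative only and not DurableServer's implementation: the node maps, their fields, and the single `utilization` number (standing in for "lowest utilization across all limits") are assumptions made for the example.

```elixir
defmodule PlacementSketch do
  # Illustrative only: the node maps and their fields are hypothetical,
  # not DurableServer's internal representation.

  # Orders remote candidates with same-region nodes first, then by lowest
  # utilization, and keeps at most `max_placement_retries` of them.
  def candidates(nodes, local_region, max_placement_retries \\ 3) do
    nodes
    # false sorts before true, so same-region nodes come first
    |> Enum.sort_by(fn node -> {node.region != local_region, node.utilization} end)
    |> Enum.take(max_placement_retries)
    |> Enum.map(& &1.name)
  end
end
```

With `max_placement_retries` set to 0 the candidate list is empty, which corresponds to local-only placement.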
Capacity Limits
Configure capacity limits when starting a supervisor:
{DurableServer.Supervisor,
  name: MyDurableSup,
  prefix: "durable/",
  max_children: %{
    :total => 100,         # Max total children on this node
    MyModule => 50         # Max MyModule children on this node
  },
  max_cpu: 80,             # Max CPU % before rejecting
  max_memory: 85,          # Max memory % before rejecting
  max_disk: {90, "/data"}} # Max disk % on mount point before rejecting
Unlike CPU and memory limits, disk limits are bypassed for sticky restarts (children returning to their previous node) since part of the disk usage is the child's own data.
Placement Options
Control remote placement behavior per start_child call:
# Default: Try local, then up to 3 remote nodes
DurableServer.Supervisor.start_child(sup, {MyServer, key: "user_1", initial_state: %{}})

# Local only, no remote fallback
DurableServer.Supervisor.start_child(sup, {MyServer, key: "user_1", initial_state: %{}},
  max_placement_retries: 0)

# Try local, then up to 5 remote nodes
DurableServer.Supervisor.start_child(sup, {MyServer, key: "user_1", initial_state: %{}},
  max_placement_retries: 5)
Note: Automatic restarts from LifecycleManager always use max_placement_retries: 0 to place processes on their current node only, deferring to the LifecycleManagers on other nodes to manage their own node-local placement.
See DurableServer.Supervisor for full configuration details.
Sticky Placement
Sticky placement allows DurableServers to prefer restarting on nodes with specific characteristics (e.g., same machine, same region) before falling back to other nodes. This is particularly useful for things like Litestream-backed databases to avoid unnecessary S3 restores when the database is already available locally.
Sticky Configuration
Configure sticky placement per-module when starting a supervisor using a keyword list where keys are environment variable names (as atoms) and values are delay times in milliseconds:
{DurableServer.Supervisor,
  name: MyDurableSup,
  prefix: "durable/",
  sticky_placement: %{
    MyDatabaseServer => [
      FLY_MACHINE_ID: 10_000,
      FLY_REGION: 20_000,
      any: 0
    ]
  }}
Sticky placement uses environment variables to create a progressive fallback strategy with cumulative time windows. Each delay value specifies how much time to add before the next level can claim. From the above configuration:
- Level 0 (immediate): Only nodes matching FLY_MACHINE_ID can claim
- Level 1 (after 10s): Nodes matching FLY_REGION can claim
- Level 2 (after 30s): Any node (:any) can claim
The delays are cumulative - each level unlocks at the sum of all previous delays:
- Level 0 unlocks at 0ms (always immediate)
- Level 1 unlocks at 10,000ms (sum of delays before level 1)
- Level 2 unlocks at 30,000ms (10s + 20s)
The last level's delay value is unused (no subsequent level), so 0 is conventional.
Earlier levels remain eligible even after later levels unlock, maintaining preference order.
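The cumulative arithmetic can be checked with a short sketch. It only reproduces the sums described above and is not DurableServer's implementation. The module and function names are hypothetical, and treating a level as unlocked at exactly its unlock time is an assumption.

```elixir
defmodule StickyWindows do
  # Illustrative arithmetic only, not DurableServer's implementation.

  # Turns [level: delay_ms, ...] into [level: unlock_at_ms, ...].
  # Each level unlocks at the sum of the delays of all levels before it,
  # so the last level's own delay never contributes to any unlock time.
  def unlock_times(levels) do
    {unlocks, _total} =
      Enum.map_reduce(levels, 0, fn {level, delay_ms}, elapsed ->
        {{level, elapsed}, elapsed + delay_ms}
      end)

    unlocks
  end

  # Levels allowed to claim once `elapsed_ms` has passed, in preference order.
  # Assumption: a level counts as unlocked at exactly its unlock time.
  def eligible(levels, elapsed_ms) do
    for {level, unlock_at} <- unlock_times(levels), unlock_at <= elapsed_ms, do: level
  end
end
```

For the configuration above, `StickyWindows.unlock_times([FLY_MACHINE_ID: 10_000, FLY_REGION: 20_000, any: 0])` gives unlock times of 0, 10,000 and 30,000 ms, matching levels 0 through 2.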
Common Patterns
Machine stickiness with region fallback (no :any):
sticky_placement: %{
  MyServer => [
    FLY_MACHINE_ID: 20_000,
    FLY_REGION: 0
  ]
}
Same machine claims immediately, same region claims after 20s. Without :any, nodes in other regions can never claim - the server will only run in its original region.
Region stickiness, falling back to any node:
sticky_placement: %{
  MyServer => [
    FLY_REGION: 20_000,
    any: 0
  ]
}
Same region claims immediately, any node can claim after 20s.
Custom environment variables:
sticky_placement: %{
  MyServer => [
    DATACENTER: 15_000,
    AVAILABILITY_ZONE: 30_000,
    any: 0
  ]
}
Same datacenter claims immediately, same availability zone after 15s, any node after 45s.
Strict region pinning (no fallback):
sticky_placement: %{
  MyServer => [
    FLY_REGION: 0
  ]
}
Only nodes with matching FLY_REGION can claim, and they can claim immediately.
Without :any, non-matching nodes can never claim the server - it will only run
on nodes with the same FLY_REGION as where it was originally started. Use this when
data locality is critical and you'd rather the server stay down than run in the wrong
location.
Default Sticky Placement
Apply the same sticky placement configuration to all modules:
{DurableServer.Supervisor,
  name: MyDurableSup,
  prefix: "durable/",
  default_sticky_placement: [
    FLY_REGION: 20_000,
    any: 0
  ]}
Per-module configurations override the default.
Updating Sticky Placement Configuration
When a DurableServer starts, its sticky placement is captured based on the module configuration and the node's current environment variables. This placement is persisted with the server's state in object storage.
If you later change the module's sticky placement configuration (for example, adding
:any as a fallback level), running servers retain their original placement from when
they started. To ensure proper orphan claiming behavior, the lifecycle manager automatically
augments persisted placement with the :any level if present in the updated module config.
For example, if you change from:
sticky_placement: %{MyServer => [FLY_MACHINE_ID: 60_000, FLY_REGION: 0]}
To:
sticky_placement: %{MyServer => [FLY_MACHINE_ID: 60_000, FLY_REGION: 120_000, any: 0]}
Servers started before the change will have their persisted placement augmented with the
:any level at runtime. This ensures they can still be claimed by any node after their
specific placement preferences are exhausted, using the delay specified in the module config.
Other environment variable levels cannot be added retroactively since their values were determined when the server originally started.
Important Notes
- Environment variable values are captured when the server first starts
- Values are stored in the server's metadata in object storage
- nil environment variable values are preserved and can match
- The :any atom matches any node, regardless of environment variables
- Time windows are cumulative, not independent intervals
- Earlier preference levels remain eligible after later levels unlock
Monitoring Events with Group
DurableServer uses Group for distributed process groups, registry, and lifecycle monitoring.
You can call into the Group instance of your Supervisor to monitor DurableServer events:
# Monitor a specific key
:ok = Group.monitor(MyDurableSup, "user/123")
# Monitor all keys with a prefix
:ok = Group.monitor(MyDurableSup, "user/")
# Monitor all events
:ok = Group.monitor(MyDurableSup, :all)
Monitors receive {:group, events, info} tuples in their mailbox:
def handle_info({:group, events, _info}, state) do
  Enum.each(events, fn
    %Group.Event{type: :registered, key: key, pid: pid, previous_meta: nil} ->
      # A DurableServer started (previous_meta is nil for first registration)
      :ok

    %Group.Event{type: :unregistered, key: key, reason: reason} ->
      # A DurableServer stopped
      :ok

    _ ->
      :ok
  end)

  {:noreply, state}
end
Event types: :registered, :unregistered, :joined, :left
:registered and :joined events include a previous_meta field (nil for new, old meta
for re-register/re-join). Single operations produce one event per tuple; bulk operations
(nodedown, process death) batch all events together.
Joining as a Member
Non-DurableServer processes can join keys to be discoverable and receive dispatched messages:
# Join a key (e.g., from a Phoenix Channel)
:ok = Group.join(MyDurableSup, "room/123", %{type: :channel})
# Re-joining updates metadata in place
:ok = Group.join(MyDurableSup, "room/123", %{type: :channel, status: :active})
# Query all members of a key (DurableServers + joined processes)
members = Group.members(MyDurableSup, "room/123")
# => [{#PID<0.150.0>, %{...}}, {#PID<0.200.0>, %{type: :channel, status: :active}}]
# Leave when done (also happens automatically on process death)
:ok = Group.leave(MyDurableSup, "room/123")
Dispatching to Members
Send messages to all members of a key:
# From a DurableServer, broadcast to all connected channels
Group.dispatch(MyDurableSup, state.key, {:new_message, message})
Monitor vs Join
- monitor/2: Receive lifecycle events (:registered, :unregistered, :joined, :left) - system-generated
- join/3: Be discoverable via members/2 and receive dispatch/3 messages - application-level
These are independent - joining does not monitor events, and monitoring does not make you discoverable.
Summary
Callbacks
Optional callback invoked after terminate/2 and after final status sync.
Transform user state into a map for persistence.
Initializes the DurableServer with loaded state.
Transform backend-decoded persisted state back into user state format.
Functions
Returns a specification to start this module under a supervisor.
Attempt to atomically claim a restart attempt for a server.
Clear restart attempt metadata from a server object.
Fetches the DurableServer's current state from storage.
Get just the metadata for a server without the full object.
Types
@type callback_options() :: [callback_option()]
@type init_option() :: {:auto_sync, boolean()} | {:sync_every_ms, pos_integer()} | {:meta, map()} | {:permanent, boolean()}
@type sync_action() :: :sync
@type timeout_action() :: timeout() | :hibernate | {:continue, term()} | sync_action()
@type user_meta() :: map()
@type user_stop_reason() :: nil | :normal | :delete | :permanent | {:shutdown, :delete} | {:shutdown, :permanent} | {:shutdown, :normal} | {:error, term()}
Callbacks
Optional callback invoked after terminate/2 and after final status sync.
This callback is only invoked when the final status sync completed successfully
for a graceful stop (final_status: :stopped_graceful and sync_result: :ok).
The first argument is exactly the return value from terminate/2.
The second argument is an info map:
- :key - DurableServer key
- :supervisor - Supervisor name
- :final_status - Final persisted status atom
- :sync_result - :ok | {:error, term()}
- :reason - Termination reason passed to terminate/2
Transform user state into a map for persistence.
This required callback is used when saving state through the configured storage backend. It allows you to:
- Filter out keys that shouldn't be persisted (like PIDs, refs, etc.)
- Transform the state shape for storage
- Remove ephemeral data
The returned value must be a plain map at the top level. Nested values are passed through to the configured backend as-is, so they only need to be encodable by the backend you are using.
This means persisted shapes may differ by backend. For example:
- DurableServer.Backends.ObjectStore typically encodes to and decodes from JSON-shaped data with string keys
- DurableServer.Backends.EKVStore may preserve richer Elixir terms
If you plan to move data between backends, load_state/2 should be prepared to
handle multiple persisted shapes during the migration window.
Examples
def dump_state(%{count: count, temp_data: _temp} = _state) do
  # Only persist count, filter out temp_data
  %{count: count}
end
@callback handle_call(request :: term(), from :: GenServer.from(), state :: term()) :: {:reply, reply, new_state} | {:reply, reply, new_state, timeout_action()} | {:reply, reply, new_state, callback_options()} | {:reply, reply, new_state, timeout_action(), callback_options()} | {:noreply, new_state} | {:noreply, new_state, timeout_action()} | {:noreply, new_state, callback_options()} | {:noreply, new_state, timeout_action(), callback_options()} | {:stop, reason, reply, new_state} | {:stop, {:shutdown, :delete}, reply, new_state} | {:stop, {:shutdown, :permanent}, reply, new_state} | {:stop, :delete, reply, new_state} | {:stop, :permanent, reply, new_state} | {:stop, reason, new_state} | {:stop, {:shutdown, :delete}, new_state} | {:stop, {:shutdown, :permanent}, new_state} | {:stop, :delete, new_state} | {:stop, :permanent, new_state} when reply: term(), new_state: term(), reason: term()
@callback handle_cast(request :: term(), state :: term()) :: {:noreply, new_state} | {:noreply, new_state, timeout_action()} | {:noreply, new_state, callback_options()} | {:noreply, new_state, timeout_action(), callback_options()} | {:stop, reason :: term(), new_state} | {:stop, {:shutdown, :delete}, new_state} | {:stop, {:shutdown, :permanent}, new_state} | {:stop, :delete, new_state} | {:stop, :permanent, new_state} when new_state: term()
@callback handle_continue(continue :: term(), state :: term()) :: {:noreply, new_state} | {:noreply, new_state, timeout_action()} | {:noreply, new_state, callback_options()} | {:noreply, new_state, timeout_action(), callback_options()} | {:stop, reason :: term(), new_state} | {:stop, {:shutdown, :delete}, new_state} | {:stop, {:shutdown, :permanent}, new_state} | {:stop, :delete, new_state} | {:stop, :permanent, new_state} when new_state: term()
@callback handle_info(msg :: :timeout | term(), state :: term()) :: {:noreply, new_state} | {:noreply, new_state, timeout_action()} | {:noreply, new_state, callback_options()} | {:noreply, new_state, timeout_action(), callback_options()} | {:stop, reason :: term(), new_state} | {:stop, {:shutdown, :delete}, new_state} | {:stop, {:shutdown, :permanent}, new_state} | {:stop, :delete, new_state} | {:stop, :permanent, new_state} when new_state: term()
@callback init(loaded_state :: map()) :: :ignore | {:ok, state :: term()} | {:ok, state :: term(), [init_option()]}
Initializes the DurableServer with loaded state.
This callback is invoked after the server acquires its global lock and loads
any persisted state. You can implement either init/1 or init/2:
- init/1 - Receives only the loaded state
- init/2 - Receives the loaded state and an info map with runtime information
If you implement init/2, it takes precedence over init/1.
The Info Map (init/2)
The info map in init/2 contains:
- :key - The DurableServer key
- :supervisor - The supervisor name (e.g., MyApp.DurableSup)
- :task_supervisor - The task supervisor for async operations
- :dynamic_supervisor - The dynamic supervisor managing DurableServer processes
- Any user-defined keys from the supervisor's :init_info option
Return Values
- {:ok, state} - Initialize with the given state
- {:ok, state, opts} - Initialize with state and options
- :ignore - Don't start the server, sync as stopped_graceful
Options
- :auto_sync - Enable automatic syncing on every callback return (default: false)
- :sync_every_ms - Periodic sync interval in milliseconds (default: 30_000)
- :meta - User metadata returned by DurableServer.Supervisor.lookup/2
- :permanent - Mark server for automatic restart by LifecycleManager (default: false)
Examples
# Simple init/1
def init(state) do
  {:ok, state, permanent: true}
end

# init/2 with runtime info
def init(state, info) do
  # Access built-in values
  %{key: key, task_supervisor: task_sup} = info
  # Access user-defined values from supervisor's init_info
  api_client = info.api_client
  {:ok, Map.merge(state, %{task_sup: task_sup, api_client: api_client})}
end
@callback init(loaded_state :: map(), info :: map()) :: :ignore | {:ok, state :: term()} | {:ok, state :: term(), [init_option()]}
@callback load_state(old_vsn :: pos_integer() | nil, persisted_state :: map()) :: map()
Transform backend-decoded persisted state back into user state format.
This required callback is used when loading state from the configured backend. It allows you to:
- Convert backend-specific persisted shapes into your runtime state format
- Set default values for missing keys
- Initialize ephemeral state that wasn't persisted
On first boot for a never-before-persisted server, DurableServer encodes and
decodes the result of dump_state/1 through the configured backend before
calling load_state/2. This keeps the first-boot shape consistent with the
shape you will receive on later restarts for that backend.
Persisted state is backend-dependent. For example:
- DurableServer.Backends.ObjectStore usually passes JSON-decoded maps with string keys
- DurableServer.Backends.EKVStore may pass maps with atom keys or other native Elixir terms
During backend migrations, it is valid for load_state/2 to receive multiple
historical shapes until the migration is complete.
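One way to tolerate both shapes during such a window is to look up each field under either key form. The sketch below is an assumption-laden illustration: the module, the `fetch_either!/2` helper, and the field names are hypothetical, and `use DurableServer` is omitted so the function runs standalone.

```elixir
defmodule ShapeSketch do
  # Hypothetical module. A real server would also `use DurableServer`.

  # Accepts either string keys (JSON-shaped) or atom keys (native terms).
  def load_state(_old_vsn, persisted) do
    %{
      count: fetch_either!(persisted, :count),
      # Ephemeral field that is never persisted
      temp_data: nil
    }
  end

  # Looks up `key` as an atom first, then as its string form.
  # Raises KeyError if neither form is present.
  defp fetch_either!(map, key) when is_atom(key) do
    case map do
      %{^key => value} -> value
      _ -> Map.fetch!(map, Atom.to_string(key))
    end
  end
end
```

Once every stored object has been rewritten by the new backend, the fallback branch can be removed.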
For a server that has never been persisted, the old_vsn will be nil.
Note: the function is NOT guaranteed to be idempotent. The durable server
is not considered started until after load_state/2 is run and a lock is
successfully obtained with your loaded state. Concurrent nodes can race your
state load and acquire the lock before you, so this function should not issue
side effects like calling other processes. Perform such side effect work
inside init/1, which is guaranteed to have started your durable server with
a successful global lock.
Examples
def load_state(_old_vsn, dumped_state) do
  # Convert string keys to atoms and add ephemeral state
  %{
    count: Map.fetch!(dumped_state, "count"),
    temp_data: nil,
    status: :initialized
  }
end
Functions
Returns a specification to start this module under a supervisor.
See Supervisor.
Attempt to atomically claim a restart attempt for a server.
Returns :ok if the claim succeeds, or {:error, reason} if it fails.
Clear restart attempt metadata from a server object.
Fetches the DurableServer's current state from storage.
Get just the metadata for a server without the full object.