Stoker behaviour (stoker v0.1.3)
Stoker makes sure that the processes you care about keep running somewhere in your cluster, no matter what happens.
One of the big ideas behind Elixir/Erlang is distribution as a way to address faults - if you have multiple servers running in a cluster and one of them dies, the others keep on churning. This is quite easy to do when those servers are all alike, e.g. a webserver running a Phoenix app - wherever the request lands, it is processed there. But you can do this easily in any environment - Java, Go, whatever.
What is more interesting in the Elixir/Erlang world is the ability to have cluster-unique processes that end up being distributed and surviving machine faults. This comes up a lot for me - I often have to write data import jobs that retrieve, rewrite and forward data. I want them to be always available, I can accept small glitches (because a process may die on my end or on the other end), but I never want two processes running the same job at the same time.
Stoker is the foundation upon which this is built. A single instance of Stoker is always running in the cluster - and if it dies, another one is restarted on a surviving node. It has full visibility of the cluster, and a list of jobs it must keep up. It will try to make sure that all of them are available, and if some are missing, will restart them on one of the available nodes. Usually, this happens by delegating them to a local DynamicSupervisor, which, once started, will do its magic to keep the process running.
Every once in a while, or when receiving a wake-up signal, it will wake up again and make sure that everything is in order. This is because it is quite likely that you have a dynamic list of jobs you want to run - e.g. you may add or remove them from a database table - so Stoker will react automatically when you add a new one. It is also possible that some external condition made the DynamicSupervisor give up, so it may be appropriate to restart the process on another random node and try again.
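To make that "exactly one instance in the cluster" requirement concrete, here is a minimal sketch (MyApp.ImportJob is a made-up module, not part of Stoker): registering the job under a :global name is what makes it cluster-unique.

```elixir
defmodule MyApp.ImportJob do
  use GenServer

  # Registering under {:global, __MODULE__} makes the name unique
  # cluster-wide: a second start attempt on any node returns
  # {:error, {:already_started, pid}} instead of a duplicate process.
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: {:global, __MODULE__})
  end

  @impl true
  def init(opts), do: {:ok, opts}
end
```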
Implementation
Here is a typical scenario running Stoker. Our application runs on two nodes, and our goal is to keep Process X alive somewhere.
On start-up, the Stoker Activator process on node 1 becomes active (we say it is the Leader); the one on node 2, started a few seconds later, notices that there is already an active one and becomes a Follower of the Leader, waiting for its brother on node 1 to terminate.
So the process on node 1 calls into your module - one you created implementing the Stoker behaviour - and asks it to check and activate all the processes it needs to. Your module chooses a random DynamicSupervisor out of the node pool - in this case, the one on node 2 - and asks it to start supervising Process X.
Every once in a while, Stoker on node 1 wakes up and calls your module to make sure that all processes that are supposed to be alive actually are; if not, they are started again.
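A hedged sketch of what "check and activate" could look like, assuming a hypothetical MyApp.ProcessX worker that registers itself under {:global, MyApp.ProcessX}, and the per-node DynamicSupervisors registered as {:global, {StokerDS, node}} as shown in the "Using in practice" section below:

```elixir
defmodule MyApp.EnsureProcessX do
  # Hypothetical helper: if Process X is not globally registered, pick a
  # random node and start it under that node's DynamicSupervisor.
  def ensure_running do
    case :global.whereis_name(MyApp.ProcessX) do
      pid when is_pid(pid) ->
        :ok

      :undefined ->
        node = Enum.random([Node.self() | Node.list()])

        DynamicSupervisor.start_child(
          {:global, {StokerDS, node}},
          {MyApp.ProcessX, []}
        )
    end
  end
end
```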
What happens if node 1 dies?
Stoker on node 2 will notice that its brother failed, so it will become the Leader. It will then trigger your module, which will notice that Process X is still available and do nothing more. If there were any persistent processes on node 1, they would be started again on node 2, as that is the only remaining node.
If node 1 restarts, Stoker on node 1 will notice that there is already a Leader on node 2, so it will become a Follower - it will put a watch on the Leader and will wait for it to become unavailable.
What happens if node 2 dies?
If node 2 dies, Stoker on node 1 - that is already a Leader - receives an update that the cluster composition changed; it will trigger your module, that will notice that Process X is not running, and will start it again on the only remaining node.
When node 2 restarts, Stoker on it will notice that there is already a Leader on node 1, and will become a Follower.
What happens if we add node 3?
If we add a new node, Stoker on node 1 receives an update that the cluster composition changed; it will trigger your module, that will notice that Process X is still running, and won't do anything.
The new Stoker on node 3 will become a Follower, like the one on node 2 is. If node 1 becomes unavailable, both will race to become the new Leader; one of them will succeed, and the other one will become a Follower.
As node 3 has its own DynamicSupervisor, it will become eligible to run new processes as the need arises.
Network partitions
This is the tough one.
On a netsplit, both nodes keep running but neither of them sees the other. In this case, each Stoker will become Leader and run Process X in its own pool. In any case, your module will be notified, so you can decide what to do - terminate all processes, wait a bit, whatever.
When the netsplit heals, one of the Stoker processes is terminated, and the same happens for every process that was registered twice. So one of the Stoker processes will remain Leader; the one that was terminated will respawn and become a Follower.
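If you decide that the losing side should give up its processes, a hypothetical helper could stop everything supervised locally (assuming the {:global, {StokerDS, node()}} DynamicSupervisor registration from the start-up example below):

```elixir
defmodule MyApp.SplitHandler do
  # Hypothetical helper you could call from your Stoker module when it is
  # notified of a netsplit: stop every child of this node's
  # DynamicSupervisor, so the winning side can restart them cleanly.
  def terminate_local_children do
    sup = {:global, {StokerDS, Node.self()}}

    for {_, pid, _, _} <- DynamicSupervisor.which_children(sup), is_pid(pid) do
      DynamicSupervisor.terminate_child(sup, pid)
    end

    :ok
  end
end
```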
Rebalancing
There is no facility to do rebalancing yet, but as your module is triggered on network events, you could do that.
Guaranteeing uniqueness
To guarantee uniqueness of running processes, we use the battle-tested :global naming module, which makes sure that a name is registered only once.
It's not very fast, but it can manage thousands of registrations per second on moderately-sized clusters and it's supposedly very reliable, so it's good enough for what we need here.
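As a quick, Stoker-independent illustration of that guarantee (using an Agent and a made-up :my_unique_job name):

```elixir
# The first registration of the :global name succeeds...
{:ok, pid} = Agent.start_link(fn -> :state end, name: {:global, :my_unique_job})

# ...and any later attempt, from any node in the cluster, is rejected
# with the pid of the process that already owns the name.
{:error, {:already_started, ^pid}} =
  Agent.start_link(fn -> :state end, name: {:global, :my_unique_job})
```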
Using in practice
Start-up within an application
The Activator is implemented by Stoker.Activator, a GenServer that will call your Stoker behaviour when needed.
So in your Application sequence you will need to add:
```elixir
{DynamicSupervisor, [name: {:global, {StokerDS, node()}}, strategy: :one_for_one]},
{Stoker.Activator, xx.MyStoker},
```
In the first row, we ask each node of the cluster to start up a DynamicSupervisor and register it on :global under the name {StokerDS, node1@cluster}. This way we can address it easily from any node in the cluster.
In the second row, we start a Stoker.Activator GenServer that, when acting as the Leader, registers itself as {Stoker.Activator, your_module_name}, so you can have more than one running on the same cluster, each driven by a different module.
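Putting it together, a minimal Application module could look like this (a sketch only: MyApp, MyApp.Supervisor and MyApp.MyStoker are placeholder names for your own application and callback module):

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # one DynamicSupervisor per node, addressable from anywhere via :global
      {DynamicSupervisor, [name: {:global, {StokerDS, node()}}, strategy: :one_for_one]},
      # one Activator per Stoker callback module; a single Leader per module
      # will be active somewhere in the cluster
      {Stoker.Activator, MyApp.MyStoker}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```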
The Stoker life-cycle
The life-cycle of a Stoker callback module is modelled on the one that a GenServer offers.
There is a state term that can be used to hold state between calls (but only on the same server - the state is not shared with Followers).
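As a minimal sketch only (the return values of init/0 and next_timer_in/1 are assumptions here, and MyApp.EnsureProcessX is the hypothetical helper from the scenario above - check the typespecs below and in your Stoker version), a callback module could look roughly like this:

```elixir
defmodule MyApp.MyStoker do
  @behaviour Stoker

  # Assumption: init/0 returns the initial state term.
  @impl true
  def init, do: %{}

  # Keep it simple here; a netsplit-aware version is sketched under
  # cluster_valid?/1 in the reference below.
  @impl true
  def cluster_valid?(_state), do: :yes

  # On leadership changes, cluster changes, timer ticks or explicit
  # triggers, make sure our cluster-unique processes are up.
  @impl true
  def event(state, event_type, _reason)
      when event_type in [:now_leader, :cluster_change, :timer, :trigger] do
    MyApp.EnsureProcessX.ensure_running()
    {:ok, state}
  end

  def event(state, _event_type, _reason), do: {:ok, state}

  # Assumption: next_timer_in/1 returns the milliseconds to wait before
  # the next :timer wake-up.
  @impl true
  def next_timer_in(_state), do: :timer.minutes(5)
end
```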
Types
activator_event()

```elixir
@type activator_event() ::
        :now_leader
        | :now_follower
        | :cluster_change
        | :cluster_split
        | :timer
        | :trigger
        | :shutdown
```

activator_state()

```elixir
@type activator_state() :: :leader | :follower
```
Callbacks
cluster_valid?(stoker_state)

```elixir
@callback cluster_valid?(stoker_state :: term()) :: :yes | :no | :cluster_split
```
When the cluster composition changes, this callback determines whether we are on the losing side of a netsplit or not.
For example, in a three-node cluster we may protect against netsplits by deciding that any partition that can see fewer than two nodes (itself included) is on the losing side, and should therefore terminate all of its processes - see the sketch after the list below.
If the answer is:
- :yes - the cluster is valid; calls event :cluster_change
- :cluster_split - the cluster is invalid, that is, we are on the losing side of a net-split, so call event :cluster_split so local processes can be terminated
- :no - the cluster is invalid, but do not raise any event
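As a hedged illustration of that three-node rule (a fragment meant to live inside a callback module such as the MyApp.MyStoker sketch above):

```elixir
@impl true
def cluster_valid?(_state) do
  # Count the nodes this node can currently see, itself included.
  visible_nodes = length([Node.self() | Node.list()])

  # Fewer than two visible nodes: assume we are on the losing side of a
  # netsplit, so the :cluster_split event will be raised.
  if visible_nodes >= 2, do: :yes, else: :cluster_split
end
```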
event(stoker_state, event_type, reason)

```elixir
@callback event(stoker_state :: term(), event_type :: activator_event(), reason :: term()) ::
            {:ok, new_state :: term()} | {:error, reason :: term()}
```
init()
next_timer_in(stoker_state)
Functions
hello()
Hello world.
Examples
```elixir
iex> Stoker.hello()
:world
```