Raft v0.2.1

Raft provides users with an API for building consistent (as defined by CAP), distributed state machines. It does this using the Raft leader election and consensus protocol as described in the original paper.

Example

Let’s create a distributed key-value store. The first thing that we’ll need is a state machine:

defmodule KVStore do
  use Raft.StateMachine

  @initial_state %{}

  def set(name, key, value) do
    Raft.write(name, {:set, key, value})
  end

  def get(name, key) do
    Raft.read(name, {:get, key})
  end

  def init(_name) do
    {:ok, @initial_state} 
  end

  def handle_write({:set, key, value}, state) do
    {{:ok, key, value}, put_in(state, [key], value)}
  end

  def handle_read({:get, key}, state) do
    case get_in(state, [key]) do
      nil ->
        {{:error, :key_not_found}, state}
      value ->
        {{:ok, value}, state}
    end
  end
end

Now we can start our peers:

{:ok, _pid} = Raft.start_peer(KVStore, name: :s1)
{:ok, _pid} = Raft.start_peer(KVStore, name: :s2)
{:ok, _pid} = Raft.start_peer(KVStore, name: :s3)

Each node must be given a unique name within the cluster. At this point our nodes are started but they’re all followers and don’t know anything about each other. We need to set the configuration so that they can communicate:

Raft.set_configuration(:s1, [:s1, :s2, :s3])

Once this runs the peers will start an election and elect a leader. You can check the current leader like so:

leader = Raft.leader(:s1)

Once we have the leader we can read and write to our state machine:

{:error, :key_not_found} = KVStore.get(leader, :foo)
{:ok, :foo, :bar} = KVStore.set(leader, :foo, :bar)
{:ok, :bar} = KVStore.get(leader, :foo)

We can now shut down our leader and check that a new leader has been elected and that our state is replicated across all nodes:

Raft.stop_peer(leader)

# wait for election...

new_leader = Raft.leader(:s2)
{:ok, :bar} = KVStore.get(new_leader, :foo)

We now have a consistent, replicated key-value store.
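The "wait for election" step above can be made explicit by polling until a peer reports a leader. The following is a minimal sketch, not part of the library: the `LeaderWait` module, its `await/3` function, and the default attempt count and interval are all hypothetical. It only relies on the documented fact that `Raft.leader/1` returns either a peer or `:none`.

```elixir
defmodule LeaderWait do
  # Hypothetical helper, not part of the Raft library.
  # `lookup` is any zero-arity function that returns a peer or :none,
  # for example `fn -> Raft.leader(:s2) end`.
  def await(lookup, attempts \\ 50, interval_ms \\ 100)

  # Give up once every attempt has been used.
  def await(_lookup, 0, _interval_ms), do: {:error, :no_leader}

  def await(lookup, attempts, interval_ms) when attempts > 0 do
    case lookup.() do
      :none ->
        # No leader known yet: sleep, then poll again.
        Process.sleep(interval_ms)
        await(lookup, attempts - 1, interval_ms)

      leader ->
        {:ok, leader}
    end
  end
end
```

With this in place, the example above could use `{:ok, new_leader} = LeaderWait.await(fn -> Raft.leader(:s2) end)` instead of a fixed wait.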

Failures and re-elections

Network disconnects and other failures will happen. When they do, the peers might elect a new leader, and calls made during or after the election can return errors like these:

{:error, :election_in_progress} = KVStore.get(leader, :foo)
{:error, {:redirect, new_leader}} = KVStore.get(leader, :foo)
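Callers usually want to handle these two errors rather than crash on them. Here is one possible client-side wrapper, offered as a sketch: `KVClient`, `call/4`, the retry count, and the backoff are hypothetical names and values, and only the two error tuples shown above are taken from the library.

```elixir
defmodule KVClient do
  # Hypothetical client-side wrapper, not part of the Raft library.
  # `request` is a one-arity function that takes a peer, for example
  # `fn peer -> KVStore.get(peer, :foo) end`.
  def call(peer, request, retries \\ 5, backoff_ms \\ 100)

  # Stop once the retry budget is spent.
  def call(_peer, _request, 0, _backoff_ms), do: {:error, :retries_exhausted}

  def call(peer, request, retries, backoff_ms) do
    case request.(peer) do
      {:error, :election_in_progress} ->
        # No leader yet: back off and ask the same peer again.
        Process.sleep(backoff_ms)
        call(peer, request, retries - 1, backoff_ms)

      {:error, {:redirect, new_leader}} ->
        # The peer told us who the leader is: retry against it.
        call(new_leader, request, retries - 1, backoff_ms)

      other ->
        # Successful replies and all other errors are returned unchanged.
        other
    end
  end
end
```

Bounding the number of retries keeps a caller from looping forever if the cluster cannot elect a leader.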

State Machine Message Safety

The commands sent to each state machine are opaque to the Raft protocol, and no validation is done to ensure that they conform to what the user state machine expects. The log entries carrying these commands are also persisted. This means that if a command causes the user state machine to crash, it will keep crashing the Raft process until the state machine code is changed. There is no mechanism for removing an entry from the log, so great care must be taken to ensure that messages don’t “poison” the log and state machine.
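One way to reduce this risk is to validate every command on the client side before it reaches the log. The sketch below assumes the `{:set, key, value}` and `{:get, key}` command shapes from the example above. The `SafeCommands` module and the restriction to atom or binary keys are illustrative assumptions, not library behaviour.

```elixir
defmodule SafeCommands do
  # Hypothetical validation layer, not part of the Raft library.
  # Only command shapes the state machine has a clause for are accepted.
  # Restricting keys to atoms and binaries is an illustrative choice.
  def validate({:set, key, _value} = cmd) when is_atom(key) or is_binary(key),
    do: {:ok, cmd}

  def validate({:get, key} = cmd) when is_atom(key) or is_binary(key),
    do: {:ok, cmd}

  # Everything else is rejected before it can be written to the log.
  def validate(other), do: {:error, {:invalid_command, other}}
end
```

A function such as `KVStore.set/3` could call `SafeCommands.validate/1` first and only pass the command to `Raft.write/2` on `{:ok, cmd}`.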

Log Storage

The log and metadata store are persisted to disk using RocksDB. This gives us a well-known, well-supported storage engine that also handles compaction. The log store is built as an adapter, so it’s possible to write other adapters for persistence.

Protocol Overview

Raft is a complex protocol and all of the details won’t be covered here. This is an attempt to cover the high level topics so that users can make more informed technical decisions.

Key Terms:

  • Cluster - A group of peers. These peers must be explicitly set.

  • Peer - A server participating in the cluster. Peers route messages, participate in leader election and store logs.

  • Log - The log is an ordered sequence of entries. Each entry is replicated on each peer, and we consider the log to be consistent if all peers in the cluster agree on the entries and their order. Each entry contains a binary blob which is opaque to the Raft protocol but has meaning in the user’s state machine. The log is persisted to the local file system.

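  • Quorum - A majority of peers. In Raft this is floor(n/2) + 1, where n is the number of peers in the cluster. Using a quorum allows a minority of peers to be unavailable during leader election or replication. For example, a 5-peer cluster has a quorum of 3 and keeps working with 2 peers unavailable.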

  • Leader - At any time there is at most one leader in a cluster. All reads, writes, and configuration changes must pass through the leader in order to provide consistency. It’s the leader’s responsibility to replicate logs to all other members of the cluster.

  • Committed - A log entry is “committed” if the leader has replicated it to a majority of peers. Only committed entries are applied to the user’s state machine.
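To make the quorum arithmetic concrete, a majority of n peers is floor(n/2) + 1. The small module below is a sketch; `QuorumMath` and its function names are hypothetical and not part of the library.

```elixir
defmodule QuorumMath do
  # Hypothetical helper module for illustration only.

  # Smallest number of peers that forms a majority of a cluster of `n` peers.
  def quorum(n) when is_integer(n) and n > 0, do: div(n, 2) + 1

  # Number of peers that can be unavailable while a quorum is still reachable.
  def tolerated_failures(n) when is_integer(n) and n > 0, do: n - quorum(n)
end
```

Note that a 4-peer cluster tolerates no more failures than a 3-peer cluster, which is why odd cluster sizes are the usual choice.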

Each peer can be in 1 of 3 states: follower, leader, or candidate. When a peer is started it starts in a follower state. If a follower does not receive messages within a random timeout it transitions to a candidate and starts a new election.
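The random timeout matters because it makes it unlikely that several followers become candidates at the same moment. A minimal sketch of picking such a timeout follows. The `ElectionTimer` module is hypothetical, and the 150 to 300 ms bounds are placeholders for illustration, not the values this library uses.

```elixir
defmodule ElectionTimer do
  # Hypothetical module. The bounds are illustrative placeholders only;
  # the values this library uses are not stated in this document.
  @min_ms 150
  @max_ms 300

  # Picks a timeout uniformly at random from the inclusive range min_ms..max_ms.
  def random_timeout(min_ms \\ @min_ms, max_ms \\ @max_ms) when min_ms <= max_ms do
    # :rand.uniform(n) returns an integer in 1..n, so shift it down by one.
    min_ms + :rand.uniform(max_ms - min_ms + 1) - 1
  end
end
```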

During an election the candidate requests votes from all of the other peers. If the candidate receives enough votes to have a quorum, it transitions to the leader state and informs all of the other peers that it is the new leader.
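The two transitions described so far can be written as a small pure function. This is only a sketch of the state changes named in the text. The `PeerState` module and the event names are hypothetical, and the library's real state machine handles more cases than this.

```elixir
defmodule PeerState do
  # Hypothetical module and event names; only the transitions described
  # in the text above are modeled.

  # A follower that hears nothing before its timeout fires becomes a candidate.
  def transition(:follower, :election_timeout), do: :candidate

  # A candidate holding votes from a quorum of the cluster becomes the leader.
  def transition(:candidate, {:votes, votes, cluster_size})
      when votes >= div(cluster_size, 2) + 1,
      do: :leader

  # Any other event leaves the state unchanged in this sketch.
  def transition(state, _event), do: state
end
```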

The leader accepts all reads and writes for the cluster. If a write occurs, the leader creates a new log entry and replicates that entry to the other peers. If a peer’s log is missing any entries, the leader brings the peer up to date by replicating the missing entries. Once the new log entry has been replicated to a majority of peers, the leader “commits” the new entry and applies it to the user’s state machine.

In order to provide consistent reads, a leader must ensure that it still maintains a quorum. Before executing a read, the leader sends a message to each follower to confirm that it is still the leader. This provides consistent views of the data, but it costs a round of messages per read, which can hurt performance for read-heavy workloads.
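The commit step for writes can be illustrated by computing the highest log index that is stored on a majority of peers. This is a simplified sketch: the `CommitIndex` module and the `match_indexes` list are hypothetical names, and the full Raft protocol places further conditions on commitment that are omitted here.

```elixir
defmodule CommitIndex do
  # Hypothetical helper, not part of the Raft library.
  # `match_indexes` holds, for every peer including the leader, the highest
  # log index known to be stored on that peer.
  def majority_index(match_indexes) when match_indexes != [] do
    quorum = div(length(match_indexes), 2) + 1

    # After sorting in descending order, the element at position quorum - 1
    # is the highest index that at least `quorum` peers have reached.
    match_indexes
    |> Enum.sort(:desc)
    |> Enum.at(quorum - 1)
  end
end
```

For a 3-peer cluster whose peers hold entries up to indexes 5, 3, and 4, the function returns 4, since only two peers hold entry 4 and two is a majority of three.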

Each time the followers receive a message they reset their “election timeout”. This process continues until a follower times out and starts a new election, starting the cycle again.

Summary

Functions

get_entry(to, index)
Gets an entry from the log. This should only be used for testing purposes.

leader(name)
Returns the leader according to the given peer.

read(leader, cmd, timeout \\ 3000)
Reads state that has been applied to the state machine.

set_configuration(peer, configuration)
Sets the peer configuration. The new configuration will be merged with any existing configuration.

start_peer(mod, opts)
Starts a new peer with a given Config.t.

status(name)
Returns the current status for a peer. This is used for debugging and testing purposes only.

stop_peer(name)
Gracefully stops the node.

test_cluster()
Creates a test cluster for running on a single node. Should only be used for development and testing.

test_node(name)

write(leader, cmd, timeout \\ 3000)
Applies a new change to the application FSM in a consistent manner. This operation blocks until the log has been replicated to a majority of servers.

Types

opts()
opts() :: [name: peer(), config: Raft.Config.t()]

peer()
peer() :: atom() | {atom(), atom()}

Functions

get_entry(to, index)
get_entry(peer(), non_neg_integer()) ::
  {:ok, Raft.Log.Entry.t()} | {:error, term()}

Gets an entry from the log. This should only be used for testing purposes.

leader(name)
leader(peer()) :: peer() | :none

Returns the leader according to the given peer.

read(leader, cmd, timeout \\ 3000)
read(peer(), term(), any()) ::
  {:ok, term()} | {:error, :timeout} | {:error, :not_leader}

Reads state that has been applied to the state machine.

set_configuration(peer, configuration)
set_configuration(peer(), [peer()]) ::
  {:ok, Raft.Configuration.t()} | {:error, term()}

Sets the peer configuration. The new configuration will be merged with any existing configuration.

start_peer(mod, opts)
start_peer(module(), opts()) :: {:ok, pid()} | {:error, term()}

Starts a new peer with a given Config.t.

status(name)
status(peer()) :: {:ok, %{}} | {:error, :no_node}

Returns the current status for a peer. This is used for debugging and testing purposes only.

stop_peer(name)

Gracefully stops the node.

test_cluster()
test_cluster() :: {peer(), peer(), peer()}

Creates a test cluster for running on a single node. Should only be used for development and testing.

test_node(name)

write(leader, cmd, timeout \\ 3000)
write(peer(), term(), any()) ::
  {:ok, term()} | {:error, :timeout} | {:error, :not_leader}

Applies a new change to the application FSM in a consistent manner. This operation blocks until the log has been replicated to a majority of servers.