Copyright © 2021 VMware, Inc. or its affiliates. All rights reserved.
Version: 0.1.0
Authors: Jean-Sébastien Pédron (jean-sebastien@rabbitmq.com), Karl Nilsson (nkarl@vmware.com), The RabbitMQ team (info@rabbitmq.com).
Khepri is a tree-like replicated on-disk database library for Erlang and Elixir.
Data are stored in a tree structure. Each node in the tree is referenced by its path from the root node. A path is a list of Erlang atoms and/or binaries. For ease of use, Unix-like path strings are accepted as well.
For consistency and replication and to manage data on disk, Khepri relies on Ra, an Erlang implementation of the Raft consensus algorithm. In Ra parlance, Khepri is a state machine in a Ra cluster.
This page describes all the concepts in Khepri and points the reader to the modules' documentation for more details.
This started as an experiment to replace how data other than message bodies are stored in the RabbitMQ messaging broker. Before Khepri, those data were stored and replicated to cluster members using Mnesia.
Mnesia is very handy and powerful. However, recovering from a network partition is quite difficult. This was the primary reason why the RabbitMQ team started to explore other options.
Because RabbitMQ already uses an implementation of the Raft consensus algorithm for its quorum queues, it was decided to leverage that library for all metadata. That's how Khepri was born.
Thanks to Ra and Raft, it is clear how Khepri will behave during a network partition and how it will recover from it. This removes a lot of unknowns for the RabbitMQ team and its users.
At the time of this writing, RabbitMQ does not use Khepri in a production release yet because this library and its integration into RabbitMQ are still a work in progress.
Khepri stores data in tree nodes (khepri_machine:tree_node()) organized in a tree structure. Every tree node has a name, an optional payload and a set of properties, all detailed below. Here is an example of such a tree:
o
|
+-- orders
|
`-- stock
    |
    `-- wood
        |-- <<"maple">> = 12
        `-- <<"oak">> = 41
A tree node name is either an Erlang atom or an Erlang binary (khepri_path:node_id()).
A tree node may or may not have a payload. Khepri currently supports a single type of payload, the data payload. More payload types may be added in the future.
Payloads are represented using macros or helper functions:
* ?NO_PAYLOAD and khepri:no_payload/0
* ?DATA_PAYLOAD(Term) and khepri:data_payload/1
Functions in khepri_machine make no assumption about the type of the payload because they are a low-level API. Therefore, it must be specified explicitly using the macros or helper functions mentioned above.
Most functions in khepri, being a higher-level API, target more specific use cases and assume a particular type of payload.
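For illustration, here is a minimal sketch contrasting the two levels of API. It assumes the macros are available through Khepri's header file and reuses a bound StoreId variable, as in the low-level examples further below:

%% High-level API: the term 150 is implicitly stored as a data payload.
khepri:insert([stock, wood, <<"oak">>], 150),

%% Low-level API: the payload type must be given explicitly, using either
%% the macros or the equivalent helper functions.
khepri_machine:put(StoreId, [stock, wood, <<"oak">>], ?DATA_PAYLOAD(150)),
khepri_machine:put(StoreId, [stock, wood], khepri:no_payload()).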
In addition to its name and payload, a tree node also has properties:
* the version of its payload (khepri_machine:payload_version())
* the version of its list of child nodes (khepri_machine:child_list_version())
* the count of its child nodes (khepri_machine:child_list_count())

The equivalent of a key in a key/value store is a path (khepri_path:path()) in Khepri.
A path is a list of tree node names, i.e. Erlang atoms and/or binaries, leading to the target tree node (khepri_path:path()). For instance:
%% Points to "/stock/wood/oak" in the tree shown above:
Path = [stock, wood, <<"oak">>].
It is possible to target multiple tree nodes at once by using a path pattern (khepri_path:pattern()). In addition to node IDs, path patterns may contain conditions (khepri_condition:condition()), which allow matching on more than just node names, for example:
%% Matches all varieties of wood in the stock:
PathPattern = [stock, wood, #if_node_matches{regex = any}].

%% Matches the supplier of oak if there is an active order:
PathPattern = [order,
               wood,
               #if_all{conditions = [
                 <<"oak">>,
                 #if_data_matches{pattern = {active, true}}]},
               supplier].
Finally, a path can use some special path component names, handy when using relative paths:
* ?THIS_NODE to point to self
* ?PARENT_NODE to point to the parent tree node
* ?ROOT_NODE to explicitly point to the root unnamed node

Relative paths are useful when putting conditions on tree node lifetimes.
A tree node's lifetime starts when it is inserted for the first time and ends when it is removed from the tree. However, intermediary tree nodes created on the way remain in the tree long after the leaf node was removed.
For instance, when [stock, wood, <<"walnut">>] is inserted, the intermediary tree nodes stock and wood are created if they are missing. After <<"walnut">> is removed, they stay in the tree, possibly with neither payload nor child nodes.
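A small sketch of this behaviour, using the high-level API described later:

%% Inserting the walnut entry creates [stock] and [stock, wood] on the way.
khepri:insert([stock, wood, <<"walnut">>], 10),
khepri:delete([stock, wood, <<"walnut">>]),

%% The intermediary tree node is still there after the leaf was removed.
true = khepri:exists([stock, wood]).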
Khepri has the concept of keep_until conditions. A keep_until condition is like the conditions which can be used inside path patterns. When a tree node is inserted or updated, it is possible to set keep_until conditions: when these conditions evaluate to false, the tree node is removed from the tree.
For example, a keep_until condition can be set on [stock, wood] to make sure it is removed after its last child node is removed:
%% We keep [stock, wood] as long as its child nodes count is strictly greater
%% than zero.
KeepUntilCondition = #{[stock, wood] => #if_child_list_count{count = {gt, 0}}}.
Note that keep_until conditions on self (like the example above) are not evaluated on the first insert.
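Since relative paths are useful in these conditions as well, the same condition can be written without naming the tree node. This is a sketch assuming relative path components are accepted as keys in the keep_until conditions map:

%% Same effect as above, expressed relative to the tree node the condition
%% is attached to: keep it as long as it has at least one child node.
KeepUntilCondition = #{[?THIS_NODE] => #if_child_list_count{count = {gt, 0}}}.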
A high-level API is provided by the khepri
module. It covers most
common use cases and should be straightforward to use.
khepri:insert([stock, wood, <<"lime tree">>], 150),
Ret = khepri:get([stock, wood, <<"lime tree">>]),
{ok, #{[stock, wood, <<"lime tree">>] :=
       #{child_list_count := 0,
         child_list_version := 1,
         data := 150,
         payload_version := 1}}} = Ret,
true = khepri:exists([stock, wood, <<"lime tree">>]),
khepri:delete([stock, wood, <<"lime tree">>]).
The high-level API is built on top of a low-level API. The low-level API is
provided by the khepri_machine
module.
The low-level API provides just a handful of primitives. More advanced or specific use cases may need to rely on that low-level API.
%% Unlike the high-level API's `khepri:insert/2' function, this low-level
%% insert returns whatever it replaced (if anything). In this case, there was
%% nothing before, so the returned value is pretty empty.
Ret1 = khepri_machine:put(StoreId, [stock, wood, <<"lime tree">>], ?DATA_PAYLOAD(150)),
{ok, #{}} = Ret1,
Ret2 = khepri_machine:get(StoreId, [stock, wood, <<"lime tree">>]),
{ok, #{[stock, wood, <<"lime tree">>] :=
       #{child_list_count := 0,
         child_list_version := 1,
         data := 150,
         payload_version := 1}}} = Ret2,
%% Unlike the high-level API's `khepri:delete/2' function, this low-level
%% delete returns whatever it deleted.
Ret3 = khepri_machine:delete(StoreId, [stock, wood, <<"lime tree">>]),
{ok, #{[stock, wood, <<"lime tree">>] :=
       #{child_list_count := 0,
         child_list_version := 1,
         data := 150,
         payload_version := 1}}} = Ret3.
It is possible to have multiple database instances running on the same node or cluster.
By default, Khepri starts a default store, based on Ra's default system.
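As a sketch, starting and using the default store could look like the following; the exact startup functions and their return values may differ between versions, so check the khepri module documentation:

%% Start the default store, backed by Ra's default system (khepri:start/0 is
%% assumed here; see the khepri module for the actual startup functions).
{ok, StoreId} = khepri:start(),

%% The returned store ID can then be passed to the low-level API.
khepri_machine:get(StoreId, [stock, wood, <<"oak">>]).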
On the surface, Khepri transactions look like Mnesia ones: they are anonymous functions which can do any arbitrary operations on the data and return any result. If something goes wrong or the anonymous function aborts, nothing is committed and the database is left untouched as if the transaction code was never called.
Under the hood, there are several restrictions and caveats that need to be understood in order to use transactions in Khepri. In particular, the nature of the anonymous function, i.e. whether it only reads the tree or also modifies it, is passed as the ReadWrite argument to the khepri:transaction/3 or khepri_machine:transaction/3 functions.
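For illustration only, a read-write transaction might look like the sketch below; the khepri_tx function names, the exact form of the ReadWrite argument (assumed here to be the atom rw) and the shape of the return value are assumptions to verify against the khepri and khepri_tx module documentation:

%% Decrement the oak count atomically, aborting if the entry is missing.
%% Note: whether khepri_tx:put/2 expects a raw term or a ?DATA_PAYLOAD()
%% wrapper may depend on the Khepri version.
Fun = fun() ->
          Path = [stock, wood, <<"oak">>],
          case khepri_tx:get(Path) of
              {ok, #{Path := #{data := Count}}} ->
                  khepri_tx:put(Path, ?DATA_PAYLOAD(Count - 1));
              _ ->
                  khepri_tx:abort(out_of_stock)
          end
      end,
Ret = khepri:transaction(StoreId, Fun, rw).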
The Raft algorithm is used to achieve consensus among Khepri members participating in the database. Khepri is a state machine executed on each Ra node and all instances of that Khepri state machine start with the same state and modify it identically. The goal is that, after the same list of Ra commands, all instances have the same state.
When a new Ra node joins the cluster and therefore participates in the Khepri database, it starts a new Khepri state machine instance. This instance needs to apply all Ra commands from an initial state to reach the same state as the other existing instances.
Likewise, if one of the Khepri state machine instances loses the connection to the other members and cannot apply Ra commands for a while, it has to catch up once the link comes back.
All this means that the code which modifies the state of the state machines (i.e. the tree) needs to run on all instances, possibly not at the same time, and give the exact same result everywhere.
To achieve that, khepri_fun
and khepri_tx
extract the assembly
code of the anonymous function and create a standalone Erlang module based on
it. This module can be stored in Ra's log and executed anywhere without the
presence of the initial anonymous function's module.