dist_agent v0.1.1 — DistAgent

An Elixir framework for running a distributed, fault-tolerant variant of Agent.


Overview

dist_agent is an Elixir library (or framework) for running many “distributed agents” within a cluster of Erlang VM nodes. A “distributed agent” has the following in common with Elixir’s Agent:

  • support for arbitrary data structure
  • synchronous communication pattern

Beyond what Agent provides, a “distributed agent” offers the following features:

  • synchronous state replication within multiple nodes for fault tolerance
  • location transparency using logical agent IDs
  • automatic placement and migration of processes for load balancing and failover
  • agent lifecycle management (activated on the first command, deactivated after a period of inactivity)
  • upper limit (“quota”) on the number of distributed agents, for multi-tenant use cases
  • optional rate limiting on incoming messages to each distributed agent
  • low-resolution timer, similar to GenServer’s :timeout, for distributed agents

Concepts

  • Distributed agent

    • A distributed agent represents a state and its associated behaviour. It can also take autonomous actions using the tick mechanism (explained below).
    • Each distributed agent is identified by the following triplet:

      • quota_id: a String.t that specifies the quota which this distributed agent belongs to
      • module: a callback module of DistAgent.Behaviour
      • key: an arbitrary String.t that uniquely identifies the distributed agent within the same quota_id and module
    • The behaviour of a distributed agent is defined by the module part of its identity.

      • The callbacks are divided into “pure” ones and “side-effecting” ones (see the sketch after this list).
    • A distributed agent is “activated” (initialized) when DistAgent.command/5 is called with a nonexistent ID.
    • A distributed agent is “deactivated” (removed from memory) when it’s told to do so by the callback.
  • Quota

    • A quota defines a soft upper limit on the number of distributed agents that can run within it.
    • Each quota is identified by a quota_id (String.t).
    • Each distributed agent belongs to exactly one quota; a quota must be created before activating distributed agents within it.
  • Tick

    • Ticks are periodic events which all distributed agents receive.
    • Ticks are emitted by a limited number of dedicated processes (whose sole task is to periodically emit ticks), thus reducing the number of timers that have to be maintained.
    • Each distributed agent specifies “what to do on subsequent ticks” in its callbacks’ return values:

      1. do nothing
      2. trigger a timeout after the specified number of ticks (i.e., use it as a low-resolution timer)
      3. deactivate itself when it has received the specified number of ticks without client commands
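
To make the identity triplet and the pure/side-effecting split concrete, here is a minimal sketch of a callback module. Only the callback names and arities (handle_query/2, handle_command/2, after_query/3, after_command/4) are taken from this page; the argument order and return value shapes below are assumptions, and DistAgent.Behaviour remains the authoritative reference.

defmodule MyApp.Counter do
  @behaviour DistAgent.Behaviour

  # "Pure" callbacks: compute a reply (and, for commands, the next state).
  # Return shapes here are illustrative assumptions, not the actual spec.
  @impl true
  def handle_query(count, :get), do: count

  @impl true
  def handle_command(nil, :increment), do: {:ok, 1}            # first command activates the agent
  def handle_command(count, :increment), do: {:ok, count + 1}
  # (the return values would also convey what to do on subsequent ticks; omitted in this sketch)

  # "Side-effecting" callbacks: invoked after the pure ones, e.g. for logging.
  @impl true
  def after_query(_key, :get, _reply), do: :ok

  @impl true
  def after_command(_key, :increment, _reply, _new_state), do: :ok
end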

Design

Raft protocol and libraries

dist_agent heavily depends on the Raft consensus protocol for synchronous replication and failover. The core protocol is implemented in rafted_value, and the cluster management and fault tolerance mechanisms are provided by raft_fleet.

Although Raft consensus groups provide an important building block for distributed agents, it’s not obvious how the concept of “distributed agents” should be mapped to consensus groups. It can be easily seen that the following two extremes are not optimal for a wide range of use cases:

  • only 1 consensus group for all distributed agents

    • Not scalable for a large number of agents; the leader process obviously becomes the bottleneck.
  • consensus group (which typically consists of 3 processes in 3 nodes) per distributed agent

    • The cost of timers and healthchecks scales linearly with the number of consensus groups; with many agents, CPU resources are wasted just maintaining consensus groups.

And, of course, the number of distributed agents in a system changes over time. We take the approach that

  • each consensus group hosts multiple distributed agents, and
  • the number of consensus groups is dynamically adjusted according to the current load.

This dynamic “sharding” of distributed agents, together with the agent-ID-based data model, is provided by raft_kv. This design introduces a potential problem: a distributed agent can be blocked by a long-running operation of another agent that happens to reside in the same consensus group. It is the responsibility of implementers of the callback modules for distributed agents to ensure that the query/command/timeout handlers don’t take a long time.

Even with the reduced number of consensus groups described above, state replication and healthchecks involve a high rate of inter-node communication. To reduce network traffic and TCP overhead, remote communications between nodes are batched with the help of batched_communication.

Since establishing consensus (committing a command) in the Raft protocol requires round trips to remote nodes, it is a relatively expensive operation. In order not to overwhelm Raft member processes, accesses to each agent may be rate-limited using the token bucket algorithm. Rate limiting, when enabled, is imposed on a per-node basis: in each node there is one bucket per distributed agent. We use foretoken as the token bucket implementation.
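
As a rough illustration of the bucket semantics (the argument order and return values of Foretoken.take/4 shown here are assumptions; consult foretoken’s own docs):

# One token refilled every 100 ms, at most 10 tokens in the bucket; take 3 (a command's cost).
case Foretoken.take("some_bucket", 100, 10, 3) do
  :ok                      -> :run_the_command
  {:error, millis_to_wait} -> {:retry_after, millis_to_wait}  # surfaces as {:rate_limit_reached, millis}
end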

Quota management

The current statuses of all quotas are managed by a special Raft consensus group named DistAgent.Quota. Its internal state consists of

  • %{node => {%{quota_id => count}, time_reported}}
  • %{quota_id => limit}.

When adding a new distributed agent, the upper limit is checked by consulting this Raft consensus group. The %{quota_id => count} map reported from each node is valid for 15 minutes; counts from removed or unreachable nodes are thus automatically cleaned up.
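
For illustration, the two maps might look as follows (the node names and quota IDs are hypothetical, and the representation of time_reported is left abstract):

%{
  :"app@host1" => {%{"tenant_a" => 120, "tenant_b" => 7}, time_reported_1},
  :"app@host2" => {%{"tenant_a" => 95},                   time_reported_2}
}
%{"tenant_a" => 1000, "tenant_b" => 50}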

In each node, a GenServer named DistAgent.Quota.Reporter periodically aggregates the number of distributed agents, queried from the consensus leader processes residing in that node, and publishes the aggregated value to DistAgent.Quota.

The quota is checked only when creating a new distributed agent, i.e., the quota limit is checked on receipt of the first message to a distributed agent. An already-created distributed agent is never blocked or stopped due to the quota limit; in particular, agent migration and failover are not affected.
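
A hedged sketch of the resulting call order (put_quota/2, referenced below, is assumed to take the quota name and its limit; the quota name, module and key are hypothetical):

:ok = DistAgent.put_quota("tenant_a", 1000)
# The first command to a nonexistent ID activates the agent; the quota is checked here.
{:ok, _reply} = DistAgent.command("tenant_a", MyApp.Counter, "counter_1", :increment)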

Things dist_agent won’t do

Currently we have no plan to:

  • provide an API to efficiently retrieve the list of active distributed agents
  • provide something like the links/monitors that Erlang processes have

Summary

Functions

Sends a command to the specified distributed agent and receives a reply from it

Deletes the limit of the specified quota

Initializes :dist_agent

Lists existing quota names

Adds or updates the limit of the specified quota

Sends a read-only query to the specified distributed agent and receives a reply from it

Returns a pair of the current number of distributed agents in the specified quota and its upper limit

Types

init_option()
init_option() ::
  {:rv_options, [RaftedValue.option()]}
  | {:split_merge_policy, RaftKV.SplitMergePolicy.t()}
option()
option() ::
  RaftKV.option()
  | {:rate_limit,
     nil
     | {milliseconds_per_token :: pos_integer(), max_tokens :: pos_integer()}}

Functions

command(quota_name, callback_module, agent_key, command, options \\ [])
command(
  DistAgent.Quota.Name.t(),
  module(),
  String.t(),
  DistAgent.Behaviour.command(),
  [option()]
) ::
  {:ok, DistAgent.Behaviour.ret()}
  | {:error,
     :quota_limit_reached
     | :quota_not_found
     | {:rate_limit_reached, milliseconds_to_wait :: pos_integer()}
     | :no_leader}

Sends a command to the specified distributed agent and receives a reply from it.

The target distributed agent is specified by the triplet: quota_name, callback_module and agent_key. If the agent has not yet been activated, it is activated before applying the command. During activation of the new agent, the quota limit is checked (which involves an additional message round trip). As such, the quota must have been created (by put_quota/2) before calling this function. When the quota limit is violated, {:error, :quota_limit_reached} is returned.

On receipt of the command, the distributed agent evaluates the DistAgent.Behaviour.handle_command/2 callback and then the DistAgent.Behaviour.after_command/4 callback. For the semantics of the arguments and return values of the callbacks, refer to DistAgent.Behaviour.

Options

The last argument to this function is a list of options. Most of the options are passed directly to the underlying function (RaftKV.command/4). However, unlike RaftKV.command/4, the :call_module option defaults to BatchedCommunication in this function.

You may also pass the :rate_limit option to enable the per-node rate limiting feature. The value part of the :rate_limit option must be a pair of positive integers. Executing a command consumes 3 tokens from the corresponding bucket (as a command is generally more expensive than a query). See also Foretoken.take/4.
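
For example (a sketch with hypothetical quota name, module and key; {100, 10} means one token per 100 ms, at most 10 tokens in the bucket):

case DistAgent.command("tenant_a", MyApp.Counter, "counter_1", :increment, rate_limit: {100, 10}) do
  {:ok, reply}                            -> reply
  {:error, :quota_limit_reached}          -> :over_quota
  {:error, {:rate_limit_reached, millis}} -> {:retry_after, millis}
  {:error, other}                         -> {:error, other}
end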

delete_quota(quota_name)
delete_quota(DistAgent.Quota.Name.t()) :: :ok

Deletes the limit of the specified quota.

init(init_options \\ [])
init([init_option()]) :: :ok

Initializes :dist_agent.

Note that :dist_agent requires that you complete the following initialization steps before calling this function:

  1. connect to the other existing nodes in the cluster
  2. call RaftFleet.activate/1
  3. call RaftKV.init/0
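
A sketch of this sequence (the node name and the zone argument to RaftFleet.activate/1 are hypothetical):

Node.connect(:"app@host2")    # 1. connect to the other existing nodes in the cluster
RaftFleet.activate("zone1")   # 2. activate raft_fleet on this node
RaftKV.init()                 # 3. initialize raft_kv
:ok = DistAgent.init()        # then initialize :dist_agent itself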

Lists existing quota names.

Adds or updates the limit of the specified quota.

query(quota_name, callback_module, agent_key, query, options \\ [])
query(
  DistAgent.Quota.Name.t(),
  module(),
  String.t(),
  DistAgent.Behaviour.query(),
  [option()]
) ::
  {:ok, DistAgent.Behaviour.ret()}
  | {:error,
     :agent_not_found
     | {:rate_limit_reached, milliseconds_to_wait :: pos_integer()}
     | :no_leader}

Sends a read-only query to the specified distributed agent and receives a reply from it.

The target distributed agent is specified by the triplet: quota_name, callback_module and agent_key. If the agent has not yet been activated, {:error, :agent_not_found} is returned.

On receipt of the query, the distributed agent evaluates the DistAgent.Behaviour.handle_query/2 callback and then the DistAgent.Behaviour.after_query/3 callback. For the detailed semantics of the arguments and return values of the callbacks, refer to DistAgent.Behaviour.

Options

The last argument to this function is a list of options. Most of the options are passed directly to the underlying function (RaftKV.query/4). However, unlike RaftKV.query/4, the :call_module option defaults to BatchedCommunication in this function.

You may also pass the :rate_limit option to enable the per-node rate limiting feature. The value part of the :rate_limit option must be a pair of positive integers. Executing a query consumes 1 token from the corresponding bucket. See also Foretoken.take/4.
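
For example (again with hypothetical names; a query costs only 1 token):

case DistAgent.query("tenant_a", MyApp.Counter, "counter_1", :get, rate_limit: {100, 10}) do
  {:ok, count}                            -> count
  {:error, :agent_not_found}              -> nil
  {:error, {:rate_limit_reached, millis}} -> {:retry_after, millis}
end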

Returns a pair of the current number of distributed agents in the specified quota and its upper limit.