Exmld (exmld v0.1.4)

Exmld allows items extracted from Kinesis stream records (or from sub-records within a KPL aggregate record) to be processed by a pipeline of workers whose size can differ from the number of shards owned by the current node (one worker per owned shard being the normal processing model offered by erlmld).

This is beneficial when using aggregate records whose sub-records can be processed in approximate order according to their partition keys, as opposed to strict order based on the shards they arrived on. For example, suppose the following two Kinesis records are received on two different shards:

Record 1 (a KPL aggregate record)
  - partition key: "xyzzy"
  - subrecord a:
    - partition key: "asdf"
    - value: "12345"
  - subrecord b:
    - partition key: "fdsa"
    - value: "54321"

Record 2 (a KPL aggregate record)
  - partition key: "qwer"
  - subrecord a:
    - partition key: "asdf"
    - value: "23456"
  - subrecord b:
    - partition key: "z"
    - value: "0"

Using the normal Kinesis processing paradigm, each shard will be processed in order. erlmld supports this by spawning a process for each owned shard, which handles each record seen on the shard in sequence:

Worker 1:
  1. handle record "xyzzy"
    a. handle sub-record "asdf"
    b. handle sub-record "fdsa"

Worker 2:
  1. handle record "qwer"
    a. handle sub-record "asdf"
    b. handle sub-record "z"

This can fail to make use of all available resources, since the maximum concurrency is limited by the number of owned shards. If the application can tolerate handling sub-records in a non-strict order, it can instead use a Flow-based MapReduce-style scheme:

[Worker 1]  [Worker 2]     (processes which produce Kinesis records)
    |           |
    v           v
[Exmld.KinesisStage, ...]  (stages receiving Exmld.KinesisWorker.Datums)
          |
          v
    [M1] .... [Mn]  (mappers which extract items)
      |\       /|
      | \     / |
      |  \   /  |
      |   \ /   |
      |    \    |
      |   / \   |
      |  /   \  |
      | /     \ |
      |/       \|
    [R1] .... [Rn]  (reducers which handle extracted items)
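
The top of this topology can be built by creating a flow directly from the stage processes. A minimal sketch, assuming stage_1 and stage_2 are pids of already-running Exmld.KinesisStage processes (the names are hypothetical):

    # Each Exmld.KinesisStage acts as a GenStage producer of
    # Exmld.KinesisWorker.Datum values; Flow.from_stages/1 turns a list
    # of such producers into a flow suitable for Exmld.flow/6.
    flow = Flow.from_stages([stage_1, stage_2])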

The number of reducers is configurable and defaults to the number of schedulers online. The processing application specifies a means of extracting a partition key from each extracted item; these keys are used to consistently map items to reducers (which is where the actual application work occurs).
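
Conceptually, the key-to-reducer mapping behaves like hash partitioning, so equal keys always land on the same reducer. A rough sketch of the idea (the partitioning is actually handled internally by Flow; extracted_key and num_reducers are illustrative names):

    # Items whose extracted keys hash to the same value are always
    # routed to the same reducer:
    reducer_index = :erlang.phash2(extracted_key, num_reducers)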

Using the above example and specifying a sub-record’s partition key as an item key:

  1. Worker 1 will produce the “asdf” and “fdsa” sub-records from outer record “xyzzy” and send them to a pre-configured Exmld.KinesisStage (or round-robin to a list of such stages).

  2. Worker 2 will similarly produce the “asdf” and “z” sub-records from outer record “qwer”.

  3. Each receiving stage will wrap and forward these sub-records for handling by the flow.

  4. The application will have provided an “identity” item extraction function, since KPL aggregation is being used here and each datum already corresponds to a single sub-record (i.e., a function accepting one record and returning a one-element list containing it).

  5. The application will have provided a partition key extraction function which returns an appropriate partition key to be used in consistently mapping items to reducers.

  6. The first received “asdf” sub-record is provided to some reducer Rx. The second received “asdf” sub-record is provided to the same reducer since its extracted key has the same hash.

  7. The “fdsa” and “z” sub-records are similarly provided to some reducer Ry and/or Rz based on the hashes of their partition keys.

  8. The application-provided reducer function notifies each originating stage of the disposition of processing for items received from it as processing progresses.

  9. Eventually, processing disposition is provided back to the originating workers, which can decide whether or not (and where) to checkpoint, as in the sketch following this list.
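
Putting the pieces together, a minimal end-to-end sketch (the module and helper names are hypothetical, the stubbed helpers stand in for real application logic, and the disposition-notification mechanism is described only in comments since its exact API is application-specific):

    defmodule MyApp.Pipeline do
      # `stages` is assumed to be a list of pids of running
      # Exmld.KinesisStage processes being fed datums by erlmld workers.
      def start(stages) do
        stages
        |> Flow.from_stages()
        |> Exmld.flow(
          # extract_items_fn: KPL aggregation is in use, so each datum
          # already corresponds to one sub-record; "identity" extraction
          # returns it in a one-element list.
          fn datum -> [datum] end,
          # partition_key: items with equal keys are consistently mapped
          # to the same reducer.
          &item_partition_key/1,
          # state0: initial state for each reducer.
          fn -> %{handled: 0} end,
          # process_fn: the actual application work happens here; it
          # should also report each item's processing disposition back to
          # its originating stage so the workers can eventually checkpoint.
          fn datum, state ->
            handle_item(datum)
            %{state | handled: state.handled + 1}
          end
        )
        |> Flow.start_link()
      end

      # Hypothetical helper: how a sub-record's partition key is recovered
      # from a datum depends on the application's record format.
      defp item_partition_key(datum), do: datum

      # Hypothetical helper standing in for real application work.
      defp handle_item(_datum), do: :ok
    end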


Types

checkpoint() :: {:checkpoint, term()}
item() :: any()
partition_key() :: any()
reducer_state() :: any()
sequence_number() :: {:sequence_number, term(), term(), term(), term()}
shard_id() :: binary()
stream_record() :: {:stream_record, term(), term(), term(), term(), term()}

Functions

checkpoint(args \\ [])    (macro)
checkpoint(record, args)    (macro)

flow(flow, extract_items_fn, partition_key, state0, process_fn, opts \\ [])

flow(
  flow :: Flow.t(),
  extract_items_fn :: (Exmld.KinesisWorker.Datum -> [item()]),
  partition_key :: {:elem, non_neg_integer()} | {:key, atom()} | (item() -> partition_key()),
  state0 :: (() -> reducer_state()),
  process_fn :: (item(), reducer_state() -> reducer_state()),
  opts :: keyword()
) :: Flow.t()

Accepts a flow producing Exmld.KinesisWorker.Datum values (e.g., a flow created from a set of Exmld.KinesisStage processes) and returns another flow. The returned flow can be used to keep building the processing pipeline after flow/6 has been called.
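
For example, the result can be composed with ordinary Flow operations before being started (a sketch; stages and the four callback arguments are assumed to have been built as in the pipeline sketch above):

    stages
    |> Flow.from_stages()
    |> Exmld.flow(extract_items_fn, partition_key, state0, process_fn)
    |> Flow.start_link()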

sequence_number(args \\ [])    (macro)
sequence_number(record, args)    (macro)
stream_record(args \\ [])    (macro)
stream_record(record, args)    (macro)