Exmld.KinesisWorker (exmld v1.0.4)

An erlmld_flusher which can interface with an Exmld.KinesisStage data source.

This implements an erlmld_flusher which can be used by erlmld_batch_processor. Unlike a typical erlmld_flusher, it has a different notion of fullness: if more than :max_pending items are in flight, the worker waits for all pending items before emitting any more for downstream processing. A periodic flush interval should be configured in the batch processor options. Similarly, the downstream stage processing pipeline should not require any kind of "full" condition and should periodically make progress (i.e., emit/flush output) even if no more records are sent.

Heartbeat items are sent while the worker is waiting for pending items to be completed; these include varying counters to allow them to be automatically distributed among downstream reducers.

One worker process will exist for each stream shard owned by the current node. Each such process will have been configured with a set of downstream Exmld.KinesisStages which can receive records from it (actually Exmld.KinesisWorker.Datums); those stages will be part of a data processing Flow.t. Eventually, the disposition of each record's processing will propagate back to the originating worker (as return values from GenStage.call/3).
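As a rough illustration of this arrangement, the sketch below wires a list of already-running stage processes into a Flow using Flow.from_stages/2 from the Flow library. The stage_pids and process/1 names are hypothetical, and the exact wiring (including how each item's disposition is reported back to its stage) is handled by exmld itself and may differ:

# Hypothetical sketch only: assume stage_pids is a list of running
# Exmld.KinesisStage processes, i.e. values usable as the first argument
# to GenStage.call/3 and as producers for a Flow.
flow =
  stage_pids
  |> Flow.from_stages()
  |> Flow.map(fn %Exmld.KinesisWorker.Datum{} = datum ->
    # application-specific processing of each record or extracted sub-item;
    # the step reporting the item's disposition back to its stage is omitted here.
    process(datum)
  end)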

Periodically, erlmld_batch_processor will request a flush. If the flush kind is :partial, we return the tokens associated with the records which have already been fully processed. Otherwise, the flush kind is :full and we await the disposition of every outstanding record before returning.

If processing of any record (or item extracted therefrom) fails, the worker will crash unless it's configured to ignore processing errors.

Records presented to this worker may be ordinary records or sub-records extracted from a containing KPL-aggregated record. If KPL aggregation is not being used, but smaller sub-items are later extracted by the stage processing pipeline, the pipeline should create fake sub-record sequence numbers to track the disposition of those items (and sub-record checkpointing should be turned off).

Periodically (which should be at some multiple of the periodic flush interval), erlmld_batch_processor will checkpoint based on the records which have so far been successfully processed (those whose tokens have been returned from flush/2).

Summary

Functions

Submit a new Kinesis record to the downstream pipeline for processing.

Return a list of tokens corresponding to records which have been fully processed, along with the latest state.

The batch processor has received a possibly-empty set of records from the MultiLangDaemon and is informing the flusher (us). Send a heartbeat downstream and return any completed tokens. This allows progress to be made even if no more records appear on the stream.

Initialize worker state with a shard id and a set of options.

Types

@type flusher_token() :: any()
@type t() :: %Exmld.KinesisWorker{
  await_sleep_interval: non_neg_integer(),
  counter: non_neg_integer(),
  done: [flusher_token()],
  error_callback: (t(), [Exmld.KinesisWorker.Disposition.t()] -> any()) | nil,
  errors: non_neg_integer(),
  heartbeats: non_neg_integer(),
  max_pending: pos_integer(),
  on_duplicate: :exit | :skip,
  opaque: any(),
  pending: %{
    optional({Exmld.shard_id(), Exmld.sequence_number()}) =>
      flusher_token()
      | {flusher_token(), [{non_neg_integer(), non_neg_integer()}]}
  },
  shard_id: Exmld.shard_id(),
  skip_errors: boolean(),
  stages: [any()]
}
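
As a hedged illustration of the state shape above, a small monitoring helper (hypothetical, not part of exmld) might summarize a few of these fields:

# Hypothetical helper: summarize a worker state for logging or metrics.
defmodule MyApp.WorkerStats do
  def summarize(%Exmld.KinesisWorker{} = state) do
    %{
      shard_id: state.shard_id,
      pending: map_size(state.pending),  # records still awaiting disposition
      done: length(state.done),          # tokens ready to be returned by flush/2
      errors: state.errors,              # processing failures observed so far
      heartbeats: state.heartbeats       # heartbeats sent while awaiting pending items
    }
  end
end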

Functions

add_record(state, record, token)

Submit a new Kinesis record to the downstream pipeline for processing.

A new Kinesis record is available for processing, and erlmld_batch_processor is instructing us to add it to the current batch. Since we really have no notion of a batch, we immediately choose a downstream stage, notify it of a new Exmld.KinesisWorker.Datum containing the record, and note that the record is in flight. That call will block until a further-downstream consumer receives the record as a flow event.

The result of that call will be an updated list of item dispositions. Unless configured to skip records which failed to be processed, we crash if any failed. Otherwise we update the set of done/pending items and return an updated state.

Return a list of tokens corresponding to records which have been fully processed, along with the latest state.

If the flush kind is :full, we await the disposition of all outstanding records before returning. Otherwise, the flush kind is :partial and we immediately return whatever tokens have been completed so far (possibly none).

If doing a full flush and any records fail to be successfully processed, we crash unless configured to skip failed records.

The batch processor has received a possibly-empty set of records from the MultiLangDaemon and is informing the flusher (us). Send a heartbeat downstream and return any completed tokens. This allows progress to be made even if no more records appear on the stream.

Initialize worker state with a shard id and a set of options.

An erlmld_batch_processor is initializing processing on shard_id and providing the flusher_mod_data which was passed to it, which should be an enumerable of keywords containing the following options; we return a flusher state to be used in subsequent operations.

Options

All optional unless marked required:

  • :stages - (required) list of GenStages (values usable as the first argument to GenStage.call/3) which can receive Exmld.KinesisWorker.Datums
  • :opaque - opaque term passed in each Exmld.KinesisWorker.Datum
  • :skip_errors - boolean indicating whether errors are non-fatal (if false, crash on error).
  • :max_pending - maximum number of pending items which can be in flight.
  • :await_sleep_interval - sleep time between checks while awaiting pending items.
  • :error_callback - nil or an arity-2 function called with state and failure dispositions when processing failures occur.
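
For concreteness, a hedged example of the kind of keyword list an erlmld_batch_processor could be given as flusher_mod_data for this module; the stage names, values, and callback body are illustrative only, and :await_sleep_interval is assumed to be in milliseconds:

# Illustrative flusher_mod_data for Exmld.KinesisWorker (names and values
# are hypothetical; supervision/startup of the stages is not shown).
flusher_opts = [
  stages: [MyApp.Stage1, MyApp.Stage2],   # required: usable as first arg to GenStage.call/3
  opaque: %{stream: "my-stream"},         # passed through in each Exmld.KinesisWorker.Datum
  skip_errors: false,                     # crash if any record fails processing
  max_pending: 1_000,                     # max in-flight items before awaiting completion
  await_sleep_interval: 100,              # sleep between checks while awaiting (assumed ms)
  error_callback: fn state, dispositions ->
    # called with the worker state and the failing dispositions
    IO.warn("processing failures on shard #{state.shard_id}: #{inspect(dispositions)}")
  end
]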