Exmld.KinesisWorker (exmld v1.0.4)
An erlmld_flusher which can interface with an Exmld.KinesisStage data source.
This implements an erlmld_flusher which can be used by erlmld_batch_processor.
Unlike a typical erlmld_flusher, it has a different notion of fullness: if more than :max_pending items are in flight, the worker waits for all pending items before emitting any more for downstream processing. A periodic flush interval should be configured in the batch processor options. Similarly, the downstream stage processing pipeline should not require any kind of "full" condition and should periodically make progress (i.e., emit/flush output) even if no more records are sent.
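The fullness rule above can be illustrated with a minimal sketch. The module and function names here are hypothetical and not part of exmld; the sketch only assumes that in-flight items are tracked in a map, as the pending field of the worker state suggests.

```elixir
defmodule FullnessSketch do
  # Hypothetical helper, not part of exmld.
  # `pending` is a map of in-flight items; `max_pending` mirrors the
  # :max_pending option. The worker is "full" only when MORE than
  # max_pending items are in flight.
  def must_wait?(pending, max_pending)
      when is_map(pending) and is_integer(max_pending) do
    map_size(pending) > max_pending
  end
end
```

Under this reading, a worker holding exactly :max_pending items would still accept another record; only exceeding the limit triggers the wait.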
Heartbeat items are sent while the worker is waiting for pending items to be completed; these include varying counters to allow them to be automatically distributed among downstream reducers.
One worker process will exist for each stream shard owned by the current node. Each such process will have been configured with a set of downstream Exmld.KinesisStages which can receive records from it (actually Exmld.KinesisWorker.Datums); those stages will be part of a data processing Flow.t. Eventually, the disposition of each record's processing will propagate back to the originating worker (as return values from GenStage.call/3).
Periodically, erlmld_batch_processor will request a flush. If the flush kind is :partial, we return the tokens associated with the records which have already been fully processed. Otherwise, the flush kind is :full and we await the disposition of every outstanding record before returning.
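The difference between the two flush kinds can be modeled with a small sketch. This is not the exmld implementation: the module name, the plain-map state, and the `{:ok, state, tokens}` return shape are assumptions made for illustration, and awaiting is replaced by a stand-in that treats every pending item as completing successfully.

```elixir
defmodule FlushSketch do
  # Hypothetical model of the two flush kinds; not exmld source.
  # State is a plain map: %{done: [token], pending: %{key => token}}.

  # :partial returns whatever has already completed, without waiting.
  def flush(state, :partial), do: take_done(state)

  # :full first waits for every outstanding item, then returns all tokens.
  def flush(state, :full), do: state |> await_all() |> take_done()

  # Hand back the completed tokens and clear them from the state.
  defp take_done(%{done: done} = state), do: {:ok, %{state | done: []}, done}

  # Stand-in for awaiting dispositions: a real worker would poll or block
  # until downstream stages report each item's outcome.
  defp await_all(%{pending: pending, done: done} = state) do
    %{state | pending: %{}, done: done ++ Map.values(pending)}
  end
end
```

In this model a partial flush leaves pending items untouched, while a full flush always returns with an empty pending set.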
If processing of any record (or item extracted therefrom) fails, the worker will crash unless it's configured to ignore processing errors.
Records presented to this worker may be ordinary records or sub-records extracted from a containing KPL-aggregated record. If KPL aggregation is not being used, but smaller sub-items are later extracted by the stage processing pipeline, the pipeline should create fake sub-record sequence numbers to track the disposition of those items (and sub-record checkpointing should be turned off).
Periodically (which should be at some multiple of the periodic flush interval), erlmld_batch_processor will checkpoint based on the records which have so far been successfully processed (those whose tokens have been returned from flush/2).
Summary
Functions
Submit a new Kinesis record to the downstream pipeline for processing.
Return a list of tokens corresponding to records which have been fully processed and the latest state.
The batch processor has received a possibly-empty set of records from the MultiLangDaemon and is informing the flusher (us). Send a heartbeat downstream and return any completed tokens. This allows progress to be made even if no more records appear on the stream.
Initialize worker state with a shard id and a set of options.
Types
@type flusher_token() :: any()
@type t() :: %Exmld.KinesisWorker{
  await_sleep_interval: non_neg_integer(),
  counter: non_neg_integer(),
  done: [flusher_token()],
  error_callback: (t(), [Exmld.KinesisWorker.Disposition.t()] -> any()) | nil,
  errors: non_neg_integer(),
  heartbeats: non_neg_integer(),
  max_pending: pos_integer(),
  on_duplicate: :exit | :skip,
  opaque: any(),
  pending: %{
    optional({Exmld.shard_id(), Exmld.sequence_number()}) =>
      flusher_token() | {flusher_token(), [{non_neg_integer(), non_neg_integer()}]}
  },
  shard_id: Exmld.shard_id(),
  skip_errors: boolean(),
  stages: [any()]
}
Functions
Submit a new Kinesis record to the downstream pipeline for processing.
A new Kinesis record is available for processing, and erlmld_batch_processor is instructing us to add it to the current batch. Since we really have no notion of a batch, we immediately choose a downstream stage and notify it of a new Exmld.KinesisWorker.Datum containing the record and make a note of it being in-flight. That call will block until a further-downstream consumer receives the record as a flow event.
The result of that call will be an updated list of item dispositions. Unless configured to skip records which failed to be processed, we crash if any failed. Otherwise we update the set of done/pending items and return an updated state.
Return a list of tokens corresponding to records which have been fully processed and the latest state.
If the flush kind is :full, we await the disposition of all outstanding records before returning. Otherwise, it's :partial and we return immediately (possibly with an empty result).
If doing a full flush and any records fail to be successfully processed, we crash unless configured to skip failed records.
The batch processor has received a possibly-empty set of records from the MultiLangDaemon and is informing the flusher (us). Send a heartbeat downstream and return any completed tokens. This allows progress to be made even if no more records appear on the stream.
Initialize worker state with a shard id and a set of options.
An erlmld_batch_processor is initializing processing on shard_id and providing the flusher_mod_data which was passed to it, which should be an enumerable of keywords containing the following options; we return a flusher state to be used in subsequent operations.
Options
All optional unless marked required:
:stages - (required) list of GenStages (values useable as first arg to GenStage.call/3) which can receive Exmld.KinesisWorker.Datums
:opaque - opaque term passed in each Exmld.KinesisWorker.Datum
:skip_errors - boolean indicating whether errors are non-fatal (if false, crash on error).
:max_pending - maximum number of pending items which can be in flight.
:await_sleep_interval - sleep time between checks while awaiting pending items.
:error_callback - nil or an arity-2 function called with state and failure dispositions when processing failures occur.
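As an illustration, the options above might be assembled into a keyword list like the following. The stage names, the opaque term, and all numeric values are hypothetical placeholders; in particular, the unit of :await_sleep_interval is not stated on this page, so the value shown should not be read as a recommended setting.

```elixir
# Hypothetical stage names; any value usable as the first argument of
# GenStage.call/3 (pid, registered name, etc.) would serve here.
stages = [MyApp.Stage0, MyApp.Stage1]

flusher_opts = [
  # Required: downstream stages which can receive Datums.
  stages: stages,
  # Arbitrary term passed along in each Datum (placeholder content).
  opaque: %{stream: "example-stream"},
  # false means a processing error crashes the worker.
  skip_errors: false,
  # Illustrative limit on in-flight items.
  max_pending: 1_000,
  # Illustrative value; the unit is an assumption, not documented here.
  await_sleep_interval: 500,
  # Arity-2 callback receiving the worker state and failure dispositions.
  error_callback: fn _state, failures ->
    IO.puts("#{length(failures)} item(s) failed processing")
  end
]
```

A list of this shape would be supplied as the flusher_mod_data given to erlmld_batch_processor, which passes it on when initializing the worker for a shard.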