Magika.Features (Magika v0.1.0-rc.0)


Feature extraction (v2) from a file's content.

This is a faithful port of Magika's _extract_features_from_seekable logic:

  • Read at most block_size bytes from the beginning and from the end.
  • Strip ASCII whitespace (lstrip for the beginning, rstrip for the end).
  • Take beg_size bytes from the start and end_size bytes from the end, padding with padding_token when there are not enough bytes.

The resulting feature vector is beg ++ end, a list of beg_size + end_size integers, suitable for feeding to the model.
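The three steps above can be sketched in plain Elixir as follows. This is a minimal illustration, not the module's actual source: the module name FeaturesSketch is hypothetical, the config is a plain map rather than a Magika.Config struct, and the values used (block_size 8, beg_size 4, end_size 4, padding_token 256) are made up for the example. The padding position (trailing for beg, leading for end) is an assumption based on a reading of the reference implementation, and the whitespace set is assumed to match what Python's bytes.lstrip() and bytes.rstrip() strip by default.

```elixir
defmodule FeaturesSketch do
  # Assumed ASCII whitespace set: \t, \n, \v, \f, \r and space.
  @whitespace [9, 10, 11, 12, 13, 32]

  # `cfg` is a plain map with :block_size, :beg_size, :end_size and
  # :padding_token keys (a stand-in for the real config struct).
  def extract(content, cfg) when is_binary(content) do
    size = byte_size(content)
    # Step 1: read at most block_size bytes from each end.
    n = min(cfg.block_size, size)

    beg_ints =
      content
      |> binary_part(0, n)
      |> lstrip()
      |> take_first(cfg.beg_size)
      |> pad_trailing(cfg.beg_size, cfg.padding_token)

    end_ints =
      content
      |> binary_part(size - n, n)
      |> rstrip()
      |> take_last(cfg.end_size)
      |> pad_leading(cfg.end_size, cfg.padding_token)

    # The feature vector is beg ++ end.
    beg_ints ++ end_ints
  end

  # Step 2: strip leading / trailing ASCII whitespace.
  defp lstrip(<<b, rest::binary>>) when b in @whitespace, do: lstrip(rest)
  defp lstrip(bin), do: bin

  defp rstrip(bin) do
    size = byte_size(bin)

    if size > 0 and :binary.last(bin) in @whitespace do
      rstrip(binary_part(bin, 0, size - 1))
    else
      bin
    end
  end

  # Step 3: take up to `count` bytes as integers, then pad to `count`.
  defp take_first(bin, count) do
    k = min(count, byte_size(bin))
    :binary.bin_to_list(binary_part(bin, 0, k))
  end

  defp take_last(bin, count) do
    k = min(count, byte_size(bin))
    :binary.bin_to_list(binary_part(bin, byte_size(bin) - k, k))
  end

  # Assumption: beg is padded at the end, end is padded at the front.
  defp pad_trailing(ints, count, token),
    do: ints ++ List.duplicate(token, count - length(ints))

  defp pad_leading(ints, count, token),
    do: List.duplicate(token, count - length(ints)) ++ ints
end

cfg = %{block_size: 8, beg_size: 4, end_size: 4, padding_token: 256}

IO.inspect(FeaturesSketch.extract("  ab\n", cfg))
# prints [97, 98, 10, 256, 32, 32, 97, 98]
```

With this input, the beg half loses its two leading spaces and is padded with one token, while the end half loses the trailing newline and keeps the leading spaces, since only the trailing side is stripped there.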

The mid features and use_inputs_at_offsets are not implemented. This matches the current reference implementation, which asserts mid_size == 0 and use_inputs_at_offsets == false.

Summary

Functions

extract(content, config)

Extracts the model feature vector from content.

extract_beg(content, config)

Returns the beg portion of the feature vector only.

Functions

extract(content, config)

@spec extract(binary(), Magika.Config.t()) :: [non_neg_integer()]

Extracts the model feature vector from content.

Returns a list of beg_size + end_size integers.

extract_beg(content, config)

@spec extract_beg(binary(), Magika.Config.t()) :: [non_neg_integer()]

Returns the beg portion of the feature vector only.

Used to replicate the reference implementation's check that enough meaningful bytes remain after stripping: the element at index min_file_size_for_dl - 1 of beg must not be the padding token.
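The check described above could look roughly like the sketch below. The module and function names (EnoughBytesSketch, enough_bytes?) are hypothetical, and the padding token 256 and the sample beg list are made-up values. In real use, beg would presumably come from extract_beg/2, and the threshold and padding token from the config. Treating an out-of-range index as "not enough bytes" is a choice made for this sketch only.

```elixir
defmodule EnoughBytesSketch do
  # Returns true when position `min_file_size_for_dl - 1` of `beg`
  # holds a real byte rather than the padding token.
  def enough_bytes?(beg, min_file_size_for_dl, padding_token)
      when is_list(beg) and is_integer(min_file_size_for_dl) and
             min_file_size_for_dl > 0 do
    case Enum.at(beg, min_file_size_for_dl - 1) do
      # Index beyond the end of beg: treated as not enough bytes
      # (an assumption made for this sketch).
      nil -> false
      ^padding_token -> false
      _byte -> true
    end
  end
end

# A beg vector with three real bytes followed by one padding token.
beg = [97, 98, 10, 256]

IO.inspect(EnoughBytesSketch.enough_bytes?(beg, 3, 256))
# prints true
IO.inspect(EnoughBytesSketch.enough_bytes?(beg, 4, 256))
# prints false
```

Because beg is padded only after the real bytes run out, checking a single index is enough: if that position holds a byte, every earlier position does too.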