Feature extraction (v2) from a file's content.
This is a faithful port of Magika's _extract_features_from_seekable logic:
- Read at most block_size bytes from the beginning and from the end.
- Strip ASCII whitespace (lstrip for the beginning, rstrip for the end).
- Take beg_size bytes from the start and end_size bytes from the end, padding with padding_token when there are not enough bytes.
The resulting feature vector is beg ++ end, a list of beg_size + end_size
integers, suitable for feeding to the model.
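The steps above can be sketched in a few lines of Elixir. This is a minimal illustration, not the module's actual source: the FeatureSketch module name and the keyword-list options are hypothetical stand-ins for Magika.Config, and the sketch assumes that beg is padded on the right and end is padded on the left, which the description above does not spell out.

```elixir
defmodule FeatureSketch do
  # Hypothetical sketch of the three steps described above.
  # ASCII whitespace as stripped by Python's bytes.lstrip()/rstrip():
  # tab, newline, vertical tab, form feed, carriage return, space.
  defguardp is_ws(c) when c in 9..13 or c == 32

  def extract(content, opts) when is_binary(content) do
    block_size = Keyword.fetch!(opts, :block_size)
    beg_size = Keyword.fetch!(opts, :beg_size)
    end_size = Keyword.fetch!(opts, :end_size)
    pad = Keyword.fetch!(opts, :padding_token)

    # Step 1: read at most block_size bytes from each end.
    n = min(block_size, byte_size(content))
    beg_block = binary_part(content, 0, n)
    end_block = binary_part(content, byte_size(content) - n, n)

    # Steps 2 and 3: strip, truncate, pad, then concatenate beg ++ end.
    beg_features(lstrip(beg_block), beg_size, pad) ++
      end_features(rstrip(end_block), end_size, pad)
  end

  # Drop leading ASCII whitespace.
  defp lstrip(<<c, rest::binary>>) when is_ws(c), do: lstrip(rest)
  defp lstrip(bin), do: bin

  # Drop trailing ASCII whitespace.
  defp rstrip(bin) do
    size = byte_size(bin)

    if size > 0 and is_ws(:binary.last(bin)) do
      rstrip(binary_part(bin, 0, size - 1))
    else
      bin
    end
  end

  # First `size` bytes, padded on the right (assumption).
  defp beg_features(bin, size, pad) do
    n = min(size, byte_size(bin))
    bytes = bin |> binary_part(0, n) |> :binary.bin_to_list()
    bytes ++ List.duplicate(pad, size - n)
  end

  # Last `size` bytes, padded on the left (assumption).
  defp end_features(bin, size, pad) do
    n = min(size, byte_size(bin))
    bytes = bin |> binary_part(byte_size(bin) - n, n) |> :binary.bin_to_list()
    List.duplicate(pad, size - n) ++ bytes
  end
end

FeatureSketch.extract("  hi\n", block_size: 8, beg_size: 4, end_size: 4, padding_token: 256)
# => [104, 105, 10, 256, 32, 32, 104, 105]
```

In this example the beg half keeps the trailing newline and the end half keeps the leading spaces, because each block is stripped on one side only. The value 256 is used here as an illustrative padding token that cannot collide with a real byte.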
mid features and use_inputs_at_offsets are not implemented, matching the
current reference implementation (which asserts mid_size == 0 and
use_inputs_at_offsets == false).
Functions
@spec extract(binary(), Magika.Config.t()) :: [non_neg_integer()]
Extracts the model feature vector from content.
Returns a list of beg_size + end_size integers.
@spec extract_beg(binary(), Magika.Config.t()) :: [non_neg_integer()]
Returns the beg portion of the feature vector only.
Used to replicate the reference check on whether, post-stripping, there are
enough meaningful bytes (i.e. beg[min_file_size_for_dl - 1] is not padding).
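The check described above can be sketched as follows. This is a hedged illustration only: the EnoughBytesSketch module and its enough_bytes?/3 function are hypothetical, the beg list is assumed to come from extract_beg/2, and the sketch assumes 1 <= min_file_size_for_dl <= beg_size.

```elixir
defmodule EnoughBytesSketch do
  # Hypothetical helper: true when the byte at index
  # min_file_size_for_dl - 1 of the beg features is a real byte,
  # i.e. at least min_file_size_for_dl bytes survived stripping.
  def enough_bytes?(beg, min_file_size_for_dl, padding_token)
      when is_list(beg) and min_file_size_for_dl >= 1 do
    Enum.at(beg, min_file_size_for_dl - 1) != padding_token
  end
end

# Three real bytes followed by one padding token (256 is illustrative).
beg = [104, 105, 10, 256]
EnoughBytesSketch.enough_bytes?(beg, 3, 256)
# => true
EnoughBytesSketch.enough_bytes?(beg, 4, 256)
# => false
```

Because padding only ever follows the real bytes in beg, inspecting a single index is enough; there is no need to count the non-padding entries.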