Magika (Magika v0.1.0-rc.0)

Elixir binding of Magika, Google's deep-learning file content type detector.

Magika identifies the content type of a file (e.g. html, python, pdf, zip) from its bytes, using a small ONNX model run via OnnxRuntime. It is a faithful port of the reference Python implementation's standard_v3_3 model and inference logic.

Usage

The model is loaded once and hosted by a supervised Magika.Server that starts automatically with the :magika application. Call the API without threading an instance around:

{:ok, result} = Magika.identify("<!DOCTYPE html>\n<html>...</html>")
result.prediction.output.label       #=> "html"
result.prediction.output.mime_type   #=> "text/html"
result.prediction.score              #=> 0.99...

{:ok, result} = Magika.identify_path("/path/to/file.pdf")
result.prediction.output.label       #=> "pdf"
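
Because the result is a nested structure, the label can also be extracted by pattern matching. The following is a minimal sketch that relies only on the prediction.output.label field shown above; the UploadCheck module and its allow-list are hypothetical and not part of Magika.

```elixir
defmodule UploadCheck do
  # Hypothetical allow-list of Magika labels; adjust to your application.
  @allowed_labels ~w(pdf png jpeg)

  # Matches any map or struct that exposes prediction.output.label,
  # the shape shown in the usage example.
  def allowed?(%{prediction: %{output: %{label: label}}}) do
    label in @allowed_labels
  end
end

# With the :magika application running:
#   {:ok, result} = Magika.identify(bytes)
#   UploadCheck.allowed?(result)
```

Matching on the shape rather than on a specific struct keeps the helper easy to exercise with plain maps in tests.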

Prediction mode

The prediction mode controls how strict Magika is before trusting the model's guess. The hosted server uses :high_confidence by default; change it in your application config:

config :magika, prediction_mode: :best_guess

The modes:

  • :high_confidence (default): keep the model prediction only when its score clears the threshold defined for the predicted content type. Content types without a threshold of their own use the generic medium-confidence threshold.
  • :medium_confidence: keep the model prediction when its score clears the generic medium-confidence threshold.
  • :best_guess: always return the model prediction, regardless of score.

When the score is too low for the chosen mode, the output is generalized to txt (for text content types) or unknown (for binary content types).
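
The decision described above can be sketched in plain Elixir. This is an illustration, not the library's implementation: the ModeSketch module, the text? flag, and both threshold values are made up for the example, and it assumes that the fallback applies to content types that have no threshold of their own.

```elixir
defmodule ModeSketch do
  # Illustrative numbers only; the real values come from the model's config.
  @medium_threshold 0.5
  @per_type_thresholds %{"markdown" => 0.75}

  # Returns the label to report for a model guess under the given mode.
  # `text?` says whether the guessed content type is a text type.
  def resolve(:best_guess, label, _score, _text?), do: label

  def resolve(:medium_confidence, label, score, text?) do
    keep_or_generalize(label, score, @medium_threshold, text?)
  end

  def resolve(:high_confidence, label, score, text?) do
    # Assumption: types without their own threshold use the medium one.
    threshold = Map.get(@per_type_thresholds, label, @medium_threshold)
    keep_or_generalize(label, score, threshold, text?)
  end

  # Keep the guess when the score clears the threshold, otherwise
  # generalize to "txt" (text types) or "unknown" (binary types).
  defp keep_or_generalize(label, score, threshold, _text?) when score >= threshold, do: label
  defp keep_or_generalize(_label, _score, _threshold, true), do: "txt"
  defp keep_or_generalize(_label, _score, _threshold, false), do: "unknown"
end
```

With these made-up numbers, a "markdown" guess scored at 0.6 is kept under :medium_confidence but generalized to "txt" under :high_confidence.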

Standalone instances (advanced)

You normally don't need this. For one-off scripts or tests you can build an instance with new/1 and pass it as the first argument, bypassing the supervised server. A Magika.t() is immutable and safe to reuse:

magika = Magika.new(prediction_mode: :best_guess)
{:ok, result} = Magika.identify(magika, content)

A specific named server can also be targeted with the :server option:

{:ok, result} = Magika.identify(content, server: MyApp.Magika)

Types

prediction_mode()

@type prediction_mode() :: :high_confidence | :medium_confidence | :best_guess

t()

@type t() :: %Magika{
  config: Magika.Config.t(),
  model: OnnxRuntime.Model.t(),
  prediction_mode: prediction_mode()
}

Functions

identify(content, opts \\ [])

@spec identify(
  binary(),
  keyword()
) :: {:ok, Magika.Result.t()}
@spec identify(t(), binary()) :: {:ok, Magika.Result.t()}

Identifies the content type of the given raw content (a binary).

Resolves the hosted instance from a Magika.Server. Pass server: to target a specific named server (defaults to Magika.Server). Alternatively, pass a Magika instance as the first argument to bypass the server entirely.

Always returns {:ok, result} — identification of in-memory bytes cannot fail the way a filesystem read can.

identify_path(path, opts \\ [])

@spec identify_path(
  Path.t(),
  keyword()
) :: {:ok, Magika.Result.t()} | {:error, Magika.Result.t()}
@spec identify_path(t(), Path.t()) ::
  {:ok, Magika.Result.t()} | {:error, Magika.Result.t()}

Identifies the content type of the file at path.

Resolves the hosted instance from a Magika.Server. Pass server: to target a specific named server (defaults to Magika.Server). Alternatively, pass a Magika instance as the first argument to bypass the server entirely.

Returns {:ok, result} on success, or {:error, result} when the path does not exist or cannot be read. Directories and other special files are reported via dedicated content types (directory, symlink, unknown).
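
Both return shapes can be handled with function clauses. The sketch below is one possible way to do that; the PathReport module is hypothetical, and the error result is only inspected because its fields are not described in this section.

```elixir
defmodule PathReport do
  # Turns the documented return shapes of Magika.identify_path/2 into a
  # printable line. Only prediction.output.label is read on success.
  def describe({:ok, %{prediction: %{output: %{label: label}}}}) do
    "detected: " <> label
  end

  def describe({:error, result}) do
    "could not read: " <> inspect(result)
  end
end

# With the :magika application running:
#   "/path/to/file.pdf" |> Magika.identify_path() |> PathReport.describe() |> IO.puts()
```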

identify_stream(device, opts \\ [])

@spec identify_stream(
  IO.device(),
  keyword()
) :: {:ok, Magika.Result.t()}
@spec identify_stream(t(), IO.device()) :: {:ok, Magika.Result.t()}

Identifies the content type of data read from an open binary IO device, such as a file opened with File.open/2.

Resolves the hosted instance from a Magika.Server. Pass server: to target a specific named server (defaults to Magika.Server). Alternatively, pass a Magika instance as the first argument to bypass the server entirely.

The whole stream is read into memory (Magika only needs a bounded prefix and suffix, but reading fully keeps the implementation simple and correct). The caller is responsible for opening and closing the device.
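
Since the caller owns the device, one convenient pattern is the function form of File.open/3, which closes the file once the function returns. The sketch below assumes nothing beyond the documented identify_stream/1 call; the StreamSketch module is hypothetical, and the identify parameter exists only so the example can be run without the model loaded.

```elixir
defmodule StreamSketch do
  # `identify` defaults to the documented Magika.identify_stream/1.
  def label_of(path, identify \\ &Magika.identify_stream/1) do
    # File.open/3 closes the device after `identify` returns and wraps
    # its return value in {:ok, ...}; open failures come back as
    # {:error, reason}.
    case File.open(path, [:read, :binary], identify) do
      {:ok, {:ok, result}} -> {:ok, result.prediction.output.label}
      {:error, reason} -> {:error, reason}
    end
  end
end

# With the :magika application running:
#   StreamSketch.label_of("/path/to/file.pdf")
```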

model_name(magika)

@spec model_name(t()) :: String.t()

Returns the loaded model's name (the model directory basename).

new(opts \\ [])

@spec new(keyword()) :: t()

Creates a new Magika instance, loading the model and configuration.

Options

  • :prediction_mode — one of :high_confidence (default), :medium_confidence, :best_guess.
  • :model_path — path to a custom model.onnx. Defaults to the vendored standard_v3_3 model.
  • :model_config_path — path to a custom config.min.json.
  • :content_types_kb_path — path to a custom content_types_kb.min.json.
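
As an illustration of how these options combine, the sketch below builds the keyword list for new/1 from a single directory. The CustomModel module and the assumption that all three files live in one directory are hypothetical; only the option names and file names come from the list above.

```elixir
defmodule CustomModel do
  # Builds the option list for Magika.new/1 from a model directory.
  # Assumes (hypothetically) that all three files sit side by side.
  def options(model_dir, mode \\ :high_confidence) do
    [
      prediction_mode: mode,
      model_path: Path.join(model_dir, "model.onnx"),
      model_config_path: Path.join(model_dir, "config.min.json"),
      content_types_kb_path: Path.join(model_dir, "content_types_kb.min.json")
    ]
  end
end

# With the library available:
#   magika = "/opt/models/my_model" |> CustomModel.options(:best_guess) |> Magika.new()
#   Magika.model_name(magika)
```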