Dala.ML.Gpu.Inference (dala v0.6.0)

GPU-accelerated ML inference via ExCubecl.

Provides a high-level interface for running ML models on the GPU, integrating with Dala's existing Dala.ML modules and Nx tensors.

Architecture

Nx Tensor → GPU Buffer → CubeCL Kernels → GPU Buffer → Nx Tensor

Supported Models

Models are loaded from pre-compiled CubeCL kernel libraries:

  • :mobilenet_v2 — Image classification (224x224 RGB)
  • :yolo_v5 — Object detection (640x640 RGB)
  • :blazeface — Face detection (128x128 RGB)
  • :posenet — Pose estimation (257x257 RGB)
  • :deeplab — Semantic segmentation (513x513 RGB)
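Each model expects a fixed input resolution, so it can help to check a frame's dimensions before preprocessing. The sketch below hard-codes the sizes from the list above in a hypothetical `ModelInputs` module. It is not part of Dala, and the library may expose the same information differently (for example through model_info/1).

```elixir
defmodule ModelInputs do
  # Hypothetical helper, not part of Dala: sizes copied from the list above.
  @sizes %{
    mobilenet_v2: {224, 224},
    yolo_v5: {640, 640},
    blazeface: {128, 128},
    posenet: {257, 257},
    deeplab: {513, 513}
  }

  # Returns {:ok, {height, width}} for a known model, :error otherwise.
  def input_size(name), do: Map.fetch(@sizes, name)

  # True when an image of the given {height, width} already matches the model input.
  def matches?(name, {h, w}) do
    input_size(name) == {:ok, {h, w}}
  end
end
```

A caller could use `ModelInputs.matches?/2` to decide whether a resize step is needed before handing the image to the preprocessing function.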

Example

# Load a model
{:ok, model} = Dala.ML.Gpu.load_model(:mobilenet_v2)

# Preprocess image
input_tensor = Dala.ML.preprocess(image_data, size: {224, 224})

# Run inference on GPU
{:ok, output} = Dala.ML.Gpu.predict(model, input_tensor)

# Post-process results
classes = Dala.ML.Gpu.top_k(output, k: 5)

GPU-to-GPU Frame Inference

For video pipelines, run inference directly on GPU frame buffers without a CPU round-trip:

{:ok, model} = Dala.ML.Gpu.load_model(:mobilenet_v2)

# Bind video frames (GPU textures) to the model pipeline
{:ok, model} = Dala.ML.Gpu.load_model_from_frames(model, video_frames)

# Run inference on a single frame (GPU-to-GPU)
{:ok, output_tensor} = Dala.ML.Gpu.predict_frame(model, frame)

Integration with Dala.ML

This module complements, rather than replaces, the existing Dala.ML modules. Use Dala.ML.predict/2 for automatic backend selection, or call this module directly for GPU-specific control.

Summary

Functions

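available_models()

List available pre-compiled models.

free_model(inference)

Free a model's GPU pipeline resources.

load_model(name)

Load a pre-compiled model for GPU inference.

load_model_from_frames(model, frames)

Load model weights from GPU video frames for GPU-to-GPU inference.

model_info(name)

Return model metadata.

predict(model, input_tensor)

Run inference on a loaded model with an Nx tensor input.

predict_async(inference, input_tensor)

Run inference asynchronously.

predict_frame(model, frame)

Run inference directly on a VideoFrame (GPU-to-GPU).

top_k(tensor, opts \\ [])

Return the top-k predictions from a classification output.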

Types

model()

@type model() :: %Dala.ML.Gpu.Inference{
  input_shape: tuple(),
  name: atom(),
  output_shape: tuple(),
  pipeline_id: non_neg_integer() | nil,
  postprocess: atom(),
  preprocess: atom(),
  stages: [map()]
}

Functions

available_models()

@spec available_models() :: [atom()]

List available pre-compiled models.

free_model(inference)

@spec free_model(model()) :: :ok | {:error, term()}

Free a model's GPU pipeline resources.
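Because a loaded model holds GPU pipeline resources, callers will usually want free_model/1 to run even when inference raises. The following is a minimal sketch of that pattern. `InferenceSession` and its `backend` argument are hypothetical names introduced here, and the only assumption about the backend is that it exports load_model/1 and free_model/1 with the specs shown on this page.

```elixir
defmodule InferenceSession do
  # Hypothetical wrapper, not part of Dala. `backend` is any module exporting
  # load_model/1 and free_model/1 with the specs documented on this page.
  def with_model(name, fun, backend \\ Dala.ML.Gpu.Inference) when is_function(fun, 1) do
    with {:ok, model} <- backend.load_model(name) do
      try do
        fun.(model)
      after
        # Runs whether fun returns normally or raises.
        backend.free_model(model)
      end
    end
  end
end
```

If load_model/1 returns `{:error, reason}`, the `with` expression passes that tuple through unchanged and nothing is freed, since no model was created.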

load_model(name)

@spec load_model(atom()) :: {:ok, model()} | {:error, term()}

Load a pre-compiled model for GPU inference.

load_model_from_frames(model, frames)

@spec load_model_from_frames(model(), [ExCubecl.VideoFrame.t() | binary()]) ::
  {:ok, model()} | {:error, term()}

Load model weights from GPU video frames for GPU-to-GPU inference.

This enables processing of ExCubecl.VideoFrame structs without a CPU round-trip. The frames are uploaded to GPU buffers and bound to the model pipeline.

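Parameters

  • model: a model() struct, such as one returned by load_model/1
  • frames: a list of ExCubecl.VideoFrame.t() structs or binaries

Returns

{:ok, updated_model} with frame buffers bound to the pipeline, or {:error, reason} on failure.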

Example

frames = ExCubecl.VideoFrame.stream(camera_source, max_frames: 30)
{:ok, model} = Dala.ML.Gpu.load_model(:mobilenet_v2)
{:ok, model} = Dala.ML.Gpu.load_model_from_frames(model, frames)

model_info(name)

@spec model_info(atom()) :: map() | nil

Return model metadata.

predict(model, input_tensor)

@spec predict(model(), Nx.Tensor.t()) :: {:ok, Nx.Tensor.t()} | {:error, term()}

Run inference on a loaded model with an Nx tensor input.

predict_async(inference, input_tensor)

@spec predict_async(model(), Nx.Tensor.t()) :: {:ok, reference()} | {:error, term()}

Run inference asynchronously.
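The spec shows that predict_async/2 returns `{:ok, reference()}`, but this page does not say how the result is delivered. The sketch below is only one possibility: it ASSUMES the result arrives in the caller's mailbox as `{ref, result}`, which is a guess and should be checked against the library source. `AsyncAwait` is a hypothetical helper, not part of Dala.

```elixir
defmodule AsyncAwait do
  # Hypothetical helper, not part of Dala. ASSUMPTION: the result is sent to the
  # calling process as {ref, result}; this page does not document the format.
  def await(ref, timeout \\ 5_000) when is_reference(ref) do
    receive do
      # Pin the reference so unrelated messages stay in the mailbox.
      {^ref, result} -> result
    after
      timeout -> {:error, :timeout}
    end
  end
end
```

Pinning the reference means only the matching reply is consumed, so several asynchronous requests can be awaited independently.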

predict_frame(model, frame)

@spec predict_frame(model(), ExCubecl.VideoFrame.t() | binary()) ::
  {:ok, Nx.Tensor.t()} | {:error, term()}

Run inference directly on a VideoFrame (GPU-to-GPU).

This avoids a CPU round-trip for the input by running the model pipeline directly on the GPU texture backing the VideoFrame. Because the output is returned as an Nx tensor, one GPU→CPU read is still required for the result.

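Parameters

  • model: a model() struct
  • frame: an ExCubecl.VideoFrame.t() struct or a binary

Returns

{:ok, output_tensor} on success, where output_tensor is an Nx.Tensor.t(); {:error, reason} otherwise.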

Example

{:ok, model} = Dala.ML.Gpu.load_model(:mobilenet_v2)
{:ok, model} = Dala.ML.Gpu.load_model_from_frames(model, calibration_frames)

# Process each frame in the video stream
for frame <- video_stream do
  {:ok, predictions} = Dala.ML.Gpu.predict_frame(model, frame)
  # Use predictions...
end
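The loop above matches on `{:ok, predictions}`, so a single `{:error, reason}` raises a MatchError. A sketch of a loop that stops at the first error instead is shown below. `FrameLoop` is a hypothetical helper, not part of Dala; it takes the prediction step as a 1-arity function so it makes no assumption beyond the `{:ok, _} | {:error, _}` return shape in the spec.

```elixir
defmodule FrameLoop do
  # Hypothetical helper, not part of Dala. `predict_fun` is a 1-arity function,
  # for example: fn frame -> Dala.ML.Gpu.predict_frame(model, frame) end
  def run(frames, predict_fun) when is_function(predict_fun, 1) do
    frames
    |> Enum.reduce_while({:ok, []}, fn frame, {:ok, acc} ->
      case predict_fun.(frame) do
        {:ok, output} -> {:cont, {:ok, [output | acc]}}
        {:error, reason} -> {:halt, {:error, reason}}
      end
    end)
    |> case do
      # Outputs were prepended, so restore frame order.
      {:ok, outputs} -> {:ok, Enum.reverse(outputs)}
      error -> error
    end
  end
end
```

This version collects every output in memory, which suits short clips; for a long-running stream it may be preferable to handle each output as it arrives.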

top_k(tensor, opts \\ [])

@spec top_k(
  Nx.Tensor.t(),
  keyword()
) :: [{number(), non_neg_integer()}]

Return the top-k predictions from a classification output.
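To illustrate the return shape in the spec, here is a plain-list sketch of a top-k selection. It is not Dala's implementation and does not use Nx. It ASSUMES each tuple is `{score, class_index}` and that `:k` defaults to 5, neither of which this page confirms. `TopKSketch` is a hypothetical module name.

```elixir
defmodule TopKSketch do
  # Hypothetical reference implementation on a plain list, not Dala's code.
  # ASSUMPTION: each tuple is {score, class_index}, matching the spec's
  # {number(), non_neg_integer()} shape, and :k defaults to 5.
  def top_k(scores, opts \\ []) when is_list(scores) do
    k = Keyword.get(opts, :k, 5)

    scores
    |> Enum.with_index()
    |> Enum.sort_by(fn {score, _index} -> score end, :desc)
    |> Enum.take(k)
  end
end

TopKSketch.top_k([0.1, 0.7, 0.2], k: 2)
# => [{0.7, 1}, {0.2, 2}]
```

Sorting the whole list is the simplest approach and is adequate for typical classifier outputs of around a thousand classes.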