Dala.Gpu.Compute (dala v0.6.0)


High-level GPU compute orchestration for Dala.

Wraps EXCubeCL with Dala-native patterns: GenServer-managed lifecycle, dirty-CPU scheduling, and integration with Dala.Gpu surfaces, Dala.Media pipelines, and Dala.ML inference.

Architecture


Dala.Gpu.Compute
  • buffer management (create, read, free)
  • kernel execution (sync + async)
  • pipeline orchestration (multi-stage)
  • stream scheduler (mobile-optimized)
        ↓
EXCubeCL (Elixir NIF stubs)
        ↓
Rust NIF → CubeCL Runtime → Metal / OpenGL ES / CPU

Quick Start

# Check GPU availability
Dala.Gpu.Compute.device_info()
# %{name: "ExCubecl CPU (Rust NIF)", gpu: false, version: "0.3.0"}

# Create buffers
a = Dala.Gpu.Compute.buffer([1.0, 2.0, 3.0], {3}, :f32)
b = Dala.Gpu.Compute.buffer([4.0, 5.0, 6.0], {3}, :f32)
c = Dala.Gpu.Compute.buffer([0.0, 0.0, 0.0], {3}, :f32)

# Run a kernel
Dala.Gpu.Compute.run_kernel(:elementwise_add, [a, b], c, %{})

# Read results
Dala.Gpu.Compute.read(c)
# binary of packed f32 values: 5.0, 7.0, 9.0 (12 bytes)

# Cleanup (optional — buffers auto-freed by ResourceArc)
Dala.Gpu.Compute.free(a)
Dala.Gpu.Compute.free(b)
Dala.Gpu.Compute.free(c)

Async Execution

cmd_id = Dala.Gpu.Compute.submit(%{
  op: :run_kernel,
  kernel: "relu",
  inputs: [a.ref],
  output: b.ref,
  params: %{}
})

Dala.Gpu.Compute.poll(cmd_id)  # :pending | :completed | {:error, reason}
Dala.Gpu.Compute.wait(cmd_id)   # blocks until done
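
When blocking in wait/1 is not acceptable, the poll/1 return values listed above can drive a bounded polling loop. The following is a minimal sketch, not part of Dala: `PollSketch` and `await/3` are hypothetical names, the interval and timeout defaults are arbitrary, and `fake_poll` stands in for `fn -> Dala.Gpu.Compute.poll(cmd_id) end`.

```elixir
defmodule PollSketch do
  # Hypothetical helper: calls poll_fun until it stops returning :pending
  # or the timeout elapses. poll_fun is any zero-arity function returning
  # :pending | :completed | {:error, reason}.
  def await(poll_fun, interval_ms \\ 5, timeout_ms \\ 1_000) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    loop(poll_fun, interval_ms, deadline)
  end

  defp loop(poll_fun, interval_ms, deadline) do
    case poll_fun.() do
      :completed ->
        :ok

      {:error, reason} ->
        {:error, reason}

      :pending ->
        if System.monotonic_time(:millisecond) >= deadline do
          {:error, :timeout}
        else
          Process.sleep(interval_ms)
          loop(poll_fun, interval_ms, deadline)
        end
    end
  end
end

# Stand-in for a real command: reports :pending twice, then :completed.
{:ok, counter} = Agent.start_link(fn -> 0 end)

fake_poll = fn ->
  n = Agent.get_and_update(counter, fn n -> {n, n + 1} end)
  if n < 2, do: :pending, else: :completed
end

result = PollSketch.await(fake_poll)
```

Sleeping between polls keeps the calling process from spinning; a GenServer could instead reschedule itself with Process.send_after/3 to stay responsive to other messages.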

Pipeline Orchestration

# pipeline_add/2 returns the pipeline, so bind the result of the whole chain
pipeline =
  Dala.Gpu.Compute.pipeline()
  |> Dala.Gpu.Compute.pipeline_add(%{
    op: :run_kernel,
    kernel: :blur,
    inputs: [input_buf],
    output: temp_buf,
    params: %{radius: 3}
  })
  |> Dala.Gpu.Compute.pipeline_add(%{
    op: :run_kernel,
    kernel: :relu,
    inputs: [temp_buf],
    output: output_buf,
    params: %{}
  })

Dala.Gpu.Compute.pipeline_run(pipeline)

Integration with Dala.Gpu surfaces

For rendering results to screen, pair a compute buffer with a Dala.Gpu.Surface:

{:ok, surface} = Dala.Gpu.create_surface(640, 480)
# Run compute → read buffer → upload to surface
Dala.Gpu.Compute.run_kernel(:generate_gradient, [], output_buf, %{})
pixels = Dala.Gpu.Compute.read(output_buf)
Dala.Gpu.set_pixels(surface, pixels)
Dala.Gpu.present(surface)

Supported Types

Type    Description
:f32    32-bit float
:f64    64-bit float
:s32    32-bit signed integer
:s64    64-bit signed integer
:u32    32-bit unsigned integer
:u8     8-bit unsigned integer
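
The bit widths above determine how many bytes a buffer of a given shape occupies. A minimal sketch of that arithmetic follows; `DtypeSketch` and `buffer_bytes/2` are hypothetical helper names, not part of Dala, and the sketch assumes buffers are tightly packed with no padding.

```elixir
defmodule DtypeSketch do
  # Bytes per element, derived from the bit widths of the supported types.
  @sizes %{f32: 4, f64: 8, s32: 4, s64: 8, u32: 4, u8: 1}

  def bytes_per_element(dtype), do: Map.fetch!(@sizes, dtype)

  # Element count is the product of the shape's dimensions.
  def buffer_bytes(shape, dtype) when is_tuple(shape) do
    Tuple.product(shape) * bytes_per_element(dtype)
  end
end

vector_bytes = DtypeSketch.buffer_bytes({3}, :f32)
image_bytes = DtypeSketch.buffer_bytes({640, 480, 4}, :u8)
# vector_bytes == 12, image_bytes == 1_228_800
```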

Mobile Notes

On iOS, CubeCL kernels compile to Metal shaders at runtime. On Android, they compile to OpenGL ES compute shaders. On desktop (dev), a CPU fallback is used.

GPU compute calls run on dirty CPU schedulers automatically, so long-running kernels do not block the BEAM's regular schedulers.
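
For context, the number of dirty CPU schedulers available to such calls can be inspected with standard Erlang system calls. This snippet only queries the VM and does not touch Dala.

```elixir
# Scheduler threads that run ordinary Elixir processes.
normal = :erlang.system_info(:schedulers)

# Separate threads reserved for long-running, CPU-bound native calls.
dirty_cpu = :erlang.system_info(:dirty_cpu_schedulers)

IO.puts("normal schedulers: #{normal}, dirty CPU schedulers: #{dirty_cpu}")
```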

Summary

Functions

add(a, b, output)
Elementwise addition: output = a + b

buffer(data, shape, dtype \\ :f32)
Create a GPU buffer from a list of values.

buffer_from_binary(data, shape, dtype \\ :u8)
Create a GPU buffer from a raw binary.

buffer_zeros(shape, dtype \\ :f32)
Create an uninitialized GPU buffer with the given shape and dtype.

device_info()
Return GPU device information.

dtype(buffer)
Return the data type of a buffer.

free(buffer)
Free a GPU buffer and release all associated GPU memory.

free_many(buffers)
Free multiple GPU buffers at once.

free_pipeline(pipeline)
Free a pipeline and its internal resources.

from_nx(tensor)
Convert an Nx tensor to a GPU buffer.

gpu?()
Return true if a real GPU is available (not CPU fallback).

matmul(a, b, output)
Matrix multiplication: output = a * b

multiply(a, b, output)
Elementwise multiply: output = a * b

pipeline()
Create a new empty GPU compute pipeline.

pipeline_add(pipeline, stage_spec)
Add a stage to a pipeline. Returns the pipeline for chaining.

pipeline_run(pipeline)
Execute all stages in a pipeline sequentially.

poll(cmd_id)
Poll an async command. Returns :pending, :completed, or {:error, reason}.

read(buffer)
Read data from a GPU buffer back to an Elixir binary.

read_binary(buffer)
Read data from a GPU buffer as a raw binary (zero-copy when possible).

read_list(buffer)
Read data from a GPU buffer and convert to an Elixir list.

relu(input, output)
Elementwise ReLU activation: output = max(0, input)

run_kernel(kernel, inputs, output, params \\ %{})
Run a named kernel synchronously.

run_kernel_async(kernel, inputs, output, params \\ %{})
Run a kernel asynchronously and wait for completion.

run_to_surface(kernel, inputs, output, surface, params \\ %{})
Run a compute kernel and upload the result directly to a GPU surface.

scale(input, scalar, output)
Scalar multiply: output = input * scalar

shape(buffer)
Return the shape of a buffer.

size(buffer)
Return the size of a buffer in bytes.

submit(spec)
Submit a compute command asynchronously. Returns a command ID for polling.

to_nx(buf, shape, dtype)
Convert a GPU buffer to an Nx tensor.

version()
Return the EXCubeCL version string.

wait(cmd_id)
Block until an async command completes. Returns :ok or {:error, reason}.

Functions

add(a, b, output)

Elementwise addition: output = a + b

Example

c = Dala.Gpu.Compute.buffer_zeros({3}, :f32)
Dala.Gpu.Compute.add(a, b, c)

buffer(data, shape, dtype \\ :f32)

@spec buffer(list(), tuple(), atom()) :: Dala.Gpu.Compute.Buffer.t()

Create a GPU buffer from a list of values.

Parameters

  • data: list of values to upload
  • shape: tuple describing dimensions, e.g. {3} for a 1D vector of 3 elements
  • dtype: data type atom (:f32, :f64, :s32, :s64, :u32, :u8); defaults to :f32

Example

buf = Dala.Gpu.Compute.buffer([1.0, 2.0, 3.0], {3}, :f32)

buffer_from_binary(data, shape, dtype \\ :u8)

@spec buffer_from_binary(binary(), tuple(), atom()) :: Dala.Gpu.Compute.Buffer.t()

Create a GPU buffer from a raw binary.

Example

buf = Dala.Gpu.Compute.buffer_from_binary(binary_data, {640, 480, 4}, :u8)
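
A minimal sketch of building a `binary_data` value that fits the shape used above. It assumes tightly packed 8-bit channels in RGBA order, which this page does not specify; only standard Elixir and Erlang functions are used.

```elixir
width = 640
height = 480

# One opaque red pixel as four u8 channels (RGBA order is an assumption).
pixel = <<255, 0, 0, 255>>

# Repeat the pixel once per image position to get a solid-colour frame.
binary_data = :binary.copy(pixel, width * height)

# For :u8 data the byte count equals the product of the shape dimensions.
expected_bytes = Tuple.product({width, height, 4})
sizes_match = byte_size(binary_data) == expected_bytes
```

Checking the byte count before uploading catches shape mismatches on the Elixir side, where the error is easier to diagnose than inside a kernel.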

buffer_zeros(shape, dtype \\ :f32)

@spec buffer_zeros(tuple(), atom()) :: Dala.Gpu.Compute.Buffer.t()

Create an uninitialized GPU buffer with the given shape and dtype.

Example

buf = Dala.Gpu.Compute.buffer_zeros({256, 256}, :f32)

device_info()

@spec device_info() :: map()

Return GPU device information.

dtype(buffer)

@spec dtype(Dala.Gpu.Compute.Buffer.t()) :: atom()

Return the data type of a buffer.

free(buffer)

@spec free(Dala.Gpu.Compute.Buffer.t()) :: :ok

Free a GPU buffer and release all associated GPU memory.

free_many(buffers)

@spec free_many([Dala.Gpu.Compute.Buffer.t()]) :: :ok

Free multiple GPU buffers at once.

free_pipeline(pipeline)

@spec free_pipeline(Dala.Gpu.Compute.Pipeline.t()) :: :ok

Free a pipeline and its internal resources.

from_nx(tensor)

@spec from_nx(Nx.Tensor.t()) :: Dala.Gpu.Compute.Buffer.t()

Convert an Nx tensor to a GPU buffer.

Example

tensor = Nx.tensor([1.0, 2.0, 3.0])
buf = Dala.Gpu.Compute.from_nx(tensor)

gpu?()

@spec gpu?() :: boolean()

Return true if a real GPU is available (not CPU fallback).

matmul(a, b, output)

Matrix multiplication: output = a * b

Both buffers must be 2D. Shape validation is performed by the kernel.

Example

a = Dala.Gpu.Compute.buffer([1.0, 2.0, 3.0, 4.0], {2, 2}, :f32)
b = Dala.Gpu.Compute.buffer([5.0, 6.0, 7.0, 8.0], {2, 2}, :f32)
c = Dala.Gpu.Compute.buffer_zeros({2, 2}, :f32)
Dala.Gpu.Compute.matmul(a, b, c)
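
To check kernel output in tests, a plain-Elixir reference implementation can be handy. The sketch below assumes matrices are stored as flat lists in row-major order, which this page does not state; `MatmulSketch` is a hypothetical module, not part of Dala.

```elixir
defmodule MatmulSketch do
  # Multiplies an {m, k} matrix by a {k, n} matrix, both given as flat
  # row-major lists, and returns the {m, n} product as a flat list.
  # Mismatched inner dimensions fail the pattern match on k.
  def matmul(a, {_m, k}, b, {k, n}) do
    a_rows = Enum.chunk_every(a, k)

    # Transpose b so that each entry is one column.
    b_cols =
      b
      |> Enum.chunk_every(n)
      |> Enum.zip()
      |> Enum.map(&Tuple.to_list/1)

    # Each output element is the dot product of a row of a and a column of b.
    for row <- a_rows, col <- b_cols do
      row
      |> Enum.zip(col)
      |> Enum.map(fn {x, y} -> x * y end)
      |> Enum.sum()
    end
  end
end

product =
  MatmulSketch.matmul([1.0, 2.0, 3.0, 4.0], {2, 2}, [5.0, 6.0, 7.0, 8.0], {2, 2})

# product == [19.0, 22.0, 43.0, 50.0]
```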

multiply(a, b, output)

Elementwise multiply: output = a * b

Example

Dala.Gpu.Compute.multiply(a, b, output)

pipeline()

@spec pipeline() :: Dala.Gpu.Compute.Pipeline.t()

Create a new empty GPU compute pipeline.

pipeline_add(pipeline, stage_spec)

Add a stage to a pipeline. Returns the pipeline for chaining.

pipeline_run(pipeline)

@spec pipeline_run(Dala.Gpu.Compute.Pipeline.t()) :: :ok | {:error, term()}

Execute all stages in a pipeline sequentially.
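
The idea of sequential execution can be sketched in plain Elixir. This is not Dala's implementation: `PipelineSketch` is hypothetical, stages are modelled as zero-arity functions, and stopping at the first error is an assumption suggested by the :ok | {:error, term()} return type.

```elixir
defmodule PipelineSketch do
  # Runs stage functions in order. Each stage returns :ok or
  # {:error, reason}; the first error stops the run and is returned as-is.
  def run(stages) do
    Enum.reduce_while(stages, :ok, fn stage, :ok ->
      case stage.() do
        :ok -> {:cont, :ok}
        {:error, _reason} = error -> {:halt, error}
      end
    end)
  end
end

# Stand-in stages that report when they run by messaging the caller.
stages = [
  fn ->
    send(self(), {:ran, :blur})
    :ok
  end,
  fn -> {:error, :bad_shape} end,
  fn ->
    send(self(), {:ran, :relu})
    :ok
  end
]

result = PipelineSketch.run(stages)
# result == {:error, :bad_shape}; the third stage never runs
```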

poll(cmd_id)

@spec poll(non_neg_integer()) :: :pending | :completed | {:error, term()}

Poll an async command. Returns :pending, :completed, or {:error, reason}.

read(buffer)

@spec read(Dala.Gpu.Compute.Buffer.t()) :: binary()

Read data from a GPU buffer back to an Elixir binary.

EXCubeCL 0.3+ returns {:ok, binary()} from read/1. This function unwraps that tuple and returns the binary directly, which keeps the Dala API simpler.

Example

data = Dala.Gpu.Compute.read(buf)
# binary of packed f32 values, e.g. 5.0, 7.0, 9.0 (12 bytes)
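
Because the result is a raw binary, the values have to be decoded before use. A minimal sketch with a binary comprehension follows, assuming the buffer holds tightly packed native-endian f32 values; the byte order is an assumption, and `data` here is a hand-built stand-in for the binary returned by read/1.

```elixir
# Stand-in for a read/1 result: three f32 values, 4 bytes each.
data = <<5.0::float-32-native, 7.0::float-32-native, 9.0::float-32-native>>

# The binary comprehension consumes the binary 4 bytes at a time.
values = for <<x::float-32-native <- data>>, do: x
# values == [5.0, 7.0, 9.0]
```

For other dtypes the segment type changes accordingly, for example `x::signed-integer-32-native` for :s32 or `x::unsigned-integer-8` for :u8.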

read_binary(buffer)

@spec read_binary(Dala.Gpu.Compute.Buffer.t()) :: binary()

Read data from a GPU buffer as a raw binary (zero-copy when possible).

read_list(buffer)

@spec read_list(Dala.Gpu.Compute.Buffer.t()) :: list()

Read data from a GPU buffer and convert to an Elixir list.

relu(input, output)

@spec relu(Dala.Gpu.Compute.Buffer.t(), Dala.Gpu.Compute.Buffer.t()) ::
  :ok | {:error, term()}

Elementwise ReLU activation: output = max(0, input)

Example

Dala.Gpu.Compute.relu(input, output)
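
For unit tests, the elementwise formulas documented for relu/2, add/3, multiply/3 and scale/3 can be mirrored on plain lists and compared against values read back from the GPU. `ReferenceOps` below is a hypothetical helper, not part of Dala.

```elixir
defmodule ReferenceOps do
  # output = max(0, input), element by element.
  def relu(xs), do: Enum.map(xs, &max(&1, 0.0))

  # output = a + b; both lists must have the same length.
  def add(a, b) when length(a) == length(b), do: Enum.zip_with(a, b, &(&1 + &2))

  # output = a * b, elementwise.
  def multiply(a, b) when length(a) == length(b), do: Enum.zip_with(a, b, &(&1 * &2))

  # output = input * scalar.
  def scale(xs, scalar), do: Enum.map(xs, &(&1 * scalar))
end

relu_out = ReferenceOps.relu([-1.0, 0.0, 2.5])
add_out = ReferenceOps.add([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
# relu_out == [0.0, 0.0, 2.5], add_out == [5.0, 7.0, 9.0]
```

When comparing against :f32 results, allow a small tolerance, since Elixir floats are 64-bit and the GPU values are rounded to 32 bits.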

run_kernel(kernel, inputs, output, params \\ %{})

@spec run_kernel(
  atom(),
  [Dala.Gpu.Compute.Buffer.t()],
  Dala.Gpu.Compute.Buffer.t(),
  map()
) ::
  :ok | {:error, term()}

Run a named kernel synchronously.

Parameters

  • kernel — kernel atom (e.g. :elementwise_add, :relu, :blur)
  • inputs — list of input Buffer structs
  • output — output Buffer struct
  • params — map of kernel-specific parameters

Example

Dala.Gpu.Compute.run_kernel(:elementwise_add, [a, b], c, %{})
Dala.Gpu.Compute.run_kernel(:relu, [a], b, %{slope: 0.1})
Dala.Gpu.Compute.run_kernel(:blur, [image_buf], out_buf, %{radius: 3, sigma: 1.5})

run_kernel_async(kernel, inputs, output, params \\ %{})

@spec run_kernel_async(
  atom(),
  [Dala.Gpu.Compute.Buffer.t()],
  Dala.Gpu.Compute.Buffer.t(),
  map()
) ::
  :ok | {:error, term()}

Run a kernel asynchronously and wait for completion.

run_to_surface(kernel, inputs, output, surface, params \\ %{})

@spec run_to_surface(
  atom(),
  [Dala.Gpu.Compute.Buffer.t()],
  Dala.Gpu.Compute.Buffer.t(),
  pid(),
  map()
) ::
  :ok | {:error, term()}

Run a compute kernel and upload the result directly to a GPU surface.

This is a convenience function that combines kernel execution with surface pixel upload, avoiding an intermediate read-back to the CPU.

Example

{:ok, surface} = Dala.Gpu.create_surface(640, 480)
Dala.Gpu.Compute.run_to_surface(kernel, [input_buf], output_buf, surface, %{})

scale(input, scalar, output)

@spec scale(Dala.Gpu.Compute.Buffer.t(), number(), Dala.Gpu.Compute.Buffer.t()) ::
  :ok | {:error, term()}

Scalar multiply: output = input * scalar

Example

Dala.Gpu.Compute.scale(input, 2.5, output)

shape(buffer)

@spec shape(Dala.Gpu.Compute.Buffer.t()) :: tuple()

Return the shape of a buffer.

size(buffer)

Return the size of a buffer in bytes.

submit(spec)

@spec submit(map()) :: non_neg_integer()

Submit a compute command asynchronously. Returns a command ID for polling.

The spec map is encoded as a string for EXCubeCL 0.3+.

Example

cmd_id = Dala.Gpu.Compute.submit(%{
  op: :run_kernel,
  kernel: "relu",
  inputs: [a.ref],
  output: b.ref,
  params: %{}
})

# Later...
case Dala.Gpu.Compute.poll(cmd_id) do
  :completed -> Dala.Gpu.Compute.read(b)
  {:error, reason} -> handle_error(reason)
  :pending -> retry_later()
end

to_nx(buf, shape, dtype)

@spec to_nx(Dala.Gpu.Compute.Buffer.t(), tuple(), atom()) :: Nx.Tensor.t()

Convert a GPU buffer to an Nx tensor.

Example

tensor = Dala.Gpu.Compute.to_nx(buf, {3}, :f32)

version()

@spec version() :: String.t()

Return the EXCubeCL version string.

wait(cmd_id)

@spec wait(non_neg_integer()) :: :ok | {:error, term()}

Block until an async command completes. Returns :ok or {:error, reason}.