High-level GPU compute orchestration for Dala.
Wraps EXCubeCL with Dala-native
patterns: GenServer-managed lifecycle, dirty-CPU scheduling, and integration
with Dala.Gpu surfaces, Dala.Media pipelines, and Dala.ML inference.
Architecture
┌──────────────────────────────────────────────────────┐
│ Dala.Gpu.Compute                                     │
│ ├── buffer management (create, read, free)           │
│ ├── kernel execution (sync + async)                  │
│ ├── pipeline orchestration (multi-stage)             │
│ └── stream scheduler (mobile-optimized)              │
├──────────────────────────────────────────────────────┤
│ EXCubeCL (Elixir NIF stubs)                          │
├──────────────────────────────────────────────────────┤
│ Rust NIF → CubeCL Runtime → Metal / OpenGL ES / CPU  │
└──────────────────────────────────────────────────────┘
Quick Start
# Check GPU availability
Dala.Gpu.Compute.device_info()
# %{name: "ExCubecl CPU (Rust NIF)", gpu: false, version: "0.3.0"}
# Create buffers
a = Dala.Gpu.Compute.buffer([1.0, 2.0, 3.0], {3}, :f32)
b = Dala.Gpu.Compute.buffer([4.0, 5.0, 6.0], {3}, :f32)
c = Dala.Gpu.Compute.buffer([0.0, 0.0, 0.0], {3}, :f32)
# Run a kernel
Dala.Gpu.Compute.run_kernel(:elementwise_add, [a, b], c, %{})
# Read results
Dala.Gpu.Compute.read(c)
# a binary holding 5.0, 7.0, 9.0 as three 32-bit floats (not a list)
# Cleanup (optional — buffers auto-freed by ResourceArc)
Dala.Gpu.Compute.free(a)
Dala.Gpu.Compute.free(b)
Dala.Gpu.Compute.free(c)
Async Execution
cmd_id = Dala.Gpu.Compute.submit(%{
op: :run_kernel,
kernel: "relu",
inputs: [a.ref],
output: b.ref,
params: %{}
})
Dala.Gpu.Compute.poll(cmd_id) # :pending | :completed | {:error, reason}
Dala.Gpu.Compute.wait(cmd_id) # blocks until done
Pipeline Orchestration
pipeline = Dala.Gpu.Compute.pipeline()
pipeline
|> Dala.Gpu.Compute.pipeline_add(%{
op: :run_kernel,
kernel: :blur,
inputs: [input_buf],
output: temp_buf,
params: %{radius: 3}
})
|> Dala.Gpu.Compute.pipeline_add(%{
op: :run_kernel,
kernel: :relu,
inputs: [temp_buf],
output: output_buf,
params: %{}
})
Dala.Gpu.Compute.pipeline_run(pipeline)
Integration with Dala.Gpu surfaces
For rendering results to screen, pair a compute buffer with a Dala.Gpu.Surface:
{:ok, surface} = Dala.Gpu.create_surface(640, 480)
# Run compute → read buffer → upload to surface
Dala.Gpu.Compute.run_kernel(:generate_gradient, [], output_buf, %{})
pixels = Dala.Gpu.Compute.read(output_buf)
Dala.Gpu.set_pixels(surface, pixels)
Dala.Gpu.present(surface)
Supported Types
| Type | Description |
|---|---|
| :f32 | 32-bit float |
| :f64 | 64-bit float |
| :s32 | 32-bit signed integer |
| :s64 | 64-bit signed integer |
| :u32 | 32-bit unsigned integer |
| :u8 | 8-bit unsigned integer |
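The element widths in the table determine how many bytes a buffer of a given shape occupies. The following is a minimal sketch of that arithmetic; `DtypeSize` and its functions are hypothetical helpers for illustration and are not part of Dala.Gpu.Compute.

```elixir
defmodule DtypeSize do
  # Hypothetical helper (not part of Dala.Gpu.Compute).
  # Byte widths follow directly from the bit widths in the Supported Types table.
  @bytes %{f32: 4, f64: 8, s32: 4, s64: 8, u32: 4, u8: 1}

  # Raises KeyError for a dtype that is not in the table.
  def bytes_per_element(dtype), do: Map.fetch!(@bytes, dtype)

  # Total bytes = product of all dimensions * bytes per element.
  def byte_size_for(shape, dtype) when is_tuple(shape) do
    elements =
      shape
      |> Tuple.to_list()
      |> Enum.reduce(1, fn dim, acc -> dim * acc end)

    elements * bytes_per_element(dtype)
  end
end

IO.inspect(DtypeSize.byte_size_for({640, 480, 4}, :u8)) # 1228800
IO.inspect(DtypeSize.byte_size_for({256, 256}, :f32))   # 262144
```

A check like this can be useful before calling buffer_from_binary, where the binary length presumably has to agree with the declared shape and dtype.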
Mobile Notes
On iOS, CubeCL kernels compile to Metal shaders at runtime. On Android, they compile to OpenGL ES compute shaders. On desktop (dev), a CPU fallback is used.
GPU compute calls are scheduled on dirty CPU schedulers automatically, so long-running kernels do not block the regular BEAM schedulers.
Functions
@spec add( Dala.Gpu.Compute.Buffer.t(), Dala.Gpu.Compute.Buffer.t(), Dala.Gpu.Compute.Buffer.t() ) :: :ok | {:error, term()}
Elementwise addition: output = a + b
Example
c = Dala.Gpu.Compute.buffer_zeros({3}, :f32)
Dala.Gpu.Compute.add(a, b, c)
@spec buffer(list(), tuple(), atom()) :: Dala.Gpu.Compute.Buffer.t()
Create a GPU buffer from a list of values.
Options
:shape - tuple describing dimensions, e.g. {3} for a 1D vector of 3 elements
:dtype - data type atom (:f32, :f64, :s32, :s64, :u32, :u8)
Example
buf = Dala.Gpu.Compute.buffer([1.0, 2.0, 3.0], {3}, :f32)
@spec buffer_from_binary(binary(), tuple(), atom()) :: Dala.Gpu.Compute.Buffer.t()
Create a GPU buffer from a raw binary.
Example
buf = Dala.Gpu.Compute.buffer_from_binary(binary_data, {640, 480, 4}, :u8)
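As an illustration of what `binary_data` might look like for the {640, 480, 4} :u8 shape above, the sketch below builds a solid-colour image as a raw binary. The RGBA channel order and the interpretation of the last dimension as colour channels are assumptions; the source does not specify the pixel layout.

```elixir
# Assumed layout: 4 bytes per pixel in R, G, B, A order (not confirmed by the source).
width = 640
height = 480

# One opaque red pixel as four u8 channels.
pixel = <<255, 0, 0, 255>>

# Repeat the pixel once per image position to get a flat binary.
binary_data = :binary.copy(pixel, width * height)

IO.inspect(byte_size(binary_data)) # 1228800, i.e. 640 * 480 * 4 bytes
```

The resulting binary has exactly one byte per element of the declared shape, which is what a :u8 buffer of that shape would hold.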
@spec buffer_zeros(tuple(), atom()) :: Dala.Gpu.Compute.Buffer.t()
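Create a GPU buffer with the given shape and dtype, without supplying initial data. Despite the function name, the buffer is documented as uninitialized, so write to it before relying on its contents.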
Example
buf = Dala.Gpu.Compute.buffer_zeros({256, 256}, :f32)
@spec device_info() :: map()
Return GPU device information.
@spec dtype(Dala.Gpu.Compute.Buffer.t()) :: atom()
Return the data type of a buffer.
@spec free(Dala.Gpu.Compute.Buffer.t()) :: :ok
Free a GPU buffer and release all associated GPU memory.
@spec free_many([Dala.Gpu.Compute.Buffer.t()]) :: :ok
Free multiple GPU buffers at once.
@spec free_pipeline(Dala.Gpu.Compute.Pipeline.t()) :: :ok
Free a pipeline and its internal resources.
@spec from_nx(Nx.Tensor.t()) :: Dala.Gpu.Compute.Buffer.t()
Convert an Nx tensor to a GPU buffer.
Example
tensor = Nx.tensor([1.0, 2.0, 3.0])
buf = Dala.Gpu.Compute.from_nx(tensor)
@spec gpu?() :: boolean()
Return true if a real GPU is available (not CPU fallback).
@spec matmul( Dala.Gpu.Compute.Buffer.t(), Dala.Gpu.Compute.Buffer.t(), Dala.Gpu.Compute.Buffer.t() ) :: :ok | {:error, term()}
Matrix multiplication: output = a × b (matrix product, not elementwise)
Both input buffers must be 2D. Shape validation is performed by the kernel.
Example
a = Dala.Gpu.Compute.buffer(list_4, {2, 2}, :f32)
b = Dala.Gpu.Compute.buffer(list_4, {2, 2}, :f32)
c = Dala.Gpu.Compute.buffer_zeros({2, 2}, :f32)
Dala.Gpu.Compute.matmul(a, b, c)
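Because shape validation happens inside the kernel, it can be convenient to check shapes on the Elixir side before allocating the output buffer. The sketch below applies the standard matrix-product rule ({m, k} times {k, n} gives {m, n}); `MatmulShape` is a hypothetical helper, not part of the library.

```elixir
defmodule MatmulShape do
  # Hypothetical pre-check (not part of Dala.Gpu.Compute).
  # The repeated variable k forces the inner dimensions to be equal.
  def output_shape({m, k}, {k, n}), do: {:ok, {m, n}}

  # Both shapes are 2D but the inner dimensions differ.
  def output_shape({_, _}, {_, _}), do: {:error, :inner_dimensions_mismatch}

  # At least one shape is not a 2-tuple.
  def output_shape(_, _), do: {:error, :not_2d}
end

IO.inspect(MatmulShape.output_shape({2, 3}, {3, 4})) # {:ok, {2, 4}}
IO.inspect(MatmulShape.output_shape({2, 3}, {2, 4})) # {:error, :inner_dimensions_mismatch}
```

The shape returned in the :ok tuple is the one to pass when creating the output buffer.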
@spec multiply( Dala.Gpu.Compute.Buffer.t(), Dala.Gpu.Compute.Buffer.t(), Dala.Gpu.Compute.Buffer.t() ) :: :ok | {:error, term()}
Elementwise multiply: output = a * b
Example
Dala.Gpu.Compute.multiply(a, b, output)
@spec pipeline() :: Dala.Gpu.Compute.Pipeline.t()
Create a new empty GPU compute pipeline.
@spec pipeline_add(Dala.Gpu.Compute.Pipeline.t(), map()) :: Dala.Gpu.Compute.Pipeline.t()
Add a stage to a pipeline. Returns the pipeline for chaining.
@spec pipeline_run(Dala.Gpu.Compute.Pipeline.t()) :: :ok | {:error, term()}
Execute all stages in a pipeline sequentially.
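Sequential execution can be pictured as a fold over the stages. The sketch below models each stage as a zero-arity function and assumes the run stops at the first {:error, reason}; the source does not state how failures are handled, so that behaviour and the `SeqPipeline` module are assumptions for illustration only.

```elixir
defmodule SeqPipeline do
  # Hypothetical model of sequential stage execution (not the library's code).
  # Each stage is a zero-arity function returning :ok or {:error, reason}.
  def run(stages) do
    Enum.reduce_while(stages, :ok, fn stage, :ok ->
      case stage.() do
        :ok -> {:cont, :ok}
        # Assumption: the first failing stage aborts the remaining stages.
        {:error, _reason} = err -> {:halt, err}
      end
    end)
  end
end

stages = [
  fn -> :ok end,
  fn -> {:error, :bad_shape} end,
  fn -> raise "never reached" end
]

IO.inspect(SeqPipeline.run(stages)) # {:error, :bad_shape}
```

Under this model, the output buffer of a later stage is only meaningful when every earlier stage returned :ok.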
@spec poll(non_neg_integer()) :: :pending | :completed | {:error, term()}
Poll an async command. Returns :pending, :completed, or {:error, reason}.
@spec read(Dala.Gpu.Compute.Buffer.t()) :: binary()
Read data from a GPU buffer back to an Elixir binary.
EXCubeCL 0.3+ returns {:ok, binary()} from read/1.
This function unwraps that tuple and returns the binary directly, which keeps the Dala API simpler.
Example
data = Dala.Gpu.Compute.read(buf)
# a binary holding 5.0, 7.0, 9.0 as three 32-bit floats
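Since read/1 returns raw bytes, the values have to be decoded before they can be used as numbers. The sketch below decodes an :f32 binary by hand, assuming the floats are stored in the host's native byte order (the source does not state the byte order); `F32Decode` is a hypothetical helper. read_list/1 is documented as performing this kind of conversion for you.

```elixir
defmodule F32Decode do
  # Hypothetical helper (not part of Dala.Gpu.Compute).
  # Assumes 32-bit floats in the host's native byte order.
  def to_list(bin) when is_binary(bin) and rem(byte_size(bin), 4) == 0 do
    for <<x::float-32-native <- bin>>, do: x
  end
end

# Stand-in for the binary that read/1 would return for the buffer above.
bin = <<5.0::float-32-native, 7.0::float-32-native, 9.0::float-32-native>>

IO.inspect(byte_size(bin))         # 12
IO.inspect(F32Decode.to_list(bin)) # [5.0, 7.0, 9.0]
```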
@spec read_binary(Dala.Gpu.Compute.Buffer.t()) :: binary()
Read data from a GPU buffer as a raw binary (zero-copy when possible).
@spec read_list(Dala.Gpu.Compute.Buffer.t()) :: list()
Read data from a GPU buffer and convert to an Elixir list.
@spec relu(Dala.Gpu.Compute.Buffer.t(), Dala.Gpu.Compute.Buffer.t()) :: :ok | {:error, term()}
Elementwise ReLU activation: output = max(0, input)
Example
Dala.Gpu.Compute.relu(input, output)
@spec run_kernel( atom(), [Dala.Gpu.Compute.Buffer.t()], Dala.Gpu.Compute.Buffer.t(), map() ) :: :ok | {:error, term()}
Run a named kernel synchronously.
Parameters
kernel - kernel atom (e.g. :elementwise_add, :relu, :blur)
inputs - list of input Buffer structs
output - output Buffer struct
params - map of kernel-specific parameters
Example
Dala.Gpu.Compute.run_kernel(:elementwise_add, [a, b], c, %{})
Dala.Gpu.Compute.run_kernel(:relu, [a], b, %{slope: 0.1})
Dala.Gpu.Compute.run_kernel(:blur, [image_buf], out_buf, %{radius: 3, sigma: 1.5})
@spec run_kernel_async( atom(), [Dala.Gpu.Compute.Buffer.t()], Dala.Gpu.Compute.Buffer.t(), map() ) :: :ok | {:error, term()}
Run a kernel asynchronously and wait for completion.
@spec run_to_surface( atom(), [Dala.Gpu.Compute.Buffer.t()], Dala.Gpu.Compute.Buffer.t(), pid(), map() ) :: :ok | {:error, term()}
Run a compute kernel and upload the result directly to a GPU surface.
This is a convenience function that combines kernel execution with surface pixel upload, avoiding an intermediate read-back to the CPU.
Example
{:ok, surface} = Dala.Gpu.create_surface(640, 480)
Dala.Gpu.Compute.run_to_surface(kernel, [input_buf], output_buf, surface, %{})
@spec scale(Dala.Gpu.Compute.Buffer.t(), number(), Dala.Gpu.Compute.Buffer.t()) :: :ok | {:error, term()}
Scalar multiply: output = input * scalar
Example
Dala.Gpu.Compute.scale(input, 2.5, output)
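The elementwise helpers (add, multiply, relu, scale) each implement a one-line formula, so their results are easy to cross-check against plain lists in tests. The sketch below writes those formulas as list operations; `CpuReference` is a hypothetical module for comparison purposes and says nothing about how the kernels themselves are implemented.

```elixir
defmodule CpuReference do
  # Hypothetical list-based versions of the documented formulas (test aid only).

  # output = a + b
  def add(a, b), do: Enum.zip_with(a, b, fn x, y -> x + y end)

  # output = a * b (elementwise)
  def multiply(a, b), do: Enum.zip_with(a, b, fn x, y -> x * y end)

  # output = max(0, input)
  def relu(xs), do: Enum.map(xs, fn x -> max(0.0, x) end)

  # output = input * scalar
  def scale(xs, scalar), do: Enum.map(xs, fn x -> x * scalar end)
end

IO.inspect(CpuReference.add([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])) # [5.0, 7.0, 9.0]
IO.inspect(CpuReference.scale([1.0, 2.0], 2.5))                # [2.5, 5.0]
IO.inspect(CpuReference.relu([-1.0, 2.0]))                     # [0.0, 2.0]
```

When comparing against values read back from an :f32 buffer, allow for rounding, since Elixir floats are 64-bit.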
@spec shape(Dala.Gpu.Compute.Buffer.t()) :: tuple()
Return the shape of a buffer.
@spec size(Dala.Gpu.Compute.Buffer.t()) :: non_neg_integer()
Return the size of a buffer in bytes.
@spec submit(map()) :: non_neg_integer()
Submit a compute command asynchronously. Returns a command ID for polling.
The spec map is encoded as a string for EXCubeCL 0.3+.
Example
cmd_id = Dala.Gpu.Compute.submit(%{
op: :run_kernel,
kernel: "relu",
inputs: [a.ref],
output: b.ref,
params: %{}
})
# Later...
case Dala.Gpu.Compute.poll(cmd_id) do
:completed -> Dala.Gpu.Compute.read(b)
{:error, reason} -> handle_error(reason)
:pending -> retry_later()
end
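The `retry_later()` branch above is left open; one way to fill it in is a bounded polling loop. The sketch below takes the poll call as a function argument so it runs without a GPU; `PollLoop`, the attempt count, and the interval are all hypothetical choices rather than library behaviour. When an unbounded wait is acceptable, wait/1 already covers that case.

```elixir
defmodule PollLoop do
  # Hypothetical helper (not part of Dala.Gpu.Compute).
  # poll_fun is a zero-arity function returning
  # :pending | :completed | {:error, reason}.
  def await(poll_fun, attempts \\ 50, interval_ms \\ 10)

  # Out of attempts: report a timeout instead of blocking forever.
  def await(_poll_fun, 0, _interval_ms), do: {:error, :timeout}

  def await(poll_fun, attempts, interval_ms) when attempts > 0 do
    case poll_fun.() do
      :completed ->
        :ok

      {:error, reason} ->
        {:error, reason}

      :pending ->
        Process.sleep(interval_ms)
        await(poll_fun, attempts - 1, interval_ms)
    end
  end
end

# Stand-in for Dala.Gpu.Compute.poll/1: pending twice, then completed.
{:ok, counter} = Agent.start_link(fn -> 0 end)

fake_poll = fn ->
  n = Agent.get_and_update(counter, fn n -> {n, n + 1} end)
  if n < 2, do: :pending, else: :completed
end

IO.inspect(PollLoop.await(fake_poll, 10, 1)) # :ok
```

With the real API, the call would look like `PollLoop.await(fn -> Dala.Gpu.Compute.poll(cmd_id) end)`.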
@spec to_nx(Dala.Gpu.Compute.Buffer.t(), tuple(), atom()) :: Nx.Tensor.t()
Convert a GPU buffer to an Nx tensor.
Example
tensor = Dala.Gpu.Compute.to_nx(buf, {3}, :f32)
@spec version() :: String.t()
Return the EXCubeCL version string.
@spec wait(non_neg_integer()) :: :ok | {:error, term()}
Block until an async command completes. Returns :ok or {:error, reason}.