Behaviour every compute provider must implement.
A "provider" is any module that can spawn, control, and terminate GPU (or
CPU) compute resources on some cloud. ExAtlas ships a full RunPod implementation
and stubs for Fly.io Machines, Lambda Labs, and Vast.ai. Users can supply
their own module — the top-level ExAtlas API accepts any module name as a
:provider value, so in-house clouds or test doubles plug in without a PR.
Contract summary
Every callback except capabilities/0 receives a ctx: a map holding the API key
and any per-call overrides resolved by ExAtlas.Config. Callbacks return either a
normalized struct (ExAtlas.Spec.Compute, ExAtlas.Spec.Job, ...) or a tagged error
tuple shaped by ExAtlas.Error; stream_job/2 instead returns a lazy Enumerable.
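As a rough illustration of this contract, the sketch below shows a callback-shaped function and a caller that branches on its result. Everything here is an assumption made for the example: the local Compute and Error structs stand in for ExAtlas.Spec.Compute and ExAtlas.Error, the :api_key key name and the :unauthorized kind are hypothetical, and the {:ok, struct} success wrapper is assumed rather than confirmed by the text above.

```elixir
defmodule ContractSketch do
  # Hypothetical stand-ins for ExAtlas.Spec.Compute and ExAtlas.Error,
  # defined locally so the sketch runs without ExAtlas installed.
  defmodule Compute do
    defstruct [:id, :status]
  end

  defmodule Error do
    defstruct [:kind, :message]
  end

  # Callback-shaped function: takes an id plus the ctx map and returns a
  # tagged tuple. The :api_key key name is an assumption.
  def get_compute(_id, %{api_key: nil}) do
    {:error, %Error{kind: :unauthorized, message: "missing API key"}}
  end

  def get_compute(id, %{api_key: _key}) do
    {:ok, %Compute{id: id, status: :running}}
  end

  # Callers branch on the tag instead of rescuing exceptions.
  def describe(result) do
    case result do
      {:ok, %Compute{id: id, status: status}} -> "#{id} is #{status}"
      {:error, %Error{kind: kind}} -> "failed: #{kind}"
    end
  end
end
```

Keeping errors as data lets callers pattern match on the kind field rather than parsing messages.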
Capabilities
Not every provider supports every operation. capabilities/0 returns the
list of atoms the provider honors (e.g. :serverless, :spot, :http_proxy).
Callers that depend on an optional feature should check capabilities/0 first
rather than catching {:error, %ExAtlas.Error{kind: :unsupported}}.
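A capability check of this kind might look like the sketch below. The two provider modules, the supports?/2 helper, and the :serverless_not_supported error atom are hypothetical and exist only for this example; a real provider would return ExAtlas structs and errors.

```elixir
defmodule CapabilityCheck do
  # Hypothetical provider that only advertises the HTTPS proxy feature.
  defmodule ProxyOnlyProvider do
    def capabilities, do: [:http_proxy]
  end

  # Hypothetical provider that also advertises serverless jobs.
  defmodule ServerlessProvider do
    def capabilities, do: [:http_proxy, :serverless]
    def run_job(request, _ctx), do: {:ok, %{id: "job-1", input: request}}
  end

  # True when the provider module lists the capability.
  def supports?(provider, capability) do
    capability in provider.capabilities()
  end

  # Calls run_job/2 only when :serverless is advertised; otherwise returns
  # a tagged error without touching the provider.
  def maybe_run_job(provider, request, ctx) do
    if supports?(provider, :serverless) do
      provider.run_job(request, ctx)
    else
      {:error, :serverless_not_supported}
    end
  end
end
```

Checking up front keeps the unsupported path out of the provider call entirely, so no request is sent just to discover that a feature is missing.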
Writing your own provider
    defmodule MyCloud.Provider do
      @behaviour ExAtlas.Provider

      @impl true
      def spawn_compute(%ExAtlas.Spec.ComputeRequest{} = req, ctx) do
        # translate `req` into MyCloud's native payload and POST it
      end

      @impl true
      def capabilities, do: [:http_proxy]

      # ... all other callbacks ...
    end

    # Use it
    ExAtlas.spawn_compute([provider: MyCloud.Provider, gpu: :a100_80g, ...])
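The same mechanism can serve test doubles. The sketch below is a standalone approximation, not ExAtlas's actual implementation: FakeProvider and Dispatch are hypothetical modules, the request is a plain map rather than an ExAtlas.Spec.ComputeRequest, and the ctx contents are made up. In a real project the fake would declare @behaviour ExAtlas.Provider.

```elixir
defmodule FakeProvider do
  # Hypothetical test double. The @behaviour declaration is omitted so the
  # sketch runs without ExAtlas installed.
  def capabilities, do: []

  # Echoes the request back as a plain map instead of calling a cloud API.
  def spawn_compute(request, _ctx) do
    {:ok, %{id: "fake-1", status: :running, gpu: request.gpu}}
  end
end

defmodule Dispatch do
  # Hypothetical dispatcher: pops :provider from the options and forwards
  # the remaining options to that module as a map.
  def spawn_compute(opts) do
    {provider, rest} = Keyword.pop!(opts, :provider)
    provider.spawn_compute(Map.new(rest), %{api_key: "test-key"})
  end
end
```

Because the provider is just a module name passed at call time, swapping in the fake needs no configuration change or global state.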
Callbacks
Cancel an in-flight job.
@callback capabilities() :: [atom()]
List the capabilities the provider honors. Examples:
:spot - can rent interruptible instances
:serverless - supports run_job/2
:network_volumes - can attach persistent storage
:http_proxy - auto-terminated HTTPS proxy per pod
:raw_tcp - public IP + mapped TCP ports
:symmetric_ports - inside-port == outside-port guarantee
:webhooks - push completion callbacks
:global_networking - private networking across datacenters
@callback get_compute(id(), ctx()) :: result(ExAtlas.Spec.Compute.t())
Fetch the current state of a resource by provider id.
@callback get_job(id(), ctx()) :: result(ExAtlas.Spec.Job.t())
Fetch a job's status by id.
@callback list_compute(keyword(), ctx()) :: result([ExAtlas.Spec.Compute.t()])
List resources; providers should honor at minimum :status and :name filters.
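One way a provider might apply these two filters to resources it has already fetched is sketched below. The ListFilters module is hypothetical, resources are plain maps rather than ExAtlas.Spec.Compute structs, and exact-equality matching is an assumption; the text above does not specify how filters match.

```elixir
defmodule ListFilters do
  # Keeps only resources matching every supported filter present in
  # `filters`. Filter keys other than :status and :name are ignored.
  def apply_filters(resources, filters) do
    Enum.filter(resources, fn resource ->
      Enum.all?([:status, :name], fn key ->
        case Keyword.fetch(filters, key) do
          {:ok, wanted} -> Map.get(resource, key) == wanted
          :error -> true
        end
      end)
    end)
  end
end
```

An empty filter list returns every resource, since a filter that is absent never excludes anything.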
@callback list_gpu_types(ctx()) :: result([ExAtlas.Spec.GpuType.t()])
Return the provider's catalog of GPU types and current prices.
@callback run_job(ExAtlas.Spec.JobRequest.t(), ctx()) :: result(ExAtlas.Spec.Job.t())
Submit a serverless job. Returns {:error, %ExAtlas.Error{kind: :unsupported}} if the provider does not support serverless jobs.
@callback spawn_compute(ExAtlas.Spec.ComputeRequest.t(), ctx()) :: result(ExAtlas.Spec.Compute.t())
Provision a compute resource from a normalized ComputeRequest.
Resume a previously stopped resource.
Stop a resource without destroying its storage (resume-able).
@callback stream_job(id(), ctx()) :: Enumerable.t()
Stream intermediate outputs for a job. Returns a lazy Enumerable.
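A lazy Enumerable of this kind can be built with Stream.resource/3 from the Elixir standard library. The sketch below assumes a hypothetical fetch_chunk function standing in for a provider API call; its cursor argument and return shapes are invented for the example.

```elixir
defmodule JobStream do
  # Builds a lazy stream of job outputs. `fetch_chunk` is a hypothetical
  # 1-arity function: given a cursor it returns
  # {:ok, outputs, next_cursor} or :done.
  def stream(fetch_chunk) do
    Stream.resource(
      # Start at cursor 0; nothing is fetched until the stream is consumed.
      fn -> 0 end,
      fn cursor ->
        case fetch_chunk.(cursor) do
          {:ok, outputs, next_cursor} -> {outputs, next_cursor}
          :done -> {:halt, cursor}
        end
      end,
      # No cleanup is needed in this sketch.
      fn _cursor -> :ok end
    )
  end
end
```

Because the stream is lazy, a consumer that takes only the first output triggers only the fetches needed to produce it.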
Destroy a resource and its ephemeral storage.