ExAtlas.Provider behaviour (ExAtlas v0.5.0)


Behaviour every compute provider must implement.

A "provider" is any module that can spawn, control, and terminate GPU (or CPU) compute resources on some cloud. ExAtlas ships a full RunPod implementation and stubs for Fly.io Machines, Lambda Labs, and Vast.ai. Users can supply their own module — the top-level ExAtlas API accepts any module name as a :provider value, so in-house clouds or test doubles plug in without a PR.

Contract summary

All callbacks receive a ctx — a map holding the API key and any per-call overrides resolved by ExAtlas.Config. Callbacks return either a normalized struct (ExAtlas.Spec.Compute, ExAtlas.Spec.Job, ...) or a tagged error tuple shaped by ExAtlas.Error.

Capabilities

Not every provider supports every operation. capabilities/0 returns the list of atoms the provider honors (e.g. :serverless, :spot, :http_proxy). Callers that depend on an optional feature should check capabilities first rather than catching {:error, %ExAtlas.Error{kind: :unsupported}}.
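The check-first approach described above might look like the following minimal sketch. `CapabilitySketch`, `FakeProvider`, and `with_capability/3` are hypothetical names used only for illustration; the only thing assumed about a real provider is that it exports `capabilities/0`.

```elixir
defmodule CapabilitySketch do
  # Hypothetical stand-in for a provider module; only capabilities/0 matters here.
  defmodule FakeProvider do
    def capabilities, do: [:http_proxy, :spot]
  end

  # Returns true when the provider module advertises the given capability.
  def supports?(provider, capability) when is_atom(provider) and is_atom(capability) do
    capability in provider.capabilities()
  end

  # Runs `fun` only if the capability is advertised; otherwise returns a
  # tagged error without making any provider call.
  def with_capability(provider, capability, fun) when is_function(fun, 0) do
    if supports?(provider, capability) do
      fun.()
    else
      {:error, {:unsupported, capability}}
    end
  end
end
```

The error tuple returned here is an illustrative shape, not the one ExAtlas produces; the point is that the caller branches before issuing a request instead of rescuing a failure afterwards.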

Writing your own provider

defmodule MyCloud.Provider do
  @behaviour ExAtlas.Provider

  @impl true
  def spawn_compute(%ExAtlas.Spec.ComputeRequest{} = req, ctx) do
    # translate `req` into MyCloud's native payload and POST it
  end

  @impl true
  def capabilities, do: [:http_proxy]

  # ... all other callbacks ...
end

# Use it
ExAtlas.spawn_compute([provider: MyCloud.Provider, gpu: :a100_80g, ...])

Summary

Callbacks

cancel_job(id, ctx)

Cancel an in-flight job.

capabilities()

List the capabilities the provider honors.

get_compute(id, ctx)

Fetch the current state of a resource by provider id.

get_job(id, ctx)

Fetch a job's status by id.

list_compute(keyword, ctx)

List resources; providers should honor at minimum :status and :name filters.

list_gpu_types(ctx)

Return the provider's catalog of GPU types and current prices.

run_job(t, ctx)

Submit a serverless job. Returns {:error, :unsupported} if the provider does not support serverless jobs.

spawn_compute(t, ctx)

Provision a compute resource from a normalized ComputeRequest.

start(id, ctx)

Resume a previously stopped resource.

stop(id, ctx)

Stop a resource without destroying its storage (resume-able).

stream_job(id, ctx)

Stream intermediate outputs for a job. Returns a lazy Enumerable.

terminate(id, ctx)

Destroy a resource and its ephemeral storage.

Types

ctx()

@type ctx() :: %{
  :api_key => String.t() | nil,
  :provider => atom(),
  optional(:base_url) => String.t(),
  optional(:req_options) => keyword(),
  optional(atom()) => term()
}
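For tests or a custom provider, a map with this shape can be built by hand. The sketch below assumes nothing beyond the type above; `CtxSketch`, `build/3`, and `base_url/2` are hypothetical helpers, not part of ExAtlas, which resolves the real ctx through ExAtlas.Config.

```elixir
defmodule CtxSketch do
  # Builds a map matching the ctx() shape: :api_key and :provider are the
  # required keys, everything passed in `overrides` is optional.
  def build(provider, api_key, overrides \\ []) when is_atom(provider) do
    Map.merge(%{api_key: api_key, provider: provider}, Map.new(overrides))
  end

  # Reads an optional key with a fallback, as a provider might do for :base_url.
  def base_url(ctx, default), do: Map.get(ctx, :base_url, default)
end
```

Because :base_url and :req_options are optional keys, provider code would typically read them with a default rather than pattern match on them.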

id()

@type id() :: String.t()

result(t)

@type result(t) :: {:ok, t} | {:error, ExAtlas.Error.t() | term()}
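Since the error side of result(t) may carry either an error struct or an arbitrary term, callers usually need a clause for each shape. A minimal sketch follows, using a hypothetical `ResultSketch.Error` struct as a stand-in for ExAtlas.Error (only its :kind field is assumed, because that is the only field this page shows).

```elixir
defmodule ResultSketch do
  defmodule Error do
    # Hypothetical stand-in for ExAtlas.Error; only :kind is assumed.
    defstruct [:kind]
  end

  # Passes successes through and collapses both permitted error shapes
  # into {:error, kind} or {:error, {:other, term}}.
  def classify({:ok, value}), do: {:ok, value}
  def classify({:error, %Error{kind: kind}}), do: {:error, kind}
  def classify({:error, other}), do: {:error, {:other, other}}
end
```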

Callbacks

cancel_job(id, ctx)

(optional)
@callback cancel_job(id(), ctx()) :: :ok | {:error, term()}

Cancel an in-flight job.

capabilities()

@callback capabilities() :: [atom()]

List the capabilities the provider honors. Examples:

  • :spot — can rent interruptible instances
  • :serverless — supports run_job/2
  • :network_volumes — can attach persistent storage
  • :http_proxy — auto-terminated HTTPS proxy per pod
  • :raw_tcp — public IP + mapped TCP ports
  • :symmetric_ports — inside-port == outside-port guarantee
  • :webhooks — push completion callbacks
  • :global_networking — private networking across datacenters

get_compute(id, ctx)

@callback get_compute(id(), ctx()) :: result(ExAtlas.Spec.Compute.t())

Fetch the current state of a resource by provider id.

get_job(id, ctx)

(optional)
@callback get_job(id(), ctx()) :: result(ExAtlas.Spec.Job.t())

Fetch a job's status by id.

list_compute(keyword, ctx)

@callback list_compute(
  keyword(),
  ctx()
) :: result([ExAtlas.Spec.Compute.t()])

List resources; providers should honor at minimum :status and :name filters.
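One way a provider or test double might honor the :status and :name filters is sketched below, over plain maps rather than ExAtlas.Spec.Compute structs. `FilterSketch` is a hypothetical module, and silently ignoring unknown filter keys is an assumption, not documented behaviour.

```elixir
defmodule FilterSketch do
  # Keeps only the resources that match every supported filter.
  # Unknown filter keys are ignored (an assumption of this sketch).
  def apply_filters(resources, filters) do
    Enum.filter(resources, fn resource ->
      Enum.all?(filters, fn
        {:status, status} -> resource.status == status
        {:name, name} -> resource.name == name
        {_other, _value} -> true
      end)
    end)
  end
end
```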

list_gpu_types(ctx)

(optional)
@callback list_gpu_types(ctx()) :: result([ExAtlas.Spec.GpuType.t()])

Return the provider's catalog of GPU types and current prices.

run_job(t, ctx)

(optional)
@callback run_job(ExAtlas.Spec.JobRequest.t(), ctx()) :: result(ExAtlas.Spec.Job.t())

Submit a serverless job. Returns {:error, :unsupported} if the provider does not support serverless jobs.

spawn_compute(t, ctx)

@callback spawn_compute(ExAtlas.Spec.ComputeRequest.t(), ctx()) ::
  result(ExAtlas.Spec.Compute.t())

Provision a compute resource from a normalized ComputeRequest.

start(id, ctx)

@callback start(id(), ctx()) :: :ok | {:error, term()}

Resume a previously stopped resource.

stop(id, ctx)

@callback stop(id(), ctx()) :: :ok | {:error, term()}

Stop a resource without destroying its storage (resume-able).

stream_job(id, ctx)

(optional)
@callback stream_job(id(), ctx()) :: Enumerable.t()

Stream intermediate outputs for a job. Returns a lazy Enumerable.
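A lazy Enumerable of this kind can be built with Stream.resource/3 from the standard library. In the sketch below, `StreamSketch` and the `fetch_chunk` function are hypothetical; a real provider would poll its own API inside `fetch_chunk`, and no request is made until the stream is consumed.

```elixir
defmodule StreamSketch do
  # `fetch_chunk` is a hypothetical 1-arity function: given a cursor it
  # returns {:ok, items, next_cursor} or :done.
  def stream(fetch_chunk) when is_function(fetch_chunk, 1) do
    Stream.resource(
      # Start with cursor 0; nothing is fetched until the stream is enumerated.
      fn -> 0 end,
      fn cursor ->
        case fetch_chunk.(cursor) do
          {:ok, items, next_cursor} -> {items, next_cursor}
          :done -> {:halt, cursor}
        end
      end,
      # Cleanup hook; a real provider might close a connection here.
      fn _cursor -> :ok end
    )
  end
end
```

Consumers can then stop early with Enum.take/2 or similar, and only the chunks needed to satisfy the request are fetched.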

terminate(id, ctx)

@callback terminate(id(), ctx()) :: :ok | {:error, term()}

Destroy a resource and its ephemeral storage.
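The relationship between start, stop, and terminate can be pictured as a small state machine. The sketch below is an illustration only: the status atoms (:running, :stopped, :terminated) and the `LifecycleSketch` module are assumptions, and real providers may expose more states or reject different transitions.

```elixir
defmodule LifecycleSketch do
  # Pure transition function over hypothetical status atoms.
  # stop keeps storage, so the resource can be started again.
  def transition(:running, :stop), do: {:ok, :stopped}
  def transition(:stopped, :start), do: {:ok, :running}

  # terminate is final from either live state.
  def transition(status, :terminate) when status in [:running, :stopped],
    do: {:ok, :terminated}

  # Anything else is rejected with a descriptive tuple.
  def transition(status, action), do: {:error, {:invalid_transition, status, action}}
end
```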