
A composable, pluggable Elixir SDK for infrastructure management. Two concerns under one roof:

  1. GPU / CPU compute across cloud providers. Spawn pods, run serverless inference, orchestrate transient per-user GPU sessions. Swap providers by changing one option.
  2. Fly.io platform operations. First-class deploys, log streaming, and token lifecycle — independent of the compute pipeline. See ExAtlas.Fly and the Fly guide.
  • One contract, many providers. ExAtlas.Provider is a behaviour; swap :runpod, :fly, :lambda_labs, :vast, or your own module without changing call sites.
  • Fly.io platform ops. ExAtlas.Fly.* handles fly deploy streaming, log tailing, and the full token resolution chain (ETS → DETS → ~/.fly/config.yml → fly tokens create). Works without Phoenix.
  • Batteries-included orchestration. Opt-in Registry + DynamicSupervisor + Reaper + PubSub supervision tree that tracks every spawned resource and terminates it on idle timeout.
  • Igniter installer. mix igniter.install ex_atlas wires everything up.
  • Built for the S3-style handoff. ExAtlas.Auth mints bearer tokens and S3-style HMAC-signed URLs so your browser can talk directly to a pod without the Phoenix app proxying every frame.
  • Pure Req under the hood. Every HTTP call goes through Req, so you get retries, decoding, and telemetry for free.
  • LiveDashboard included. Drop ExAtlas.LiveDashboard.ComputePage into your existing dashboard and get a live ops view of every tracked pod.

Installation

The one-liner — uses the Igniter installer to add the dep, write sensible config, and create storage directories:

mix igniter.install ex_atlas

Or add manually to mix.exs:

def deps do
  [
    {:ex_atlas, "~> 0.2"}
  ]
end

…then run mix ex_atlas.install once to wire config defaults, or configure things yourself (see Configuration).

For the optional orchestrator + LiveDashboard features, also include:

{:phoenix_pubsub, "~> 2.1"},           # PubSub broadcasts from the orchestrator
{:phoenix_live_dashboard, "~> 0.8"}    # ExAtlas.LiveDashboard.ComputePage tab

ExAtlas declares both as optional: true, so they are not pulled into pure library consumers.

Upgrading

To upgrade ExAtlas and run any version-specific migrations:

mix deps.update ex_atlas
mix ex_atlas.upgrade

The upgrade task is idempotent and runs only the steps needed between your previous and current ExAtlas version.

Architecture at a glance


  ExAtlas (top-level provider-agnostic API)
  ExAtlas.spawn_compute/1 · run_job/2 · stream_job/1 · terminate/1
                          │
                          ▼
  ExAtlas.Provider (behaviour)  ·  ExAtlas.Spec.* (normalized structs)
                          │
                          ▼
  RunPod    Fly    Lambda Labs    Vast    Mock (test)
   v0.1    v0.2       v0.2        v0.3      v0.1


  ExAtlas.Orchestrator (opt-in supervision tree)
  ComputeServer (one GenServer per resource) · Registry · DynamicSupervisor
  · Reaper · PubSub events


  ExAtlas.Auth
  Token (bearer mint/verify) · SignedUrl (S3-style HMAC)


  ExAtlas.LiveDashboard.ComputePage
  Live-refreshing table · per-row Touch/Stop/Terminate

Quick start — Fly.io platform ops

ExAtlas gives you a clean Elixir API over fly deploy, the Fly Machines log API, and Fly token lifecycle. Works with or without Phoenix.

Discover apps

ExAtlas.Fly.discover_apps("/path/to/project")
# => [{"my-api", "/path/to/project"}, {"my-web", "/path/to/project/web"}]

Tail logs

ExAtlas.Fly.subscribe_logs("my-api", "/path/to/project")

# In the subscriber:
def handle_info({:ex_atlas_fly_logs, "my-api", entries}, state) do
  # entries :: [ExAtlas.Fly.Logs.LogEntry.t()]
  ...
end

A single streamer runs per app regardless of subscriber count, and stops once all subscribers disconnect. Automatic 401 retry is built in.

Stream a deploy

ExAtlas.Fly.subscribe_deploy(ticket_id)
Task.start(fn ->
  ExAtlas.Fly.stream_deploy(project_path, "web", ticket_id)
end)

def handle_info({:ex_atlas_fly_deploy, ^ticket_id, line}, state) do
  ...
end

Deploys are guarded by a 5 min activity timer (resets on output) and a 30 min absolute cap.

Tokens

ExAtlas.Fly.Tokens resolves tokens via ETS → DETS (durable) → ~/.fly/config.yml → fly tokens create readonly → manual override. You usually don't call it directly — the log client uses it transparently — but you can:

{:ok, token} = ExAtlas.Fly.Tokens.get("my-api")
ExAtlas.Fly.Tokens.invalidate("my-api")
ExAtlas.Fly.Tokens.set_manual("my-api", "fo1_...")

Full docs: Fly guide.

Quick start — transient per-user GPU pod

The motivating use case: a Fly.io-hosted Phoenix app spawns a RunPod GPU per user, hands the browser a preshared key, the browser runs real-time video inference directly against the pod, and ExAtlas reaps the pod when the session ends or goes idle.

# config/config.exs
config :ex_atlas, default_provider: :runpod
config :ex_atlas, :runpod, api_key: System.get_env("RUNPOD_API_KEY")
config :ex_atlas, start_orchestrator: true

# LiveView.mount/3
{:ok, pid, compute} =
  ExAtlas.Orchestrator.spawn(
    gpu: :h100,
    image: "ghcr.io/me/my-inference-server:latest",
    ports: [{8000, :http}],
    auth: :bearer,
    user_id: socket.assigns.current_user.id,
    idle_ttl_ms: 15 * 60_000,
    name: "atlas-" <> to_string(socket.assigns.current_user.id)
  )

Phoenix.PubSub.subscribe(ExAtlas.PubSub, "compute:" <> compute.id)

assign(socket,
  inference_url: hd(compute.ports).url,       # https://<pod-id>-8000.proxy.runpod.net
  inference_token: compute.auth.token         # handed straight to the browser
)

Inside the inference server running in the pod:

# Any request from the browser must carry the preshared key.
def authenticated?(conn) do
  preshared = System.fetch_env!("ATLAS_PRESHARED_KEY")

  case Plug.Conn.get_req_header(conn, "authorization") do
    ["Bearer " <> token] -> Plug.Crypto.secure_compare(token, preshared)
    _ -> false
  end
end

Heartbeat while the browser is active:

ExAtlas.Orchestrator.touch(compute.id)

When the user leaves, or after idle_ttl_ms with no heartbeat, the ComputeServer shuts down and terminates the upstream pod automatically. You can also terminate manually:

:ok = ExAtlas.Orchestrator.stop_tracked(compute.id)
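
The heartbeat above can be driven by a timer inside the LiveView. A sketch, assuming the compute id was stored in assigns at mount (the :heartbeat message name and 60-second interval are illustrative, not part of ExAtlas):

```elixir
# Illustrative heartbeat loop; ExAtlas only requires that touch/1 is called
# more often than :idle_ttl_ms while the session is active.
@heartbeat_ms 60_000

def mount(_params, _session, socket) do
  # ... spawn the pod as shown above, then schedule the first tick ...
  Process.send_after(self(), :heartbeat, @heartbeat_ms)
  {:ok, socket}
end

def handle_info(:heartbeat, socket) do
  ExAtlas.Orchestrator.touch(socket.assigns.compute_id)
  Process.send_after(self(), :heartbeat, @heartbeat_ms)
  {:noreply, socket}
end
```

Any interval comfortably shorter than :idle_ttl_ms keeps the session alive.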

Quick start — serverless inference

{:ok, job} =
  ExAtlas.run_job(
    provider: :runpod,
    endpoint: "abc123",
    input: %{prompt: "a beautiful sunset"},
    mode: :async
  )

{:ok, done} = ExAtlas.get_job(job.id, provider: :runpod, endpoint: "abc123")
done.output

# Synchronous with a hard timeout (wrapped in Task.async + Task.yield internally)
{:ok, done} =
  ExAtlas.run_job(
    provider: :runpod,
    endpoint: "abc123",
    input: %{prompt: "a beautiful sunset"},
    mode: :sync,
    timeout_ms: 60_000
  )

# Stream partial output
ExAtlas.stream_job(job.id, provider: :runpod, endpoint: "abc123")
|> Enum.each(&IO.inspect/1)

Swapping providers

# Today
ExAtlas.spawn_compute(provider: :runpod,      gpu: :h100, image: "...")

# v0.2
ExAtlas.spawn_compute(provider: :fly,         gpu: :a100_80g, image: "...")
ExAtlas.spawn_compute(provider: :lambda_labs, gpu: :h100, image: "...")

# v0.3
ExAtlas.spawn_compute(provider: :vast,        gpu: :rtx_4090, image: "...")

# Your in-house cloud, today:
ExAtlas.spawn_compute(provider: MyCompany.Cloud.Provider, gpu: :h100, image: "...")

All built-in and user-defined providers implement ExAtlas.Provider.

Configuration

# config/config.exs

# Provider resolution: per-call :provider option > :default_provider > raise
config :ex_atlas, default_provider: :runpod

# API keys: per-call :api_key > :ex_atlas / :<provider> config > env var
config :ex_atlas, :runpod,      api_key: System.get_env("RUNPOD_API_KEY")
config :ex_atlas, :fly,         api_key: System.get_env("FLY_API_TOKEN")
config :ex_atlas, :lambda_labs, api_key: System.get_env("LAMBDA_LABS_API_KEY")
config :ex_atlas, :vast,        api_key: System.get_env("VAST_API_KEY")

# Start the orchestrator (Registry + DynamicSupervisor + PubSub + Reaper).
# When false (default), ExAtlas boots no processes.
config :ex_atlas, start_orchestrator: true

# Reaper: periodic orphan reconciliation and idle-TTL enforcement.
config :ex_atlas, :orchestrator,
  reap_interval_ms: 60_000,
  reap_providers: [:runpod],
  reap_name_prefix: "atlas-"     # safety switch: only reap resources ExAtlas spawned

Default environment variable names used when nothing else is set:

Provider        Env var
:runpod         RUNPOD_API_KEY
:fly            FLY_API_TOKEN
:lambda_labs    LAMBDA_LABS_API_KEY
:vast           VAST_API_KEY

Providers

Provider      Module                        Version shipped  Capabilities
:runpod       ExAtlas.Providers.RunPod      v0.1             :spot, :serverless, :network_volumes, :http_proxy, :raw_tcp, :symmetric_ports, :webhooks, :global_networking
:fly          ExAtlas.Providers.Fly         v0.2 (stub)      :http_proxy, :raw_tcp, :global_networking
:lambda_labs  ExAtlas.Providers.LambdaLabs  v0.2 (stub)      :raw_tcp
:vast         ExAtlas.Providers.Vast        v0.3 (stub)      :spot, :raw_tcp
:mock         ExAtlas.Providers.Mock        v0.1 (tests)     :spot, :serverless, :network_volumes, :http_proxy, :raw_tcp, :webhooks

Stub modules return {:error, %ExAtlas.Error{kind: :unsupported}} from every callback except capabilities/0, so the name is reserved and callers get a clear error instead of a FunctionClauseError.

Canonical GPU atoms

ExAtlas refers to GPUs by stable atoms. ExAtlas.Spec.GpuCatalog maps each atom to each provider's native identifier.

Canonical    RunPod                       Lambda Labs              Fly.io            Vast.ai
:h200        "NVIDIA H200"                —                        —                 "H200"
:h100        "NVIDIA H100 80GB HBM3"      "gpu_1x_h100_pcie"       —                 "H100"
:a100_80g    "NVIDIA A100 80GB PCIe"      "gpu_1x_a100_sxm4_80gb"  "a100-80gb"       "A100_80GB"
:a100_40g    "NVIDIA A100-SXM4-40GB"      "gpu_1x_a100_sxm4"       "a100-pcie-40gb"  "A100"
:l40s        "NVIDIA L40S"                —                        "l40s"            —
:l4          "NVIDIA L4"                  —                        —                 —
:a6000       "NVIDIA RTX A6000"           "gpu_1x_a6000"           —                 "RTX_A6000"
:rtx_4090    "NVIDIA GeForce RTX 4090"    —                        —                 "RTX_4090"
:rtx_3090    "NVIDIA GeForce RTX 3090"    —                        —                 "RTX_3090"
:mi300x      "AMD Instinct MI300X OAM"    —                        —                 —

See ExAtlas.Spec.GpuCatalog for the full mapping.

The ExAtlas.Provider behaviour

Every provider implements one callback per operation. See ExAtlas.Provider for the full contract.

Callback                    Purpose
spawn_compute/2             Provision a GPU/CPU resource
get_compute/2               Fetch current status
list_compute/2              List with optional filters
stop/2 / start/2            Pause / resume
terminate/2                 Destroy
run_job/2                   Submit a serverless job
get_job/2 / cancel_job/2    Job control
stream_job/2                Stream partial outputs
capabilities/0              Declare supported features
list_gpu_types/1            Catalog + pricing

Callers can check ExAtlas.capabilities(:runpod) before relying on an optional feature:

if :serverless in ExAtlas.capabilities(provider) do
  ExAtlas.run_job(provider: provider, endpoint: "...", input: %{...})
end

Capability atoms

Atom                 Meaning
:spot                Interruptible/spot instances
:serverless          run_job/2 and friends
:network_volumes     Attach persistent volumes
:http_proxy          Provider terminates TLS on a *.proxy.* hostname
:raw_tcp             Public IP + mapped TCP ports
:symmetric_ports     Internal == external port guarantee
:webhooks            Push completion callbacks
:global_networking   Private networking across datacenters

Normalized specs (ExAtlas.Spec.*)

Requests and responses flow through normalized structs so callers don't have to know each provider's native shape.

Every spec struct has a :raw field preserving the provider's native response for callers who need fields ExAtlas hasn't yet normalized.

The :provider_opts field on request structs is the escape hatch for provider-specific options ExAtlas doesn't model — values are stringified and merged into the outgoing REST body.
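
For example, a sketch passing a hypothetical provider-native field through :provider_opts and reading an unnormalized field back off :raw. Both key names below are invented for illustration and are not part of ExAtlas's contract:

```elixir
{:ok, compute} =
  ExAtlas.spawn_compute(
    provider: :runpod,
    gpu: :h100,
    image: "ghcr.io/me/my-inference-server:latest",
    # Hypothetical provider-native option, merged into the outgoing REST body:
    provider_opts: %{container_disk_in_gb: 40}
  )

# Fields ExAtlas hasn't normalized yet remain reachable on :raw
# (key name illustrative):
compute.raw["machineType"]
```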

Auth primitives

ExAtlas.Auth.Token and ExAtlas.Auth.SignedUrl are exposed directly if you want them without the rest of the orchestration layer.

Bearer tokens

mint = ExAtlas.Auth.Token.mint()
# %{
#   token: "kX9fP...",                              # hand to client once
#   hash:  "4c1...",                                # persist this
#   header: "Authorization: Bearer kX9fP...",
#   env:   %{"ATLAS_PRESHARED_KEY" => "kX9fP..."}   # inject into the pod
# }

ExAtlas.Auth.Token.valid?(candidate, mint.hash)

When you pass auth: :bearer to spawn_compute/1, ExAtlas mints a token, adds it to the pod's env as ATLAS_PRESHARED_KEY, and returns the handle in compute.auth — all in one round-trip.
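
The mint-and-store-only-the-hash pattern looks roughly like this in plain Elixir. This is an illustrative sketch of the idea, not ExAtlas.Auth.Token's implementation; it assumes SHA-256 and OTP 25's :crypto.hash_equals/2 for the constant-time check:

```elixir
defmodule TokenSketch do
  @moduledoc "Illustrative only; see ExAtlas.Auth.Token for the real API."

  # Mint a random token; persist only the hash.
  def mint do
    token = Base.url_encode64(:crypto.strong_rand_bytes(32), padding: false)
    %{token: token, hash: hash(token)}
  end

  # Constant-time check of a candidate token against the stored hash.
  def valid?(candidate, stored_hash),
    do: :crypto.hash_equals(hash(candidate), stored_hash)

  defp hash(token), do: :crypto.hash(:sha256, token)
end
```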

S3-style signed URLs

For <video src>, <img src>, or any client that can't set request headers:

url =
  ExAtlas.Auth.SignedUrl.sign(
    "https://pod-id-8000.proxy.runpod.net/stream",
    secret: signing_secret,
    expires_in: 3600
  )

:ok = ExAtlas.Auth.SignedUrl.verify(url, secret: signing_secret)

The signature covers the path + canonicalized query + expiry with HMAC-SHA256; verification uses constant-time comparison.
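
The scheme can be sketched in plain Elixir. This illustrates the general idea under stated assumptions (expiry carried as an expires query param, sig appended as the last param); ExAtlas.Auth.SignedUrl's exact canonicalization may differ:

```elixir
defmodule SignedUrlSketch do
  @moduledoc """
  Illustrates HMAC-SHA256 over path + query + expiry with constant-time
  verification. Not ExAtlas.Auth.SignedUrl's actual implementation.
  """

  def sign(url, secret, expires_in) do
    uri = URI.parse(url)
    expires_at = System.system_time(:second) + expires_in

    query =
      [uri.query, "expires=#{expires_at}"]
      |> Enum.reject(&is_nil/1)
      |> Enum.join("&")

    sig =
      :crypto.mac(:hmac, :sha256, secret, (uri.path || "/") <> "?" <> query)
      |> Base.url_encode64(padding: false)

    URI.to_string(%{uri | query: query <> "&sig=" <> sig})
  end

  def verify(url, secret) do
    uri = URI.parse(url)
    # Assumes sign/3 appended sig last.
    [unsigned, sig] = String.split(uri.query || "", "&sig=", parts: 2)

    expected =
      :crypto.mac(:hmac, :sha256, secret, (uri.path || "/") <> "?" <> unsigned)
      |> Base.url_encode64(padding: false)

    %{"expires" => exp} = URI.decode_query(unsigned)

    cond do
      # :crypto.hash_equals/2 (OTP 25+) is a constant-time comparison.
      not :crypto.hash_equals(sig, expected) -> {:error, :bad_signature}
      String.to_integer(exp) < System.system_time(:second) -> {:error, :expired}
      true -> :ok
    end
  end
end
```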

Orchestrator — lifecycle, events, reaper

ExAtlas.Orchestrator.spawn/1

Spawns the resource via the provider, then starts an ExAtlas.Orchestrator.ComputeServer under ExAtlas.Orchestrator.ComputeSupervisor that:

  1. Registers itself in ExAtlas.Orchestrator.ComputeRegistry under {:compute, id}.
  2. Traps exits — its terminate/2 always calls ExAtlas.terminate/2 on the upstream provider, whether the supervisor shuts it down or it exits on an idle timeout.
  3. Tracks :last_activity_ms and compares against :idle_ttl_ms on every heartbeat tick. If idle, the server stops normally and the upstream resource is destroyed.
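
Point 2 is standard OTP: trapping exits guarantees terminate/2 runs even on supervisor-initiated shutdown. A minimal self-contained illustration (TerminateOnExit is not an ExAtlas module):

```elixir
defmodule TerminateOnExit do
  use GenServer

  # Runs `on_terminate` (in ExAtlas: the upstream provider terminate call)
  # whenever the server shuts down.
  def start_link(on_terminate), do: GenServer.start_link(__MODULE__, on_terminate)

  @impl true
  def init(on_terminate) do
    # Trapping exits converts exit signals into messages so terminate/2 runs.
    Process.flag(:trap_exit, true)
    {:ok, on_terminate}
  end

  @impl true
  def terminate(_reason, on_terminate), do: on_terminate.()
end
```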

PubSub events

Every state change is broadcast over ExAtlas.PubSub on the topic "compute:<id>" as {:atlas_compute, id, event}:

Event                         Emitted when
{:status, :running}           ComputeServer starts
{:heartbeat, monotonic_ms}    Heartbeat tick (no idle timeout)
{:terminating, reason}        Server is about to shut down
{:status, :terminated}        Upstream provider confirmed termination
{:terminate_failed, error}    Upstream terminate call returned an error

Subscribe in a LiveView:

Phoenix.PubSub.subscribe(ExAtlas.PubSub, "compute:" <> compute.id)

def handle_info({:atlas_compute, _id, {:status, :terminated}}, socket) do
  {:noreply, put_flash(socket, :info, "Session ended")}
end

Reaper

ExAtlas.Orchestrator.Reaper runs periodically (configurable, default 60s) and:

  1. Lists each configured provider's running resources.
  2. Compares against the resources tracked by the local ComputeRegistry.
  3. Terminates any orphan whose :name starts with :reap_name_prefix (default "atlas-").

The prefix is a safety switch so ExAtlas never touches pods created by other tools on the same cloud account. Set it to "" to disable.
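
Steps 1–3 reduce to a set difference plus the prefix filter. A pure sketch of the selection logic (the module name and map shape are illustrative, not ExAtlas internals):

```elixir
defmodule ReaperSketch do
  # Given the provider's live resources, the locally tracked ids, and the
  # safety prefix, return the resources the reaper would terminate.
  def orphans(provider_resources, tracked_ids, prefix) do
    Enum.filter(provider_resources, fn %{id: id, name: name} ->
      id not in tracked_ids and String.starts_with?(name, prefix)
    end)
  end
end
```

Note that a "" prefix matches every name, which is exactly why it disables the safety switch.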

Phoenix LiveDashboard integration

If your Phoenix app already mounts Phoenix.LiveDashboard, adding an ExAtlas tab is a one-liner — the library ships ExAtlas.LiveDashboard.ComputePage:

# lib/my_app_web/router.ex
import Phoenix.LiveDashboard.Router

live_dashboard "/dashboard",
  metrics: MyAppWeb.Telemetry,
  allow_destructive_actions: true,   # required for Stop/Terminate buttons
  additional_pages: [
    atlas: ExAtlas.LiveDashboard.ComputePage
  ]

Visit /dashboard/atlas to see a live-refreshing table of every tracked compute resource with per-row Touch, Stop, and Terminate controls. The page is only compiled when :phoenix_live_dashboard is in your deps (both LiveDashboard and LiveView are declared as optional: true in ExAtlas, so library-only users pay nothing).

HTTP layer + telemetry

Every provider uses Req under the hood:

  • Authorization: Bearer <api_key> for REST and serverless runtime endpoints.
  • ?api_key=<key> query param for RunPod's legacy GraphQL (used only for the pricing catalog).
  • :retry :transient with 3 retries by default.
  • Connection pooling via Finch (Req's default adapter).

Telemetry events

Every request emits [:ex_atlas, <provider>, :request]:

Measurement   Value
status        HTTP status code

Metadata      Value
api           :management / :runtime / :graphql
method        :get / :post / :delete / ...
url           Full request URL

Wire into your existing telemetry pipeline:

:telemetry.attach(
  "atlas-http-logger",
  [:ex_atlas, :runpod, :request],
  fn _event, measurements, metadata, _ ->
    Logger.info("ExAtlas → RunPod #{metadata.method} #{metadata.url} → #{measurements.status}")
  end,
  nil
)

Per-call Req overrides

Any option accepted by Req.new/1 can be passed via req_options:

ExAtlas.spawn_compute(
  provider: :runpod,
  gpu: :h100,
  image: "...",
  req_options: [receive_timeout: 60_000, max_retries: 5, plug: MyPlug]
)

Error handling

All provider callbacks return {:ok, value} or {:error, %ExAtlas.Error{}}. The error struct has a stable :kind atom you can pattern-match on:

Kind            When it happens
:unauthorized   Bad or missing API key (HTTP 401)
:forbidden      API key lacks permission (HTTP 403)
:not_found      Resource doesn't exist (HTTP 404)
:rate_limited   Provider returned HTTP 429
:timeout        Client-side timeout (e.g. run_sync over cap)
:unsupported    Provider lacks this capability
:validation     ExAtlas-side validation (e.g. missing :endpoint)
:provider       Provider-reported 4xx/5xx with no finer bucket
:transport      HTTP/socket failure
:unknown        Anything else

case ExAtlas.spawn_compute(provider: :runpod, gpu: :h100, image: "...") do
  {:ok, compute} -> ...
  {:error, %ExAtlas.Error{kind: :unauthorized}} -> rotate_key()
  {:error, %ExAtlas.Error{kind: :rate_limited}} -> backoff()
  {:error, err} -> Logger.error(Exception.message(err))
end

Writing your own provider

defmodule MyCloud.Provider do
  @behaviour ExAtlas.Provider

  @impl true
  def capabilities, do: [:http_proxy]

  @impl true
  def spawn_compute(%ExAtlas.Spec.ComputeRequest{} = req, ctx) do
    # translate `req` into your cloud's native payload,
    # POST it with Req, normalize the response into %ExAtlas.Spec.Compute{}
  end

  # ... implement the other callbacks ...
end

# Use it without any further configuration:
ExAtlas.spawn_compute(provider: MyCloud.Provider, gpu: :h100, image: "...")

Register it with a short atom by mapping it in your own code — ExAtlas accepts modules directly, so the atom is a convenience:

defmodule MyApp.ExAtlas do
  defdelegate spawn_compute(opts), to: ExAtlas
  # Or wrap ExAtlas and inject a default provider module
end

Testing

The ExAtlas.Test.ProviderConformance macro runs a shared ExUnit suite against any provider implementation:

defmodule MyCloud.ProviderTest do
  use ExUnit.Case, async: false

  use ExAtlas.Test.ProviderConformance,
    provider: MyCloud.Provider,
    reset: {MyCloud.TestHelpers, :reset_fixtures, []}
end

For unit tests that don't actually talk to a cloud, use the built-in ExAtlas.Providers.Mock:

setup do
  ExAtlas.Providers.Mock.reset()
  :ok
end

test "my code is provider-agnostic" do
  {:ok, compute} = MyApp.do_work(provider: :mock)
  assert compute.status == :running
end

RunPod tests against the live cloud are tagged @tag :live and are excluded from mix test by default — set RUNPOD_API_KEY and run mix test --only live to enable them.

Security considerations

  • Preshared tokens are secrets. ExAtlas.Auth.Token.mint/0 returns the raw token once. Store only the hash. If you must persist the raw token (e.g. to render it back to the user on page reload), encrypt at rest.
  • allow_destructive_actions on the LiveDashboard route must be gated by your own auth pipeline. The ExAtlas page does not authenticate operators — LiveDashboard doesn't either. Put it behind :require_admin.
  • Reaper safety. :reap_name_prefix is the only thing preventing the reaper from terminating pods other tools (or other ExAtlas-using apps) own on the same cloud account. Keep the prefix unique per deployment.
  • Outbound egress. RunPod's *.proxy.runpod.net is world-reachable. If the pod inside doesn't validate ATLAS_PRESHARED_KEY on every request, anyone with the URL can hit it.
  • HTTPS only. Every provider's base URL is HTTPS. If you override via :base_url (for testing with Bypass), use HTTPS for production.

Troubleshooting & FAQ

Q: (RuntimeError) ExAtlas.Orchestrator is not started. You didn't set config :ex_atlas, start_orchestrator: true. The orchestrator is opt-in.

Q: {:error, %ExAtlas.Error{kind: :unauthorized}} on every RunPod call. Your API key is missing or wrong. Check the resolution order: per-call :api_key → config :ex_atlas, :runpod, api_key: → the RUNPOD_API_KEY env var.

Q: get_job/2 returns {:error, %ExAtlas.Error{kind: :validation, message: "requires :endpoint"}}. RunPod's serverless API is scoped to an endpoint id. Pass it: ExAtlas.get_job(job.id, provider: :runpod, endpoint: "abc123").

Q: My LiveDashboard ExAtlas tab is empty. Either the orchestrator isn't running, or nothing has been spawned with ExAtlas.Orchestrator.spawn/1. Non-tracked resources (spawned via ExAtlas.spawn_compute/1 directly) don't show in the table — they're not under supervision.

Q: Stop/Terminate buttons don't show. Set allow_destructive_actions: true on the live_dashboard call.

Q: I want to use ExAtlas with httpc / Mint / Finch directly instead of Req. Rewrite the provider module, or pass a custom Req adapter via req_options: [adapter: my_adapter]. The ExAtlas.Provider contract doesn't mandate Req — it's an implementation choice of the bundled providers.

Roadmap

  • v0.1 — RunPod (full surface), Mock provider, orchestrator, auth, LiveDashboard page.
  • v0.2 — Fly.io Machines, Lambda Labs.
  • v0.3 — Vast.ai.

All future providers will be additive; adding a provider never breaks existing call sites.

Contributing

PRs welcome. Before opening:

mix format
mix compile --warnings-as-errors
mix test
mix docs              # verify docstrings render

For new providers, the shared conformance suite (test/support/provider_conformance.ex) must pass against your module.

License

Apache-2.0. See LICENSE.