All notable changes to this project will be documented in this file.
The format follows Keep a Changelog and ExAtlas adheres to Semantic Versioning.
v0.5.0 — unreleased
Closes all remaining audit items. The library is now at feature parity with the audit recommendations.
Added
- ExAtlas.Fly.Supervisor (E3): top-level supervisor for the Fly sub-tree, exposed as a child_spec/1 so hosts can embed ExAtlas under their own supervision tree. ExAtlas.Application delegates to its fly_children/0 to avoid duplication.
- ExAtlas.Fly.Tokens.refresh/1 (E5): atomic invalidate-then-acquire. Equivalent to invalidate/1 + get/1 but runs under a single GenServer call on the AppServer, closing the race where a concurrent caller acquires between the two.
- ExAtlas.Fly.Dispatcher.subscribe_with_backpressure/2 (E6): opt-in eviction watchdog. Monitors the subscriber's message queue and signals an eviction via {:ex_atlas_fly_backpressure_evict, topic} if the queue exceeds a configurable threshold.
- Proactive soft-expiry refresh (E7): ExAtlas.Fly.Tokens.AppServer schedules a background refresh :soft_expiry_lead_seconds (default 3600) before a cached token's expires_at. Avoids the expiry cliff where every caller around expiry hits the CLI at once.
- Monorepo discovery (M4): ExAtlas.Fly.Deploy.discover_apps/2 now accepts a :max_depth option. Default 1 preserves current behavior; set higher for apps/<name>/fly.toml layouts.
- Streamer shutdown signal (L5): the Streamer sends a final {:ex_atlas_fly_logs_stopped, app_name} on its topic when it terminates, so subscribers can unsubscribe themselves from the framework-agnostic dispatcher.
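The two new messages above could be consumed by a host process roughly as follows. This is a minimal sketch: the LogSubscriber module and its events/1 helper are hypothetical, and only the two message tuples are taken from this changelog.

```elixir
defmodule LogSubscriber do
  @moduledoc "Hypothetical host-side subscriber; only the message shapes come from the changelog."
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Returns the events seen so far, oldest first (for illustration only).
  def events(pid), do: GenServer.call(pid, :events)

  @impl true
  def init(:ok), do: {:ok, []}

  @impl true
  def handle_call(:events, _from, events), do: {:reply, Enum.reverse(events), events}

  # The Streamer for `app_name` has terminated; no more log batches will arrive.
  @impl true
  def handle_info({:ex_atlas_fly_logs_stopped, app_name}, events) do
    {:noreply, [{:stopped, app_name} | events]}
  end

  # The backpressure watchdog judged this subscriber's queue too long.
  def handle_info({:ex_atlas_fly_backpressure_evict, topic}, events) do
    {:noreply, [{:evicted, topic} | events]}
  end

  # Ignore anything else so unrelated messages do not crash the process.
  def handle_info(_other, events), do: {:noreply, events}
end
```

A real subscriber would likely unsubscribe or stop in these clauses instead of recording the event.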
Changed
- ExAtlas.Fly.Deploy.deploy/2 (M5): now returns {:error, {:fly_error, :not_found, _}} when fly is not on PATH, matching stream_deploy/3. Previously raised ErlangError from System.cmd/3 on missing executables.
- ExAtlas.Fly.Deploy.parse_app_name/1 (L3): tightened regex so that quoted values must not contain whitespace (pre-fix, app = "my app" returned {:ok, "my"}). Still accepts unquoted values and whitespace-separated inline comments on the app = line.
- ExAtlas.Fly.Logs.Streamer L7 race fix: until the first subscriber registers via subscribe_pid/2, the Streamer advances its cursor silently without dispatching. Previously the very first poll could fire before a caller's subscribe_pid/2, dropping the first batch onto a zero-subscriber topic.
- ExAtlas.Fly.Logs.Streamer.subscribe/2 (L4): project_dir is no longer required. New subscribe/2 arity takes keyword options only; subscribe/3 stays for backward compatibility with the old positional signature.
- ExAtlas.Fly.Tokens.AppServer config resolution (M8, M9): :fly_config_file_enabled and :cli_timeout_ms are now resolved once at AppServer init/1 rather than on every handle_call. Uses the consistent Keyword.get(config, :key, default) pattern.
- ExAtlas.Fly.Tokens.AppServer structured logging (E4): remaining Logger.warning interpolations for CLI failures now use metadata (app:, exit_code:, output:, timeout_ms:) instead of interpolated strings.
- ExAtlas.Fly.TokenStorage.Dets mkdir fallback (M6): when the explicitly-configured :storage_path is not writable, falls back to System.tmp_dir!/0 with a :warning log, rather than crashing on File.mkdir_p!/1. Previously only the default path had the fallback.
- unless → if throughout deploy.ex (L1).
- deploy/2 and stream_deploy/3 error shape typed explicitly (L2): new ExAtlas.Fly.Deploy.deploy_error/0 type spec documents the three :fly_error reason variants (:not_found, :timeout, non_neg_integer()).
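A caller might branch on the three documented :fly_error reason variants along these lines. This is a sketch under assumptions: DeployResult is a hypothetical helper, the {:ok, _} success shape is not stated in this changelog, and reading the integer variant as a CLI exit status is a guess.

```elixir
defmodule DeployResult do
  # Maps a deploy/2-style result to a human-readable message.
  # ASSUMPTION: the success shape {:ok, _} is not stated in the changelog.
  def describe({:ok, _}), do: "deploy succeeded"
  def describe({:error, {:fly_error, :not_found, _}}), do: "fly CLI not found on PATH"
  def describe({:error, {:fly_error, :timeout, _}}), do: "deploy timed out"

  # ASSUMPTION: the non_neg_integer() variant is treated as an exit status.
  def describe({:error, {:fly_error, code, _}}) when is_integer(code) and code >= 0,
    do: "fly exited with status #{code}"

  # Catch-all so an unexpected shape is reported rather than raising.
  def describe(other), do: "unexpected result: #{inspect(other)}"
end
```

Because deploy/2 no longer raises when the executable is missing, a single function like this can cover both deploy/2 and stream_deploy/3 results.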
Installer
- mix ex_atlas.install runtime.exs example (M2): the post-install notice now includes a runtime.exs pattern for containerized deploys that want to override :storage_path via an environment variable.
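The pattern might look something like the fragment below. It is not the installer's actual output: the EX_ATLAS_STORAGE_PATH variable name is invented for illustration, and placing :storage_path under the :fly key is an assumption based on the config :ex_atlas, :fly, enabled: false form used elsewhere in this file.

```elixir
# config/runtime.exs
import Config

# EX_ATLAS_STORAGE_PATH is a hypothetical variable name chosen for this sketch.
if path = System.get_env("EX_ATLAS_STORAGE_PATH") do
  # ASSUMPTION: :storage_path sits under the same :fly key as `enabled:`.
  config :ex_atlas, :fly, storage_path: path
end
```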
Dispatcher docs (H7)
- Added a subsection describing dispatch serialization semantics and pointing hosts with large fan-out at :phoenix_pubsub mode. The per-subscriber send/2 loop in :registry mode is documented as intentional for the typical log-streaming / deploy workload.
v0.4.1 — unreleased
Changed — Async token persist (closes audit H3)
- ExAtlas.Fly.Tokens.AppServer now offloads cached-token storage writes to a supervised Task under a new ExAtlas.Fly.Tokens.TaskSupervisor child. The AppServer's handle_call replies as soon as ETS is updated; :dets.sync happens in the background.
- Net effect: a slow storage write for one app no longer blocks that app's own subsequent token requests (and never blocked other apps', post-E1). Callers get the token with latency gated on ETS + cmd_fn only. Audit finding H3.
- Manual-token persist stays synchronous. Manual tokens are not re-acquirable, so ExAtlas.Fly.Tokens.set_manual/2 still returns {:error, {:persist_failed, reason}} when storage raises; the caller must know if persist failed.
- Persist failures on the cached path continue to log at :error level with {app, reason} metadata, now emitted from the task rather than the mailbox (contract preserved, emission point moved).
Added
- ExAtlas.Fly.Tokens.TaskSupervisor is a new child of ExAtlas.Fly.Tokens.Supervisor, ordered after ETSOwner and before the DynamicSupervisor. Tests can inject a custom name via :task_sup on Tokens.Supervisor.start_link/1.
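The reply-then-persist shape described in this release can be illustrated with a generic sketch. It is not ExAtlas's implementation: AsyncPersistSketch, its option names, and the persist_fn callback are all hypothetical stand-ins for the AppServer, its configuration, and the DETS write.

```elixir
defmodule AsyncPersistSketch do
  @moduledoc "Generic reply-then-persist pattern; not ExAtlas's implementation."
  use GenServer

  # opts: :table (ETS table name), :task_sup (Task.Supervisor name),
  # :persist_fn (2-arity function standing in for the durable write).
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def put(server, key, value), do: GenServer.call(server, {:put, key, value})

  @impl true
  def init(opts) do
    table = :ets.new(Keyword.fetch!(opts, :table), [:set, :public, :named_table])

    {:ok,
     %{
       table: table,
       task_sup: Keyword.fetch!(opts, :task_sup),
       persist_fn: Keyword.fetch!(opts, :persist_fn)
     }}
  end

  @impl true
  def handle_call({:put, key, value}, _from, state) do
    # 1. Update the in-memory cache synchronously.
    :ets.insert(state.table, {key, value})

    # 2. Hand the slow durable write to a supervised task.
    persist_fn = state.persist_fn
    {:ok, _pid} = Task.Supervisor.start_child(state.task_sup, fn -> persist_fn.(key, value) end)

    # 3. Reply without waiting for the write to finish.
    {:reply, :ok, state}
  end
end
```

The trade-off is that a failed write can only be logged, because the caller already has its reply. That fits the decision above to keep the manual-token path synchronous.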
v0.4.0 — unreleased
Changed — Per-app Fly tokens (audit E1; closes H3, H4)
- Replaced the singleton ExAtlas.Fly.Tokens.Server with a per-app ExAtlas.Fly.Tokens.AppServer supervised under ExAtlas.Fly.Tokens.Supervisor. Token resolution for one app no longer blocks resolution for any other. A thundering herd of CLI acquisitions (e.g. post-VM-restart across N apps) now runs in parallel rather than serialized behind a single mailbox.
- ExAtlas.Fly.Tokens.Server is removed. The documented public API (ExAtlas.Fly.Tokens.{get/1, invalidate/1, set_manual/2}) is unchanged and remains the stable entry point.
- Shared ETS table (:ex_atlas_fly_tokens) is now :public and owned by ExAtlas.Fly.Tokens.ETSOwner, outliving individual AppServer crashes. A crashed AppServer restarts with its cache intact; an ETSOwner crash rebuilds the whole tokens subtree via :rest_for_one (Registry survives, DynamicSupervisor + every AppServer restart).
- Concurrent Tokens.get/1 calls for the same app coalesce at the AppServer mailbox: only the first-in-line caller invokes the CLI; subsequent callers re-check ETS (filled by the first) before descending the resolution chain.
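The coalescing behaviour can be sketched generically: a public ETS fast path, and a server that re-checks the cache before doing the expensive work. CoalesceSketch and its acquire_fn option are hypothetical names standing in for the AppServer and the CLI call; this is not ExAtlas's code.

```elixir
defmodule CoalesceSketch do
  @moduledoc "Generic mailbox-coalescing pattern; not ExAtlas's implementation."
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # Fast path: read the public ETS table directly; fall back to the server on a miss.
  def get(server, table, key) do
    case :ets.lookup(table, key) do
      [{^key, value}] -> value
      [] -> GenServer.call(server, {:get, key})
    end
  end

  @impl true
  def init(opts) do
    table = :ets.new(Keyword.fetch!(opts, :table), [:set, :public, :named_table])
    {:ok, %{table: table, acquire_fn: Keyword.fetch!(opts, :acquire_fn)}}
  end

  @impl true
  def handle_call({:get, key}, _from, state) do
    value =
      case :ets.lookup(state.table, key) do
        # A caller ahead of us in the mailbox already filled the cache.
        [{^key, cached}] ->
          cached

        # First in line: run the expensive acquisition once and cache it.
        [] ->
          fresh = state.acquire_fn.(key)
          :ets.insert(state.table, {key, fresh})
          fresh
      end

    {:reply, value, state}
  end
end
```

Because a GenServer handles one call at a time, concurrent misses queue up behind the first one and find the cache populated when their turn comes.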
Added
- [:ex_atlas, :fly, :token, :acquire] :stop metadata gains a new :acquirer field: :facade for pure ETS fast-path hits (no AppServer consulted) or :app_server for slow-path / coalesced resolutions. Existing handlers that match only on :source are unaffected. See guides/telemetry.md for the diagnostic interpretation.
- ExAtlas.Fly.Tokens.Supervisor.whereis_app_server/2 and resolve_app_server/2: lookup / resolve-or-start helpers. Primarily for tests.
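A handler that uses both :source and :acquirer might be shaped like the sketch below. TokenAcquireHandler and its return atoms are hypothetical. The event name assumes the usual :telemetry span convention of appending :stop to the prefix, which guides/telemetry.md should confirm.

```elixir
defmodule TokenAcquireHandler do
  @moduledoc "Hypothetical handler; event prefix and metadata keys are from the changelog."

  # ASSUMPTION: span events append :stop to the documented prefix.
  @stop_event [:ex_atlas, :fly, :token, :acquire, :stop]

  # Classifies one :stop event from its metadata.
  def handle_event(@stop_event, _measurements, metadata, _config) do
    classify(metadata[:source], metadata[:acquirer])
  end

  # ETS hit served by the facade without consulting an AppServer.
  def classify(:ets, :facade), do: :fast_path_hit
  # Anything resolved (or coalesced) inside the AppServer.
  def classify(_source, :app_server), do: :slow_path
  # Older events without :acquirer, or unexpected combinations.
  def classify(_source, _acquirer), do: :unknown
end
```

A host would register it by passing the event name and &TokenAcquireHandler.handle_event/4 to :telemetry.attach/4. A real handler would typically emit a metric or log line rather than return an atom.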
v0.3.1 — 2026-04-22
Added — Telemetry for Fly platform ops
- [:ex_atlas, :fly, :token, :acquire] span events around every ExAtlas.Fly.Tokens.get/1 call. :stop metadata includes source: (:ets / :storage / :config / :cli / :manual / :none) so operators can measure cache-hit rate and acquisition-path latency.
- [:ex_atlas, :fly, :logs, :fetch] span events around ExAtlas.Fly.Logs.Client.fetch_logs/3. Metadata: {app, status, count}. Inherited automatically by fetch_logs_with_retry/2.
- [:ex_atlas, :fly, :deploy, :line] (one per non-empty output line) + [:ex_atlas, :fly, :deploy, :exit] (one per deploy termination) from Deploy.stream_deploy/3. Line content is deliberately excluded because Fly build output can contain bearer tokens.
See guides/telemetry.md for the full event reference.
Added — Shared TokenStorage conformance suite
- ExAtlas.Fly.TokenStorageConformance: a use-able ExUnit macro that any TokenStorage implementation can adopt to inherit the full get/put/delete contract coverage across :cached and :manual keys. Mirrors the existing ExAtlas.Test.ProviderConformance pattern.
- Memory and Dets both run under the shared suite now, so any future adapter (Redis, Postgres, vault) can prove parity with one use line.
v0.3.0 — unreleased
Changed — Fly token / streamer return contracts
- ExAtlas.Fly.Tokens.set_manual/2 (and Tokens.Server.set_manual_token/3) now return :ok | {:error, {:persist_failed, reason}} instead of always :ok. Manual tokens are not re-acquirable, so storage failures must be surfaced rather than silently logged. Callers that pattern-match on :ok should handle the error tuple.
- ExAtlas.Fly.subscribe_logs/3 (and Streamer.subscribe/3) now return :ok | {:error, :no_streamer} when no streamer can be resolved (e.g. the Fly sub-tree is disabled). Previously this case returned a silent :ok with no messages ever arriving.
Fixed — Hardening round
- ExAtlas.Fly.Tokens.Server persist/3 (cached path) now returns :ok | {:error, {:persist_failed, reason}} and logs failures at :error level with {app, reason} metadata instead of :warning with interpolated strings. ETS still holds a fresh token for the session, but a silent storage outage is now operator-visible.
- ExAtlas.Fly.Dispatcher :mfa mode wraps the host MFA in try/rescue/catch so a raising MFA no longer takes down the caller (most commonly the log Streamer, whose crash drops the pagination cursor). Failures are logged at :error level with the topic and MFA identity.
- ExAtlas.Fly.TokenStorage.Dets refuses to auto-recreate a corrupt manual.dets file on startup, because manual tokens are bearer credentials that are NOT re-acquirable. Returns {:stop, {:manual_dets_corrupt, path, reason}} and preserves the file for operator intervention. The cached-token path still recreates (re-acquirable, perf regression only).
- ExAtlas.Fly.TokenStorage.Dets now chmods the storage dir to 0700 and each DETS file to 0600 after open. Default umask on typical Linux/macOS left token files world- or group-readable.
- mix ex_atlas.install surfaces .gitignore update failures as an Igniter.add_notice with the exact line the user must add manually; previously the installer silently swallowed the exception and moved on.
- ExAtlas.Fly.TokenStorage.Memory (test support) now catches :exit from pre-init reads and returns :error, matching the Dets rescue ArgumentError semantics so the test double is faithful to prod.
Added
- ExAtlas.Fly.TokenStorage.Dets.start_link/1 accepts :name, :cached_table, :manual_table opts so custom-supervised / per-test instances are possible alongside the default singleton.
- First test coverage for ExAtlas.Fly.Dispatcher, TokenStorage.Dets, TokenStorage.Memory, and the Streamer.subscribe/3 silent-failure path.
v0.2.0 — unreleased
Fixed
- ExAtlas.Fly.Logs.Client.next_start_time/1 no longer crashes the Streamer when a log entry has a nil or malformed ISO-8601 timestamp; unparseable entries are logged and skipped.
- ExAtlas.Fly.Deploy.stream_deploy/3 cleans both the activity and absolute timers symmetrically across all exit branches, so no stray {:deploy_*_timeout, _} message leaks into a long-lived caller's mailbox. Exposes :activity_timeout_ms / :max_timeout_ms options.
- ExAtlas.Fly.Tokens.Server now implements terminate/2 to delete its named ETS table, avoiding an ArgumentError on supervisor restart, and defensively reclaims an existing table in init/1.
- ExAtlas.Fly.Tokens.Server shuts down a hung fly CLI task with :brutal_kill so the configured cli_timeout_ms is actually the mailbox blocking time, not cli_timeout_ms + 5_000.
- ExAtlas.Fly.Logs.StreamerSupervisor uses :rest_for_one with a generous restart budget on the DynamicSupervisor so one app's misbehaving streamer no longer tears down the registry and every other app's pagination cursor.
Added — Fly.io platform operations
- ExAtlas.Fly top-level facade for Fly.io platform ops: discover_apps/1, deploy/2, stream_deploy/3, subscribe_logs/3, unsubscribe_logs/1, subscribe_deploy/1, unsubscribe_deploy/1.
- ExAtlas.Fly.Deploy: fly deploy --remote-only with 15 min timeout (deploy/2) and Port-based streaming (stream_deploy/3) with a 5 min activity timer and 30 min absolute cap. Dispatches {:ex_atlas_fly_deploy, ticket_id, line} on each line.
- ExAtlas.Fly.Logs.Client: Req-backed client for the Fly Machines log API (NDJSON, cursor pagination, automatic 401 retry).
- ExAtlas.Fly.Logs.Streamer + StreamerSupervisor: per-app GenServer that polls the log API every 2 s, dispatches {:ex_atlas_fly_logs, app, entries}, and stops once all subscribers have disconnected (monitor-based).
- ExAtlas.Fly.Tokens + ExAtlas.Fly.Tokens.Server: cache-first token resolver. Order: ETS → TokenStorage → ~/.fly/config.yml → fly tokens create readonly → manual override.
- ExAtlas.Fly.TokenStorage: pluggable behaviour for durable token persistence. Default impl ExAtlas.Fly.TokenStorage.Dets is zero-config and survives VM restarts.
- ExAtlas.Fly.Dispatcher: framework-agnostic broadcast. Modes: :registry (default, zero-dep), :phoenix_pubsub (when host uses Phoenix), or {:mfa, {m, f, a}} custom routing.
- ExAtlas.Application now supervises the Fly sub-tree by default. Disable with config :ex_atlas, :fly, enabled: false.
Added — Igniter installer
- mix ex_atlas.install: adds sensible config :ex_atlas, :fly defaults, creates the DETS storage directory, wires phoenix_pubsub when available.
- mix ex_atlas.upgrade: runs per-version upgraders (no-op for 0.1.x → 0.2.0; reserved for future migrations).
Changed
- Description and package scope broadened from "GPU/compute SDK" to "infrastructure SDK".
- ExAtlas.Application's Fly sub-tree boots by default. The existing orchestrator sub-tree is still opt-in via start_orchestrator: true.
v0.1.0 — unreleased
Initial public release.
Added — Core API
- ExAtlas top-level provider-agnostic module (spawn_compute/1, get_compute/2, list_compute/1, stop/2, start/2, terminate/2, run_job/1, get_job/2, cancel_job/2, stream_job/2, list_gpu_types/1, capabilities/1).
- ExAtlas.Provider behaviour defining the contract every provider implements.
- ExAtlas.Config: per-call > app-env > env-var resolution for provider and API key. Supports user-defined provider modules passed directly by name (no registration needed).
- ExAtlas.Error: canonical error struct with :kind atoms (:unauthorized, :not_found, :rate_limited, :timeout, :unsupported, :validation, :provider, :transport, :unknown) and from_response/3 for translating HTTP responses.
Added — Normalized specs
- ExAtlas.Spec.ComputeRequest: input to spawn_compute/1 with NimbleOptions-validated fields, :provider_opts escape hatch.
- ExAtlas.Spec.Compute: normalized compute resource response.
- ExAtlas.Spec.JobRequest / ExAtlas.Spec.Job: serverless jobs.
- ExAtlas.Spec.GpuType: catalog entry with pricing + stock.
- ExAtlas.Spec.GpuCatalog: stable canonical GPU atoms (:h100, :a100_80g, :rtx_4090, ...) mapped to each provider's native identifier.
Added — Providers
- ExAtlas.Providers.RunPod: full implementation covering REST management (pods, endpoints, templates, network volumes, billing), serverless runtime (async/sync/stream job submission, status, cancel), and the legacy GraphQL pricing catalog. Built on Req.
  - Sub-modules: Client, GraphQL, Pods, Endpoints, Jobs, Templates, NetworkVolumes, Billing, Translate.
- ExAtlas.Providers.Mock: in-memory ETS-backed provider for tests and demos. Implements every callback.
- ExAtlas.Providers.Stub macro: shared base for placeholder providers.
- ExAtlas.Providers.Fly, ExAtlas.Providers.LambdaLabs, ExAtlas.Providers.Vast: placeholder modules reserving atoms and capability lists for v0.2 / v0.3.
Added — Auth
- ExAtlas.Auth.Token: cryptographically random 256-bit bearer tokens with SHA-256 hashing and constant-time comparison (Plug.Crypto).
- ExAtlas.Auth.SignedUrl: S3-style HMAC-SHA256 signed URLs with expiry, for media streams and WebSockets that can't set headers.
- Auto-injection: auth: :bearer on spawn_compute/1 mints a token, injects it into the pod as ATLAS_PRESHARED_KEY, and returns the handle in compute.auth.
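The token scheme can be illustrated with OTP's :crypto alone. This sketch is not ExAtlas.Auth.Token: TokenSketch and its function names are hypothetical, the URL-safe Base64 encoding is an assumption, and a hand-rolled XOR fold stands in for the Plug.Crypto comparison the library uses.

```elixir
defmodule TokenSketch do
  @moduledoc "Illustrative only; ExAtlas.Auth.Token's real API is not shown in the changelog."
  import Bitwise

  # 32 random bytes = 256 bits, encoded as URL-safe Base64 without padding.
  def mint, do: :crypto.strong_rand_bytes(32) |> Base.url_encode64(padding: false)

  # Store only the SHA-256 digest, never the raw token.
  def hash(token), do: :crypto.hash(:sha256, token)

  # Checks a presented token against a stored digest.
  def valid?(token, stored_hash), do: secure_compare(hash(token), stored_hash)

  # Folds XOR over every byte pair so the work done does not depend on
  # where the first mismatch occurs.
  defp secure_compare(a, b) when byte_size(a) == byte_size(b) do
    a
    |> :binary.bin_to_list()
    |> Enum.zip(:binary.bin_to_list(b))
    |> Enum.reduce(0, fn {x, y}, acc -> acc ||| bxor(x, y) end)
    |> Kernel.==(0)
  end

  defp secure_compare(_a, _b), do: false
end
```

In production code a vetted comparison such as the one in Plug.Crypto is preferable to a hand-written fold.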
Added — Orchestrator (opt-in)
- ExAtlas.Orchestrator: high-level API (spawn/1, touch/1, info/1, stop_tracked/1, list_ids/0).
- ExAtlas.Orchestrator.ComputeServer: one GenServer per tracked resource, traps exits, enforces :idle_ttl_ms, broadcasts state changes via ExAtlas.Orchestrator.Events.
- ExAtlas.Orchestrator.ComputeSupervisor (DynamicSupervisor) + ExAtlas.Orchestrator.ComputeRegistry (Registry with :via lookup).
- ExAtlas.Orchestrator.Reaper: periodic reconciliation; terminates orphans whose :name matches the configurable safety-prefix.
- ExAtlas.Application starts the tree only when config :ex_atlas, start_orchestrator: true; library-only users pay nothing.
- Phoenix.PubSub broadcasts on "compute:<id>" topic as {:atlas_compute, id, event} for {:status, s}, {:heartbeat, ms}, {:terminating, reason}, {:terminate_failed, err} events.
Added — Phoenix LiveDashboard integration
- ExAtlas.LiveDashboard.ComputePage: drop-in Phoenix.LiveDashboard.PageBuilder page. Host apps mount it via additional_pages: [atlas: ExAtlas.LiveDashboard.ComputePage]. Live table with Touch/Stop/Terminate row actions. Auto-refreshing; subscribes to ExAtlas.PubSub for push updates when available.
- Guarded by Code.ensure_loaded?(Phoenix.LiveDashboard.PageBuilder) so the module only compiles when LiveDashboard is in the host app's deps.
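The compile-time guard follows a common Elixir idiom for optional dependencies, sketched here with stand-in names. OptionalDep.PageBuilder and MyLib.DashboardPage are hypothetical and take the place of Phoenix.LiveDashboard.PageBuilder and the ExAtlas page module.

```elixir
# Both module names are hypothetical stand-ins for this sketch.
# The defmodule is only evaluated when the optional dependency is loadable,
# so hosts without it never compile (or see) the dashboard page.
if Code.ensure_loaded?(OptionalDep.PageBuilder) do
  defmodule MyLib.DashboardPage do
    # Present only when the optional dependency is available.
    def available?, do: true
  end
end
```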
Added — HTTP + observability
- Every REST / runtime / GraphQL request goes through Req with :retry :transient, 3 retries by default, and telemetry.
- Telemetry events [:ex_atlas, <provider>, :request] with %{status: status} measurements and %{api, method, url} metadata.
- Per-call Req overrides via req_options:.
Added — Testing
- ExAtlas.Test.ProviderConformance: shared ExUnit suite every provider implementation must pass. use-macro form accepts :reset MFA for test isolation.
- Full unit coverage (68 tests, 3 doctests).
Added — Documentation
- Comprehensive README.md with architecture diagram, capability matrix, GPU mapping table, error kinds, security considerations, FAQ, and roadmap.
- guides/getting_started.md, guides/transient_pods.md, guides/writing_a_provider.md, guides/telemetry.md, guides/testing.md: long-form deep-dives surfaced via ex_doc extras.
- Full module-level @moduledoc on every public module.