All notable changes to this project will be documented in this file.

The format follows Keep a Changelog and ExAtlas adheres to Semantic Versioning.

v0.5.0 — unreleased

Closes all remaining audit items. The library is now at feature parity with the audit recommendations.

Added

  • ExAtlas.Fly.Supervisor (E3) — top-level supervisor for the Fly sub-tree, exposed as a child_spec/1 so hosts can embed ExAtlas under their own supervision tree. ExAtlas.Application delegates to its fly_children/0 to avoid duplication.
  • ExAtlas.Fly.Tokens.refresh/1 (E5) — atomic invalidate-then-acquire. Equivalent to invalidate/1 + get/1 but runs under a single GenServer call on the AppServer, closing the race where a concurrent caller acquires between the two.
  • ExAtlas.Fly.Dispatcher.subscribe_with_backpressure/2 (E6) — opt-in eviction watchdog. Monitors the subscriber's message queue and signals an eviction via {:ex_atlas_fly_backpressure_evict, topic} if the queue exceeds a configurable threshold.
  • Proactive soft-expiry refresh (E7) — ExAtlas.Fly.Tokens.AppServer schedules a background refresh :soft_expiry_lead_seconds (default 3600) before a cached token's expires_at. Avoids the expiry cliff where every caller around expiry hits the CLI at once.
  • Monorepo discovery (M4) — ExAtlas.Fly.Deploy.discover_apps/2 now accepts a :max_depth option. Default 1 preserves current behavior; set higher for apps/<name>/fly.toml layouts.
  • Streamer shutdown signal (L5) — the Streamer sends a final {:ex_atlas_fly_logs_stopped, app_name} on its topic when it terminates, so subscribers can unsubscribe themselves from the framework-agnostic dispatcher.
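
A subscriber can react to the two control messages introduced above roughly as follows. This is a minimal sketch that assumes only the message shapes named in these entries; the module `SubscriberSketch`, its function names, and its return atoms are hypothetical and not part of ExAtlas.

```elixir
defmodule SubscriberSketch do
  # Classifies the control messages described in the entries above.
  def handle_signal({:ex_atlas_fly_backpressure_evict, topic}), do: {:evicted, topic}
  def handle_signal({:ex_atlas_fly_logs_stopped, app_name}), do: {:stopped, app_name}
  def handle_signal(_other), do: :ignore

  # Consumes messages from the caller's mailbox until a control message
  # arrives or `timeout` ms pass without any message. Unrelated messages
  # are dropped, so a real subscriber would process them instead.
  def await_signal(timeout \\ 5_000) do
    receive do
      message ->
        case handle_signal(message) do
          :ignore -> await_signal(timeout)
          signal -> signal
        end
    after
      timeout -> :timeout
    end
  end
end
```

On `{:stopped, app_name}` or `{:evicted, topic}` the subscriber would typically unsubscribe and release any per-topic state.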

Changed

  • ExAtlas.Fly.Deploy.deploy/2 (M5) — now returns {:error, {:fly_error, :not_found, _}} when fly is not on PATH, matching stream_deploy/3. Previously raised ErlangError from System.cmd/3 on missing executables.
  • ExAtlas.Fly.Deploy.parse_app_name/1 (L3) — tightened regex: quoted values must not contain whitespace (pre-fix app = "my app" returned {:ok, "my"}). Still accepts unquoted values and whitespace-separated inline comments on the app = line.
  • ExAtlas.Fly.Logs.Streamer race fix (L7) — until the first subscriber registers via subscribe_pid/2, the Streamer advances its cursor silently without dispatching. Previously the very first poll could fire before a caller's subscribe_pid/2, dropping the first batch onto a zero-subscriber topic.
  • ExAtlas.Fly.Logs.Streamer.subscribe/2 (L4) — project_dir is no longer required. New subscribe/2 arity takes keyword options only; subscribe/3 stays for backward compatibility with the old positional signature.
  • ExAtlas.Fly.Tokens.AppServer config resolution (M8, M9) — :fly_config_file_enabled and :cli_timeout_ms are now resolved once at AppServer init/1 rather than on every handle_call. Uses the consistent Keyword.get(config, :key, default) pattern.
  • ExAtlas.Fly.Tokens.AppServer structured logging (E4) — remaining Logger.warning interpolations for CLI failures now use metadata (app:, exit_code:, output:, timeout_ms:) instead of interpolated strings.
  • ExAtlas.Fly.TokenStorage.Dets mkdir fallback (M6) — when the explicitly-configured :storage_path is not writable, falls back to System.tmp_dir!/0 with a :warning log, rather than crashing on File.mkdir_p!/1. Previously only the default path had the fallback.
  • unless → if throughout deploy.ex (L1).
  • deploy/2 and stream_deploy/3 error shape typed explicitly (L2) — new ExAtlas.Fly.Deploy.deploy_error/0 type spec documents the three :fly_error reason variants (:not_found, :timeout, non_neg_integer()).
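
Callers can match on the three :fly_error reason variants like this. The sketch assumes a `{:ok, output}` success shape and treats the integer reason as the CLI exit status; neither is stated in this changelog. `DeployResultSketch` is a hypothetical module.

```elixir
defmodule DeployResultSketch do
  # `{:ok, output}` is an assumed success shape, not confirmed by the changelog.
  def describe({:ok, _output}), do: "deploy succeeded"

  # The error tuples follow the :fly_error variants listed above.
  def describe({:error, {:fly_error, :not_found, _detail}}),
    do: "fly executable not found on PATH"

  def describe({:error, {:fly_error, :timeout, _detail}}),
    do: "deploy timed out"

  # Assumption: the integer reason is the exit status of the fly CLI.
  def describe({:error, {:fly_error, status, _detail}})
      when is_integer(status) and status >= 0,
      do: "fly exited with status #{status}"
end
```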

Installer

  • mix ex_atlas.install runtime.exs example (M2) — the post-install notice now includes a runtime.exs pattern for containerized deploys that want to override :storage_path via an environment variable.
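
A runtime.exs override of this kind could look like the fragment below. It is a sketch, not the installer's actual output: the environment variable name EX_ATLAS_STORAGE_PATH is hypothetical, and placing :storage_path under `config :ex_atlas, :fly` is an assumption based on the other entries in this file.

```elixir
# config/runtime.exs
import Config

# EX_ATLAS_STORAGE_PATH is a hypothetical variable name chosen for this sketch.
# When it is unset, the compile-time default stays in effect.
if storage_path = System.get_env("EX_ATLAS_STORAGE_PATH") do
  config :ex_atlas, :fly, storage_path: storage_path
end
```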

Dispatcher docs (H7)

  • Added a subsection describing dispatch serialization semantics and pointing hosts with large fan-out at :phoenix_pubsub mode. The per-subscriber send/2 loop in :registry mode is documented as intentional for the typical log-streaming / deploy workload.

v0.4.1 — unreleased

Changed — Async token persist (closes audit H3)

  • ExAtlas.Fly.Tokens.AppServer now offloads cached-token storage writes to a supervised Task under a new ExAtlas.Fly.Tokens.TaskSupervisor child. The AppServer's handle_call replies as soon as ETS is updated; :dets.sync happens in the background.
  • Net effect: a slow storage write for one app no longer blocks that app's own subsequent token requests (and never blocked other apps', post-E1). Callers get the token with latency gated on ETS + cmd_fn only. Audit finding H3.
  • Manual-token persist stays synchronous. Manual tokens are not re-acquirable, so ExAtlas.Fly.Tokens.set_manual/2 still returns {:error, {:persist_failed, reason}} when storage raises — the caller must know if persist failed.
  • Persist failures on the cached path continue to log at :error level with {app, reason} metadata, now emitted from the task rather than the mailbox (contract preserved, emission point moved).
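
The reply-then-persist pattern described above can be sketched with a plain GenServer and Task.Supervisor. `PersistSketch` and its `:persist_fn` option are hypothetical stand-ins; the real AppServer, its storage adapter, and its error logging are not reproduced here.

```elixir
defmodule PersistSketch do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)
  def put(server, app, token), do: GenServer.call(server, {:put, app, token})

  @impl true
  def init(opts) do
    # The sketch links its own Task.Supervisor; a real tree would supervise it as a sibling.
    {:ok, task_sup} = Task.Supervisor.start_link()
    table = :ets.new(:persist_sketch, [:set, :public])
    {:ok, %{table: table, task_sup: task_sup, persist_fn: Keyword.fetch!(opts, :persist_fn)}}
  end

  @impl true
  def handle_call({:put, app, token}, _from, state) do
    # 1. Update the in-memory cache synchronously.
    :ets.insert(state.table, {app, token})

    # 2. Hand the slow durable write to a supervised task.
    persist_fn = state.persist_fn
    {:ok, _pid} = Task.Supervisor.start_child(state.task_sup, fn -> persist_fn.(app, token) end)

    # 3. Reply without waiting for the write to finish.
    {:reply, {:ok, token}, state}
  end
end
```

The caller's latency is then bounded by the ETS insert, while a slow or failing write only affects the background task.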

v0.4.0 — unreleased

Changed — Per-app Fly tokens (audit E1; closes H3, H4)

  • Replaced the singleton ExAtlas.Fly.Tokens.Server with a per-app ExAtlas.Fly.Tokens.AppServer supervised under ExAtlas.Fly.Tokens.Supervisor. Token resolution for one app no longer blocks resolution for any other. A thundering herd of CLI acquisitions (e.g. post-VM-restart across N apps) now runs in parallel rather than serialized behind a single mailbox.
  • ExAtlas.Fly.Tokens.Server is removed. The documented public API (ExAtlas.Fly.Tokens.{get/1, invalidate/1, set_manual/2}) is unchanged and remains the stable entry point.
  • Shared ETS table (:ex_atlas_fly_tokens) is now :public and owned by ExAtlas.Fly.Tokens.ETSOwner, outliving individual AppServer crashes. A crashed AppServer restarts with its cache intact; an ETSOwner crash rebuilds the whole tokens subtree via :rest_for_one (Registry survives, DynamicSupervisor + every AppServer restart).
  • Concurrent Tokens.get/1 calls for the same app coalesce at the AppServer mailbox — only the first-in-line caller invokes the CLI; subsequent callers re-check ETS (filled by the first) before descending the resolution chain.
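
The coalescing behaviour in the last entry can be illustrated with a small stand-alone GenServer. This is a sketch under simplifying assumptions: `CoalesceSketch` is hypothetical, it owns its ETS table itself (the entry above describes a separate ETSOwner), and `acquire_fn` stands in for the CLI call.

```elixir
defmodule CoalesceSketch do
  use GenServer

  @table :coalesce_sketch_tokens

  def start_link(acquire_fn), do: GenServer.start_link(__MODULE__, acquire_fn)

  # Fast path: callers read ETS directly and only fall back to the server on a miss.
  def get(server, app) do
    case :ets.lookup(@table, app) do
      [{^app, token}] -> {:ok, token}
      [] -> GenServer.call(server, {:get, app})
    end
  end

  @impl true
  def init(acquire_fn) do
    :ets.new(@table, [:named_table, :public, :set])
    {:ok, acquire_fn}
  end

  @impl true
  def handle_call({:get, app}, _from, acquire_fn) do
    # Re-check ETS: an earlier caller in the mailbox may have filled it already.
    case :ets.lookup(@table, app) do
      [{^app, token}] ->
        {:reply, {:ok, token}, acquire_fn}

      [] ->
        token = acquire_fn.(app)
        :ets.insert(@table, {app, token})
        {:reply, {:ok, token}, acquire_fn}
    end
  end
end
```

Because the mailbox serializes requests for one app, only the first queued caller pays for the acquisition; later callers hit the re-check.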

Added

  • [:ex_atlas, :fly, :token, :acquire] :stop metadata gains a new :acquirer field — :facade for pure ETS fast-path hits (no AppServer consulted) or :app_server for slow-path / coalesced resolutions. Existing handlers that match only on :source are unaffected. See guides/telemetry.md for the diagnostic interpretation.
  • ExAtlas.Fly.Tokens.Supervisor.whereis_app_server/2 and resolve_app_server/2 — lookup / resolve-or-start helpers. Primarily for tests.
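
A handler that reads the new :acquirer field alongside :source might look like the sketch below. It assumes the :stop event carries a :duration measurement in native time units, as events emitted through :telemetry.span/3 do; `TokenTelemetrySketch` is a hypothetical module.

```elixir
defmodule TokenTelemetrySketch do
  # Reduces :stop metadata to {acquirer, source}. Events emitted before the
  # :acquirer field existed fall back to :unknown.
  def summarize(metadata) do
    {Map.get(metadata, :acquirer, :unknown), Map.get(metadata, :source, :none)}
  end

  # Handler with the four-argument shape that :telemetry.attach/4 expects.
  def handle_event([:ex_atlas, :fly, :token, :acquire, :stop], measurements, metadata, _config) do
    # Assumption: :duration is present and expressed in native time units.
    duration = Map.get(measurements, :duration, 0)
    micros = System.convert_time_unit(duration, :native, :microsecond)
    {acquirer, source} = summarize(metadata)
    IO.puts("token acquire via #{acquirer}/#{source} took #{micros} microseconds")
  end
end

# Attaching requires the :telemetry dependency in the host app:
#
#   :telemetry.attach(
#     "token-acquire-sketch",
#     [:ex_atlas, :fly, :token, :acquire, :stop],
#     &TokenTelemetrySketch.handle_event/4,
#     nil
#   )
```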

v0.3.1 — 2026-04-22

Added — Telemetry for Fly platform ops

  • [:ex_atlas, :fly, :token, :acquire] span events around every ExAtlas.Fly.Tokens.get/1 call. :stop metadata includes source: (:ets / :storage / :config / :cli / :manual / :none) so operators can measure cache-hit rate and acquisition-path latency.
  • [:ex_atlas, :fly, :logs, :fetch] span events around ExAtlas.Fly.Logs.Client.fetch_logs/3. Metadata: {app, status, count}. Inherited automatically by fetch_logs_with_retry/2.
  • [:ex_atlas, :fly, :deploy, :line] (one per non-empty output line) + [:ex_atlas, :fly, :deploy, :exit] (one per deploy termination) from Deploy.stream_deploy/3. Line content is deliberately excluded — Fly build output can contain bearer tokens.

See guides/telemetry.md for the full event reference.

Added — Shared TokenStorage conformance suite

  • ExAtlas.Fly.TokenStorageConformance — a use-able ExUnit macro that any TokenStorage implementation can adopt to inherit the full get/put/delete contract coverage across :cached and :manual keys. Mirrors the existing ExAtlas.Test.ProviderConformance pattern.
  • Memory and Dets both run under the shared suite now, so any future adapter (Redis, Postgres, vault) can prove parity with one use line.

v0.3.0 — unreleased

Changed — Fly token / streamer return contracts

  • ExAtlas.Fly.Tokens.set_manual/2 (and Tokens.Server.set_manual_token/3) now return :ok | {:error, {:persist_failed, reason}} instead of always :ok. Manual tokens are not re-acquirable, so storage failures must be surfaced rather than silently logged. Callers that pattern-match on :ok should handle the error tuple.
  • ExAtlas.Fly.subscribe_logs/3 (and Streamer.subscribe/3) now return :ok | {:error, :no_streamer} when no streamer can be resolved (e.g. the Fly sub-tree is disabled). Previously this case returned a silent :ok with no messages ever arriving.
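
Call sites that previously matched only on :ok can be updated along these lines. The sketch takes the already-returned value as its argument, so it makes no assumption about the call arguments; `ContractSketch` and the atoms it returns are hypothetical.

```elixir
defmodule ContractSketch do
  # Return values of ExAtlas.Fly.Tokens.set_manual/2 as described above.
  def on_set_manual(:ok), do: :saved
  def on_set_manual({:error, {:persist_failed, reason}}), do: {:not_saved, reason}

  # Return values of ExAtlas.Fly.subscribe_logs/3 as described above.
  def on_subscribe(:ok), do: :subscribed
  def on_subscribe({:error, :no_streamer}), do: :fly_subtree_unavailable
end
```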

Fixed — Hardening round

  • ExAtlas.Fly.Tokens.Server persist/3 (cached path) now returns :ok | {:error, {:persist_failed, reason}} and logs failures at :error level with {app, reason} metadata instead of :warning with interpolated strings. ETS still holds a fresh token for the session, but a silent storage outage is now operator-visible.
  • ExAtlas.Fly.Dispatcher :mfa mode wraps the host MFA in try/rescue/catch so a raising MFA no longer takes down the caller (most commonly the log Streamer, whose crash drops the pagination cursor). Failures are logged at :error level with the topic and MFA identity.
  • ExAtlas.Fly.TokenStorage.Dets refuses to auto-recreate a corrupt manual.dets file on startup — manual tokens are bearer credentials that are NOT re-acquirable. Returns {:stop, {:manual_dets_corrupt, path, reason}} and preserves the file for operator intervention. The cached-token path still recreates (re-acquirable, perf regression only).
  • ExAtlas.Fly.TokenStorage.Dets now chmods the storage dir to 0700 and each DETS file to 0600 after open. Default umask on typical Linux/macOS left token files world- or group-readable.
  • mix ex_atlas.install surfaces .gitignore update failures as an Igniter.add_notice with the exact line the user must add manually; previously the installer silently swallowed the exception and moved on.
  • ExAtlas.Fly.TokenStorage.Memory (test support) now catches :exit from pre-init reads and returns :error, matching the Dets rescue ArgumentError semantics so the test double is faithful to prod.
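
The :mfa hardening above amounts to isolating the host callback from the dispatching process. A minimal sketch of that idea follows; `SafeMfaSketch` is hypothetical, and appending `topic` and `message` to the configured argument list is an assumption about how the MFA is invoked.

```elixir
defmodule SafeMfaSketch do
  require Logger

  # Calls the host MFA and converts raises, throws, and exits into error tuples
  # so the calling process keeps running.
  def dispatch({mod, fun, args}, topic, message) do
    # Assumption: topic and message are appended to the configured args.
    apply(mod, fun, args ++ [topic, message])
    :ok
  rescue
    exception ->
      Logger.error("dispatch MFA raised", topic: inspect(topic), mfa: inspect({mod, fun}))
      {:error, {:raised, exception}}
  catch
    kind, value ->
      Logger.error("dispatch MFA #{kind}", topic: inspect(topic), mfa: inspect({mod, fun}))
      {:error, {kind, value}}
  end
end
```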

v0.2.0 — unreleased

Fixed

  • ExAtlas.Fly.Logs.Client.next_start_time/1 no longer crashes the Streamer when a log entry has a nil or malformed ISO-8601 timestamp; unparseable entries are logged and skipped.
  • ExAtlas.Fly.Deploy.stream_deploy/3 cleans both the activity and absolute timers symmetrically across all exit branches, so no stray {:deploy_*_timeout, _} message leaks into a long-lived caller's mailbox. Exposes :activity_timeout_ms / :max_timeout_ms options.
  • ExAtlas.Fly.Tokens.Server now implements terminate/2 to delete its named ETS table, avoiding an ArgumentError on supervisor restart, and defensively reclaims an existing table in init/1.
  • ExAtlas.Fly.Tokens.Server shuts down a hung fly CLI task with :brutal_kill so the configured cli_timeout_ms is actually the mailbox blocking time, not cli_timeout_ms + 5_000.
  • ExAtlas.Fly.Logs.StreamerSupervisor uses :rest_for_one with a generous restart budget on the DynamicSupervisor so one app's misbehaving streamer no longer tears down the registry and every other app's pagination cursor.
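
The timestamp fix in the first entry comes down to skipping entries that cannot be parsed rather than crashing on them. The sketch below shows one way to do that; it is not the library's implementation, `TimestampSketch` is hypothetical, and the `"timestamp"` key on each entry is an assumption.

```elixir
defmodule TimestampSketch do
  # Returns the latest parseable timestamp among `entries`, or nil when none parses.
  def latest_timestamp(entries) do
    entries
    |> Enum.flat_map(fn entry ->
      # Assumption: each entry is a map with a "timestamp" key.
      case parse(Map.get(entry, "timestamp")) do
        {:ok, datetime} -> [datetime]
        :error -> []
      end
    end)
    |> Enum.max(DateTime, fn -> nil end)
  end

  # DateTime.from_iso8601/1 only accepts binaries, so nil is rejected up front.
  defp parse(value) when is_binary(value) do
    case DateTime.from_iso8601(value) do
      {:ok, datetime, _offset} -> {:ok, datetime}
      {:error, _reason} -> :error
    end
  end

  defp parse(_not_a_string), do: :error
end
```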

Added — Fly.io platform operations

  • ExAtlas.Fly top-level facade for Fly.io platform ops: discover_apps/1, deploy/2, stream_deploy/3, subscribe_logs/3, unsubscribe_logs/1, subscribe_deploy/1, unsubscribe_deploy/1.
  • ExAtlas.Fly.Deploy — fly deploy --remote-only with 15 min timeout (deploy/2) and Port-based streaming (stream_deploy/3) with a 5 min activity timer and 30 min absolute cap. Dispatches {:ex_atlas_fly_deploy, ticket_id, line} on each line.
  • ExAtlas.Fly.Logs.Client — Req-backed client for the Fly Machines log API (NDJSON, cursor pagination, automatic 401 retry).
  • ExAtlas.Fly.Logs.Streamer + StreamerSupervisor — per-app GenServer that polls the log API every 2 s, dispatches {:ex_atlas_fly_logs, app, entries}, and stops once all subscribers have disconnected (monitor-based).
  • ExAtlas.Fly.Tokens + ExAtlas.Fly.Tokens.Server — cache-first token resolver. Order: ETS → TokenStorage → ~/.fly/config.yml → fly tokens create readonly → manual override.
  • ExAtlas.Fly.TokenStorage — pluggable behaviour for durable token persistence. Default impl ExAtlas.Fly.TokenStorage.Dets is zero-config and survives VM restarts.
  • ExAtlas.Fly.Dispatcher — framework-agnostic broadcast. Modes: :registry (default, zero-dep), :phoenix_pubsub (when host uses Phoenix), or {:mfa, {m, f, a}} custom routing.
  • ExAtlas.Application now supervises the Fly sub-tree by default. Disable with config :ex_atlas, :fly, enabled: false.
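
The cache-first resolution order listed for ExAtlas.Fly.Tokens can be pictured as an ordered chain in which the first source that yields a token wins. The sketch below shows the general shape only; `ResolveSketch`, the `{name, fun}` representation, and the return tuples are hypothetical.

```elixir
defmodule ResolveSketch do
  # `sources` is an ordered list of {name, zero-arity fun}. Each fun returns
  # {:ok, token} or :error. Later sources are not called once one succeeds.
  def resolve(sources) do
    Enum.find_value(sources, {:error, :none}, fn {name, lookup} ->
      case lookup.() do
        {:ok, token} -> {:ok, token, name}
        :error -> nil
      end
    end)
  end
end
```

Putting the cheapest lookups first means the expensive CLI step only runs when every cache has missed.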

Added — Igniter installer

  • mix ex_atlas.install — adds sensible config :ex_atlas, :fly defaults, creates the DETS storage directory, wires phoenix_pubsub when available.
  • mix ex_atlas.upgrade — runs per-version upgraders (no-op for 0.1.x → 0.2.0; reserved for future migrations).

Changed

  • Description and package scope broadened from "GPU/compute SDK" to "infrastructure SDK".
  • ExAtlas.Application's Fly sub-tree boots by default. The existing orchestrator sub-tree is still opt-in via start_orchestrator: true.

v0.1.0 — unreleased

Initial public release.

Added — Core API

  • ExAtlas top-level provider-agnostic module (spawn_compute/1, get_compute/2, list_compute/1, stop/2, start/2, terminate/2, run_job/1, get_job/2, cancel_job/2, stream_job/2, list_gpu_types/1, capabilities/1).
  • ExAtlas.Provider behaviour defining the contract every provider implements.
  • ExAtlas.Config — per-call > app-env > env-var resolution for provider and API key. Supports user-defined provider modules passed directly by name (no registration needed).
  • ExAtlas.Error — canonical error struct with :kind atoms (:unauthorized, :not_found, :rate_limited, :timeout, :unsupported, :validation, :provider, :transport, :unknown) and from_response/3 for translating HTTP responses.

Added — Normalized specs

Added — Providers

Added — Auth

  • ExAtlas.Auth.Token — cryptographically random 256-bit bearer tokens with SHA-256 hashing and constant-time comparison (Plug.Crypto).
  • ExAtlas.Auth.SignedUrl — S3-style HMAC-SHA256 signed URLs with expiry, for media streams and WebSockets that can't set headers.
  • Auto-injection: auth: :bearer on spawn_compute/1 mints a token, injects it into the pod as ATLAS_PRESHARED_KEY, and returns the handle in compute.auth.
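
The token scheme in the first entry (256 random bits, SHA-256 digest, constant-time comparison) can be sketched with the standard library alone. `TokenSketch` is hypothetical; the entry names Plug.Crypto for the comparison, whereas this sketch hand-rolls an XOR fold to stay dependency-free, and the URL-safe Base64 encoding is an assumption.

```elixir
defmodule TokenSketch do
  import Bitwise

  # 32 random bytes = 256 bits. The URL-safe Base64 encoding is an assumption.
  def mint, do: Base.url_encode64(:crypto.strong_rand_bytes(32), padding: false)

  # Only the digest needs to be stored on the verifying side.
  def hash(token), do: :crypto.hash(:sha256, token)

  def valid?(presented, stored_hash), do: secure_compare(hash(presented), stored_hash)

  # Compares every byte regardless of where the first mismatch occurs.
  defp secure_compare(left, right) when byte_size(left) == byte_size(right) do
    left
    |> :binary.bin_to_list()
    |> Enum.zip(:binary.bin_to_list(right))
    |> Enum.reduce(0, fn {a, b}, acc -> acc ||| bxor(a, b) end)
    |> Kernel.==(0)
  end

  defp secure_compare(_left, _right), do: false
end
```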

Added — Orchestrator (opt-in)

Added — Phoenix LiveDashboard integration

  • ExAtlas.LiveDashboard.ComputePage — drop-in Phoenix.LiveDashboard.PageBuilder page. Host apps mount it via additional_pages: [atlas: ExAtlas.LiveDashboard.ComputePage]. Live table with Touch/Stop/Terminate row actions. Auto-refreshing; subscribes to ExAtlas.PubSub for push updates when available.
  • Guarded by Code.ensure_loaded?(Phoenix.LiveDashboard.PageBuilder) so the module only compiles when LiveDashboard is in the host app's deps.

Added — HTTP + observability

  • Every REST / runtime / GraphQL request goes through Req with retry: :transient, 3 retries by default, and telemetry.
  • Telemetry events [:ex_atlas, <provider>, :request] with %{status: status} measurements and %{api, method, url} metadata.
  • Per-call Req overrides via req_options:.

Added — Testing

  • ExAtlas.Test.ProviderConformance — shared ExUnit suite every provider implementation must pass. use-macro form accepts :reset MFA for test isolation.
  • Full unit coverage (68 tests, 3 doctests).

Added — Documentation

  • Comprehensive README.md with architecture diagram, capability matrix, GPU mapping table, error kinds, security considerations, FAQ, and roadmap.
  • guides/getting_started.md, guides/transient_pods.md, guides/writing_a_provider.md, guides/telemetry.md, guides/testing.md — long-form deep-dives surfaced via ex_doc extras.
  • Full module-level @moduledoc on every public module.