Emily — architecture


The current shape of the library, in a form that contributors can read without sifting milestone history. For per-milestone rationale see PLAN.md; for open work see ROADMAP.md.

Layers

Emily.Compiler    (Nx.Defn.Compiler)  validates opts, pins the result backend
Emily.Backend     (Nx.Backend)        op-by-op translation to Native
Emily.Native      (NIF shim)          one function per MLX op, no policy
Worker threads    (one OS thread per stream)
MLX C++                               statically linked into libemily; mlx.metallib alongside

Dispatch is unidirectional in the deadlock-class sense: every NIF enqueues its work on a worker and returns immediately. The worker posts its result back to the caller via enif_send — fire-and-forget, no synchronous return value — and the Elixir wrapper awaits the message with a plain receive. C++ never calls into Elixir code, and no NIF ever blocks on a BEAM operation. Each layer has its own oracle so a bug can only be introduced in the layer where its test fails — see Testing philosophy.
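The enqueue-then-await shape described above can be sketched in plain Elixir. This is an illustration only: a spawned BEAM process stands in for the OS worker thread, `send/2` stands in for `enif_send`, and the module and function names (`DispatchSketch`, `enqueue/2`, `await/1`) are hypothetical, not part of Emily.

```elixir
defmodule DispatchSketch do
  # Stand-in for the worker: a BEAM process that runs jobs in arrival order.
  # In Emily the worker is an OS thread; this process only mimics its role.
  def start_worker do
    spawn(fn -> loop() end)
  end

  defp loop do
    receive do
      {:job, caller, ref, fun} ->
        # Errors are caught at the boundary and returned as tagged tuples.
        reply =
          try do
            {:ok, fun.()}
          rescue
            e -> {:error, Exception.message(e)}
          end

        # Fire-and-forget reply, analogous to enif_send.
        send(caller, {ref, reply})
        loop()
    end
  end

  # Mirrors the NIF side: enqueue the work and return immediately.
  def enqueue(worker, fun) do
    ref = make_ref()
    send(worker, {:job, self(), ref, fun})
    ref
  end

  # Mirrors the Elixir wrapper: a plain receive keyed on the ref.
  def await(ref) do
    receive do
      {^ref, reply} -> reply
    end
  end
end

worker = DispatchSketch.start_worker()
ref = DispatchSketch.enqueue(worker, fn -> 1 + 2 end)
IO.inspect(DispatchSketch.await(ref))
```

The worker never calls back into the caller and the caller never blocks inside `enqueue/2`, which is the property the paragraph above relies on to rule out this class of deadlock.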

Core design decisions

  1. Backend-first; compiler layered on top. The Nx.Backend is enough to run Bumblebee. Emily.Compiler delegates the expression walk to Nx.Defn.Evaluator and adds two adjustments: pin the result backend to Emily.Backend via __to_backend__/1, and cap :max_concurrency at 1. mlx::core::compile is deliberately not wrapped — the fusion win on transformer-shaped workloads was measured below the 1.20× gate (see PLAN M6).

  2. Trace in Elixir, not in C++. Nx.Defn.Expr is already a fully traced tree; the compiler walks it from Elixir and emits one Native NIF call per node (lib/emily/native.ex).

  3. One resource type: Tensor wrapping mlx::array. MLX's refcount does the heavy lifting; fine's ResourcePtr adds one BEAM-managed ref. c_src/emily/tensor.hpp is the source of truth for the wrapper.

  4. Worker-thread dispatch. Every NIF enqueues its work on a dedicated OS thread (the worker) that owns one MLX stream and its Metal command encoder. NIFs return immediately after enqueueing; the worker posts {ref, {:ok, result}} back via enif_send and the public Elixir wrapper awaits it with a plain receive. No BEAM scheduler (regular or dirty) blocks on MLX work. Because the MLX stream is pinned to its worker thread, Metal's per-thread CommandEncoder state stays consistent regardless of how the BEAM migrates Elixir processes between schedulers.

  5. Default worker + per-process streams. The Emily.MlxStream.Default GenServer owns the default worker. Every op uses it unless the caller has installed a per-process worker via Emily.Stream.with_stream/2. Per-process workers are the recommended pattern for concurrent inference on a shared model — weights live in one MLX buffer, each worker reads from it independently.

  6. Cache compiled defn in the closure Nx.Defn.compile/3 returns, not in ETS or :persistent_term. Bumblebee and Nx.Serving hold that closure on warmup, so subsequent calls skip the walk. An external {mfa, input_signature} cache was prototyped and dropped — the per-call ETS deep-copy cost on a Qwen3-sized expression tree exceeded any reuse savings.

  7. No f64. Hard error at the Backend with a clear message pointing to f32. MLX has no f64 primitive on Metal; not worth working around. The same goes for {:f8_e4m3fn, 8} (introduced in Nx 0.11), which is rejected at the boundary with a "no MLX primitive" ArgumentError.

  8. Error discipline. Every NIF catches C++ exceptions at the boundary and returns {:error, term}. Never unwind across enif_ calls. Async errors are annotated with op, input shapes/dtypes, options, and worker context — see the async helper module under lib/emily/native/async.ex.

  9. Zero-copy to_binary. Nx.to_binary/1 on an Emily tensor returns a BEAM resource binary aliasing the MLX buffer via enif_make_resource_binary; the resource retains a refcount on the mlx::array so the buffer survives until the BEAM binary is GC'd. from_binary retains its memcpy — MTL::newBufferWithBytesNoCopy requires page-aligned, page-sized memory that real-world inputs (safetensors, :file.pread) never provide.
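Decisions 1 and 6 can be illustrated together: compile once through Emily.Compiler, then hold on to the returned closure. The snippet below is a sketch under assumptions, not the library's documented usage. It assumes `:nx` and `:emily` are installed, that the NIF loads on the host, and that Emily.Compiler can be passed as the `:compiler` option like any other Nx.Defn.Compiler; the shapes and the function body are arbitrary.

```elixir
# Sketch only: assumes :nx and :emily are available and the NIF loads on this host.

# Templates describe the input shapes and types without holding data.
templates = [Nx.template({2, 2}, :f32), Nx.template({2, 2}, :f32)]

# Nx.Defn.compile/3 returns a closure; keeping a reference to it is the cache.
compiled =
  Nx.Defn.compile(
    fn a, b -> Nx.add(Nx.dot(a, b), 1.0) end,
    templates,
    compiler: Emily.Compiler
  )

a = Nx.tensor([[1.0, 2.0], [3.0, 4.0]], type: :f32, backend: Emily.Backend)
b = Nx.tensor([[1.0, 0.0], [0.0, 1.0]], type: :f32, backend: Emily.Backend)

# Both calls reuse the same closure rather than compiling again.
first = compiled.(a, b)
second = compiled.(a, b)

# Copy to Nx.BinaryBackend for inspection on the BEAM side.
IO.inspect(Nx.backend_transfer(first, Nx.BinaryBackend))
IO.inspect(Nx.backend_transfer(second, Nx.BinaryBackend))
```

Holding the closure in the calling process (or in whatever owns the serving) avoids copying the expression tree through ETS on each call, which is the cost that sank the external cache prototype.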

Concurrency model

MLX dispatches GPU work through Metal command queues. Emily owns one worker thread per command queue. The default worker is shared across the VM; per-process workers (Emily.Stream) let multiple processes run inference concurrently on a shared model.

Three viable configurations for serving:

| Configuration | Weights | GPU queues | When to use |
| --- | --- | --- | --- |
| Single serving, default stream | | 1 (shared) | Default. Simplest; fine for single-user / batched. |
| Single serving + pool of Emily.Streams | | N | Concurrent inference on a shared model. Large models. |
| K servings (pooled), default stream | | 1 (shared) | Small models where CPU serving work dominates GPU. |

The README's Concurrency model section has the worked code; this note is for the architecture map only.

Memory model

MLX buffers live outside the BEAM heap. Emily.Memory is the public allocator API:

  • stats/0 samples active / peak / cache bytes and emits the [:emily, :memory, :stats] telemetry event.
  • reset_peak/0 resets the high-water mark.
  • clear_cache/0 asks MLX to release cached reusable buffers — does not free live tensors. Tensors and resource binaries returned by Nx.to_binary/1 are released only after the owning BEAM references are garbage collected.

Emily.Telemetry.memory_stats/0 delegates here for back-compat; new code should call Emily.Memory directly.
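A minimal sketch of how the three Emily.Memory functions might be combined around a workload. It assumes `:nx` and `:emily` are installed. The return shape of `stats/0` is not specified here, so the example only inspects it rather than reading particular keys; the workload itself is arbitrary.

```elixir
# Sketch only: assumes :nx and :emily are available and the NIF loads on this host.

# Start from a clean high-water mark.
Emily.Memory.reset_peak()
IO.inspect(Emily.Memory.stats(), label: "before")

# Arbitrary workload: allocate a tensor on the Emily backend and force evaluation.
x = Nx.iota({1024, 1024}, type: :f32, backend: Emily.Backend)
total = x |> Nx.sum() |> Nx.to_number()
IO.inspect(total, label: "sum")

stats_after = Emily.Memory.stats()
IO.inspect(stats_after, label: "after work")

# Releases cached reusable buffers only; `x` is still live and stays allocated.
Emily.Memory.clear_cache()
IO.inspect(Emily.Memory.stats(), label: "after clear_cache")
```

Because live tensors are freed only when their BEAM references are collected, a drop in active bytes should not be expected from `clear_cache/0` alone while `x` is still referenced.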

Observability

Span events at the evaluation boundary:

  • [:emily, :eval, *] — every Emily.eval/1 and the implicit evaluation inside to_binary.
  • [:emily, :to_binary, *] — both Emily.to_binary/1 and the Nx.Backend.to_binary path. Metadata: :shape, :dtype, :byte_size.
  • [:emily, :fallback, *] — every Nx.BinaryBackend fallback entry. Metadata: :op, :input_shapes, :input_dtypes.
  • [:emily, :block, :fallback] — discrete event each time the backend's block callback (Nx 0.12+ Nx.Block.* dispatch) falls through to the supplied default fun.
  • [:emily, :memory, :stats] — discrete event from Emily.Memory.stats/0.

Span instrumentation deliberately stops at the evaluation boundary rather than wrapping every graph-construction call site in Emily.Backend: those NIFs take <10μs and only extend the lazy graph, whereas the evaluation boundary is where MLX actually runs kernels.
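The events listed above can be consumed with an ordinary `:telemetry` handler. The sketch below makes two assumptions that this page does not confirm: that the `*` suffix follows the usual `:telemetry.span/3` convention (`:start`, `:stop`, `:exception`), and that `:stop` events carry a `:duration` measurement. The module name and handler id are hypothetical.

```elixir
defmodule EmilyTelemetrySketch do
  require Logger

  # Assumption: span events use the :telemetry.span/3 suffixes.
  @events [
    [:emily, :fallback, :stop],
    [:emily, :to_binary, :stop],
    [:emily, :memory, :stats]
  ]

  def attach do
    :telemetry.attach_many(
      "emily-sketch-logger",
      @events,
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Fallback entries are the ones most worth surfacing loudly.
  def handle_event([:emily, :fallback, :stop], measurements, metadata, _config) do
    Logger.warning(
      "emily fallback op=#{inspect(metadata[:op])} " <>
        "duration=#{inspect(measurements[:duration])}"
    )
  end

  # Everything else is logged at debug level without assuming specific keys.
  def handle_event(event, measurements, metadata, _config) do
    Logger.debug("#{inspect(event)} #{inspect(measurements)} #{inspect(metadata)}")
  end
end

:ok = EmilyTelemetrySketch.attach()
```

A handler that raises is detached by `:telemetry`, so the clauses above avoid pattern matching on metadata keys that might be absent.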

Fallback behaviour is configured via config :emily, :fallback:

  • :silent (default) — only the telemetry event fires.
  • :warn — one-shot Logger.warning per {op, input_shapes} pair.
  • :raise - RuntimeError on fallback entry; CI-friendly.

The legacy config :emily, :warn_on_fallback, true boolean is still honoured when :fallback is unset.
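As a config fragment, the CI-friendly setting might look like the following. The file name is an assumption; any config file evaluated for the test environment would do.

```elixir
# config/test.exs (file name assumed for illustration)
import Config

# Fail fast in CI whenever an op falls back to Nx.BinaryBackend.
config :emily, :fallback, :raise
```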

Debug assertions

Two compile-time opt-in flags re-enable runtime checks that MLX (and every other GPU backend) skips by default. Both default false; the guarded branches are dead-code eliminated by the Elixir compiler, so the runtime cost when off is zero.

  • :debug_bounds_check — raises on out-of-range / negative indices in gather / take / take_along_axis / indexed_add / indexed_put.
  • :debug_detect_nan_inf — scans results of matmul, the fused layer_norm / rms_norm, and both fused SDPA variants.

Each check is a per-op MLX reduction plus a scalar readback — a worker sync that breaks lazy-graph fusion. Leave off in release builds.
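A sketch of enabling both flags for a development build. This assumes the flags are read from the `:emily` application environment at compile time, which the text above implies but does not spell out; the file name is also an assumption.

```elixir
# config/dev.exs (file name assumed for illustration)
import Config

# Assumption: both flags live under the :emily application environment
# and are read when the dependency is compiled.
config :emily,
  debug_bounds_check: true,
  debug_detect_nan_inf: true
```

Since the checks are compile-time, toggling them most likely requires recompiling the dependency (for example with `mix deps.compile emily --force`) before the change takes effect.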

Build & packaging

  • Hex consumers download a precompiled NIF (libemily.{so,dylib} + mlx.metallib) from the GitHub release for the pinned version, SHA256-verified against a .sha256 sidecar fetched alongside. No C++ toolchain or cmake required.
  • Contributors build from source: mix deps.get clones MLX into deps/mlx_src at the pinned tag, scripts/build-mlx.sh cmake-builds libmlx.a + mlx.metallib into $EMILY_CACHE/mlx-<version>-<variant>/, and elixir_make links the NIF against it.
  • MLX variant selection is via config :emily, variant: :aot | :jit. :aot is the default and works on macOS 14+; :jit ships smaller artefacts but requires macOS 26.2+ at runtime.
  • mix emily.doctor verifies the local install: host platform, active variant, native artefacts in priv/, NIF loadability, and a tiny Emily.Backend smoke test that asserts no silent fallback to Nx.BinaryBackend.

Testing philosophy

Each layer is tested against its own oracle. A bug can only be introduced in the layer where its test fails — no cross-layer mystery bugs.

| Layer | Oracle | Harness |
| --- | --- | --- |
| Native | Hand-computed expected values | ExUnit unit tests |
| Backend | Nx.BinaryBackend on the same inputs | StreamData property tests + Nx conformance |
| Compiler | Emily.Backend in non-defn mode | Equivalence tests (same function, two modes) |
| Grad | Nx.BinaryBackend grad + finite differences + EXLA CPU | StreamData property tests + numerical oracle + EXLA golden |
| Training | Nx.BinaryBackend loss trajectory | Curve-matching; MNIST convergence (:training_full) |
| E2E | HuggingFace Transformers reference slices | Bumblebee conformance suites with cached weights |
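To make the Backend row concrete, here is a minimal sketch of an oracle comparison: the same op on the same inputs, once on Nx.BinaryBackend and once on Emily.Backend. It uses one fixed input rather than StreamData generators, and the module name, the choice of `Nx.exp/1`, and the tolerance are illustrative assumptions, not taken from the project's test suite.

```elixir
# Sketch only: assumes :nx and :emily are available and the NIF loads on this host.
ExUnit.start()

defmodule BackendOracleSketchTest do
  use ExUnit.Case, async: true

  test "exp on Emily.Backend matches the Nx.BinaryBackend oracle" do
    data = [[0.0, 1.0], [2.0, 3.0]]

    # Oracle: the reference implementation on the same inputs.
    expected =
      data
      |> Nx.tensor(type: :f32, backend: Nx.BinaryBackend)
      |> Nx.exp()

    # Subject: the same op on Emily.Backend, copied back for comparison.
    actual =
      data
      |> Nx.tensor(type: :f32, backend: Emily.Backend)
      |> Nx.exp()
      |> Nx.backend_transfer(Nx.BinaryBackend)

    # Tolerance-aware comparison; all_close returns a u8 scalar (1 or 0).
    assert Nx.to_number(Nx.all_close(actual, expected, atol: 1.0e-5)) == 1
  end
end
```

A failure here points at the Backend layer specifically, because the Native layer underneath is covered by its own hand-computed oracle.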

Soak harnesses (all under test/soak/, @tag :soak, opt-in):

  • memory_test — 10k iterations; MLX memory returns to baseline.
  • training_test — 1k training steps; baseline restored after Emily.Memory.clear_cache/0.
  • backend_concurrency_test / eval_concurrency_test / stream_concurrency_test — parallel inference under the default worker, the evaluation path, and per-process streams; determinism + no crashes.
  • backend_soak_test — broad backend exerciser; allocator drift over a large mixed-op workload.
  • quantized_memory_test — quantized-matmul loop, distinct allocator pattern from fp16 inference.
  • zero_copy_roundtrip_test - to_binary aliases the MLX buffer rather than copying; tested via allocator-stats deltas.

Risks and mitigations

| Risk | Mitigation |
| --- | --- |
| MLX op semantics drift from Nx expectations | Property tests explicitly generate edge cases; document intentional divergences |
| Metal driver bugs in specific macOS versions | Pin known-good macOS in CI; test matrix across 14/26 |
| f16/bf16 accumulation differences from EXLA | Tolerance-aware comparisons; document expected divergence |
| Upstream Nx API changes (Nx.Backend / Nx.Defn.Compiler etc) | Version-pin Nx; coordinate with elixir-nx maintainers |
| MLX upstream API churn on source builds | @mlx_version pin; audit on bump; mix emily.doctor surfaces toolchain mismatches |