The current shape of the library, in a form that contributors can read
without sifting through milestone history. For per-milestone rationale see
PLAN.md; for open work see ROADMAP.md.
Layers
- Emily.Compiler (Nx.Defn.Compiler): validates opts, pins the result backend
- Emily.Backend (Nx.Backend): op-by-op translation to Native
- Emily.Native (NIF shim): one function per MLX op, no policy
- Worker threads (one OS thread per stream)
- MLX C++: statically linked into libemily; mlx.metallib alongside

Dispatch is unidirectional, which rules out a whole class of deadlocks:
every NIF enqueues its work on a worker and returns immediately. The
worker posts its result back to the caller via enif_send (fire-and-forget,
no synchronous return value) and the Elixir wrapper awaits the message
with a plain receive. C++ never calls into Elixir code, and no NIF ever
blocks on a BEAM operation. Each layer has its own oracle, so a failing
test points at the layer that introduced the bug (see Testing philosophy).
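The enqueue-then-await shape described above can be sketched in plain Elixir. This is an illustration only: a BEAM process stands in for the OS worker thread, `send/2` stands in for `enif_send`, and the `DispatchSketch` module and its function names are hypothetical, not part of Emily.

```elixir
defmodule DispatchSketch do
  # Stand-in for a worker thread: drains jobs from its mailbox and posts
  # each result back to the caller. It never waits on the caller.
  def start_worker do
    spawn(fn -> worker_loop() end)
  end

  defp worker_loop do
    receive do
      {:job, caller, ref, fun} ->
        # Errors are caught at the boundary and returned as data.
        result =
          try do
            {:ok, fun.()}
          rescue
            e -> {:error, Exception.message(e)}
          end

        send(caller, {ref, result})
        worker_loop()
    end
  end

  # Stand-in for a NIF: enqueue the work and return immediately.
  def enqueue(worker, fun) do
    ref = make_ref()
    send(worker, {:job, self(), ref, fun})
    ref
  end

  # Stand-in for the public wrapper: a plain receive keyed on the reference.
  def await(ref, timeout \\ 5_000) do
    receive do
      {^ref, result} -> result
    after
      timeout -> {:error, :timeout}
    end
  end
end

worker = DispatchSketch.start_worker()
ref = DispatchSketch.enqueue(worker, fn -> 1 + 2 end)
IO.inspect(DispatchSketch.await(ref))
```

Because the caller matches on a unique reference, several requests can be in flight on one worker without their replies being confused.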
Core design decisions
- Backend-first; compiler layered on top. The Nx.Backend is enough to run Bumblebee. Emily.Compiler delegates the expression walk to Nx.Defn.Evaluator and adds two adjustments: pin the result backend to Emily.Backend via __to_backend__/1, and cap :max_concurrency at 1. mlx::core::compile is deliberately not wrapped: the fusion win on transformer-shaped workloads was measured below the 1.20× gate (see PLAN M6).
- Trace in Elixir, not in C++. Nx.Defn.Expr is already a fully traced tree; the compiler walks it from Elixir and emits one Native NIF call per node (lib/emily/native.ex).
- One resource type: Tensor wrapping mlx::array. MLX's refcount does the heavy lifting; fine's ResourcePtr adds one BEAM-managed ref. c_src/emily/tensor.hpp is the source of truth for the wrapper.
- Worker-thread dispatch. Every NIF enqueues its work on a dedicated OS thread (the worker) that owns one MLX stream and its Metal command encoder. NIFs return immediately after enqueueing; the worker posts {ref, {:ok, result}} back via enif_send and the public Elixir wrapper awaits it with a plain receive. No BEAM scheduler (regular or dirty) blocks on MLX work. Because the MLX stream is pinned to its worker thread, Metal's per-thread CommandEncoder state stays consistent regardless of how the BEAM migrates Elixir processes between schedulers.
- Default worker + per-process streams. The Emily.MlxStream.Default GenServer owns the default worker. Every op uses it unless the caller has installed a per-process worker via Emily.Stream.with_stream/2. Per-process workers are the recommended pattern for concurrent inference on a shared model: weights live in one MLX buffer, and each worker reads from it independently.
- Cache compiled defn in the closure Nx.Defn.compile/3 returns, not in ETS or :persistent_term. Bumblebee and Nx.Serving hold that closure on warmup, so subsequent calls skip the walk. An external {mfa, input_signature} cache was prototyped and dropped because the per-call ETS deep-copy cost on a Qwen3-sized expression tree exceeded any reuse savings.
- No f64. Hard error at the Backend with a clear message pointing to f32. MLX has no f64 primitive on Metal, and it is not worth working around. The same goes for {:f8_e4m3fn, 8} (introduced in Nx 0.11), which is rejected at the boundary with a "no MLX primitive" ArgumentError.
- Error discipline. Every NIF catches C++ exceptions at the boundary and returns {:error, term}. Never unwind across enif_* calls. Async errors are annotated with op, input shapes/dtypes, options, and worker context; see the async helper module under lib/emily/native/async.ex.
- Zero-copy to_binary. Nx.to_binary/1 on an Emily tensor returns a BEAM resource binary aliasing the MLX buffer via enif_make_resource_binary; the resource retains a refcount on the mlx::array so the buffer survives until the BEAM binary is GC'd. from_binary retains its memcpy because MTL::newBufferWithBytesNoCopy requires page-aligned, page-sized memory that real-world inputs (safetensors, :file.pread) never provide.
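The "cache in the closure" decision can be illustrated without Nx. The sketch below is a hypothetical stand-in, not Emily's compiler: `ClosureCacheSketch.compile/1` does its preparatory pass once and returns a function that captures the result, so whoever holds the function pays nothing for preparation on later calls. The module name and the polynomial workload are assumptions chosen only to keep the example self-contained.

```elixir
defmodule ClosureCacheSketch do
  # Hypothetical stand-in for the expensive step (the expression walk):
  # pair each coefficient with its exponent once, up front.
  defp walk(coefficients), do: Enum.with_index(coefficients)

  # Returns a closure that captures the prepared plan. The plan lives in
  # the closure's environment, not in ETS or :persistent_term.
  def compile(coefficients) do
    plan = walk(coefficients)

    fn x ->
      Enum.reduce(plan, 0, fn {c, i}, acc -> acc + c * Integer.pow(x, i) end)
    end
  end
end

# The holder keeps the closure and reuses it; walk/1 ran exactly once.
predict = ClosureCacheSketch.compile([1, 2, 3])
IO.inspect(predict.(2))
IO.inspect(predict.(3))
```

A term read from ETS is copied into the reading process on every lookup, which is the cost the dropped prototype ran into; a closure already held by the calling process has no such per-call copy.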
Concurrency model
MLX dispatches GPU work through Metal command queues. Emily owns one
worker thread per command queue. The default worker is shared across
the VM; per-process workers (Emily.Stream) let multiple processes
run inference concurrently on a shared model.
Three viable configurations for serving:
| Configuration | Weights | GPU queues | When to use |
|---|---|---|---|
| Single serving, default stream | 1× | 1 (shared) | Default. Simplest; fine for single-user / batched. |
| Single serving + pool of Emily.Streams | 1× | N | Concurrent inference on a shared model. Large models. |
| K servings (pooled), default stream | K× | 1 (shared) | Small models where CPU serving work dominates GPU. |
The README's Concurrency model section has the worked code; this note is for the architecture map only.
Memory model
MLX buffers live outside the BEAM heap. Emily.Memory is the public
allocator API:
- stats/0 samples active / peak / cache bytes and emits the [:emily, :memory, :stats] telemetry event.
- reset_peak/0 resets the high-water mark.
- clear_cache/0 asks MLX to release cached reusable buffers; it does not free live tensors. Tensors and resource binaries returned by Nx.to_binary/1 are released only after the owning BEAM references are garbage collected.
Emily.Telemetry.memory_stats/0 delegates here for back-compat; new
code should call Emily.Memory directly.
Observability
Span events at the evaluation boundary:
- [:emily, :eval, *]: every Emily.eval/1 and the implicit evaluation inside to_binary.
- [:emily, :to_binary, *]: both Emily.to_binary/1 and the Nx.Backend.to_binary path. Metadata: :shape, :dtype, :byte_size.
- [:emily, :fallback, *]: every Nx.BinaryBackend fallback entry. Metadata: :op, :input_shapes, :input_dtypes.
- [:emily, :block, :fallback]: discrete event each time the backend's block callback (Nx 0.12+ Nx.Block.* dispatch) falls through to the supplied default fun.
- [:emily, :memory, :stats]: discrete event from Emily.Memory.stats/0.
Span instrumentation deliberately stops at the evaluation boundary
rather than wrapping every graph-construction call site in
Emily.Backend: those NIFs are <10μs and do no work; the evaluation
boundary is where MLX actually runs kernels.
Fallback behaviour is configured via config :emily, :fallback:
- :silent (default): only the telemetry event fires.
- :warn: one-shot Logger.warning per {op, input_shapes} pair.
- :raise: RuntimeError on fallback entry; CI-friendly.
The legacy config :emily, :warn_on_fallback, true boolean is still
honoured when :fallback is unset.
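As a configuration fragment, selecting the strict mode for a test environment might look like the following. The file placement (config/test.exs) is an assumption; the key and value come from the modes described in this section.

```elixir
# config/test.exs (placement is an assumption)
import Config

# Fail loudly in CI whenever an op falls back to Nx.BinaryBackend.
config :emily, fallback: :raise
```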
Debug assertions
Two compile-time opt-in flags re-enable runtime checks that MLX (and
every other GPU backend) skips by default. Both default false; the
guarded branches are dead-code eliminated by the Elixir compiler, so
the runtime cost when off is zero.
- :debug_bounds_check: raises on out-of-range / negative indices in gather / take / take_along_axis / indexed_add / indexed_put.
- :debug_detect_nan_inf: scans results of matmul, the fused layer_norm / rms_norm, and both fused SDPA variants for NaN/Inf.
Each check is a per-op MLX reduction plus a scalar readback — a worker sync that breaks lazy-graph fusion. Leave off in release builds.
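One common way to get a zero-cost-when-off check in Elixir is to read the flag at compile time and branch at module level, so the checking clause is never compiled when the flag is false. The sketch below shows that pattern only; `BoundsCheckSketch` and `check_index!/2` are hypothetical names, and Emily's actual guard code may be structured differently.

```elixir
defmodule BoundsCheckSketch do
  # Read once, at compile time. Defaults to false when unset.
  @debug_bounds_check Application.compile_env(:emily, :debug_bounds_check, false)

  if @debug_bounds_check do
    # Flag on: out-of-range and negative indices raise.
    def check_index!(index, size) when index < 0 or index >= size do
      raise ArgumentError, "index #{index} out of range for axis of size #{size}"
    end

    def check_index!(_index, _size), do: :ok
  else
    # Flag off: the raising clause above is never compiled into the module.
    def check_index!(_index, _size), do: :ok
  end
end

# With the flag unset, even an out-of-range index passes straight through.
IO.inspect(BoundsCheckSketch.check_index!(7, 3))
```

Because the value is fixed at compile time, changing the flag requires recompiling the module that reads it.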
Build & packaging
- Hex consumers download a precompiled NIF (libemily.{so,dylib} and mlx.metallib) from the GitHub release for the pinned version, SHA256-verified against a .sha256 sidecar fetched alongside. No C++ toolchain or cmake required.
- Contributors build from source: mix deps.get clones MLX into deps/mlx_src at the pinned tag, scripts/build-mlx.sh cmake-builds libmlx.a + mlx.metallib into $EMILY_CACHE/mlx-<version>-<variant>/, and elixir_make links the NIF against it.
- MLX variant selection is via config :emily, variant: :aot | :jit. :aot is the default and works on macOS 14+; :jit ships smaller artefacts but requires macOS 26.2+ at runtime.
- mix emily.doctor verifies the local install: host platform, active variant, native artefacts in priv/, NIF loadability, and a tiny Emily.Backend smoke test that asserts no silent fallback to Nx.BinaryBackend.
Testing philosophy
Each layer is tested against its own oracle. A failing test therefore localises the bug to the layer it exercises, which avoids cross-layer mystery bugs.
| Layer | Oracle | Harness |
|---|---|---|
| Native | Hand-computed expected values | ExUnit unit tests |
| Backend | Nx.BinaryBackend on the same inputs | StreamData property tests + Nx conformance |
| Compiler | Emily.Backend in non-defn mode | Equivalence tests (same function, two modes) |
| Grad | Nx.BinaryBackend grad + finite differences + EXLA CPU | StreamData property tests + numerical oracle + EXLA golden |
| Training | Nx.BinaryBackend loss trajectory | Curve-matching; MNIST convergence (:training_full) |
| E2E | HuggingFace Transformers reference slices | Bumblebee conformance suites with cached weights |
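The finite-differences oracle in the Grad row can be sketched in plain Elixir on scalar functions. This is a minimal illustration of the idea, not the project's test harness: `GradOracleSketch`, the step size, and the tolerances are all assumptions, and the real suite works on Nx tensors rather than floats.

```elixir
defmodule GradOracleSketch do
  # Central finite difference: (f(x + h) - f(x - h)) / 2h.
  def finite_diff(f, x, h \\ 1.0e-4), do: (f.(x + h) - f.(x - h)) / (2 * h)

  # Tolerance-aware comparison: absolute plus relative slack.
  def close?(a, b, atol \\ 1.0e-6, rtol \\ 1.0e-4) do
    abs(a - b) <= atol + rtol * abs(b)
  end
end

f = fn x -> x * x * x end          # function under test
analytic = fn x -> 3 * x * x end   # stand-in for the gradient being checked

# Each point must agree with the numerical oracle, or the match fails.
for x <- [-2.0, 0.5, 3.0] do
  true = GradOracleSketch.close?(analytic.(x), GradOracleSketch.finite_diff(f, x))
end
```

The relative term matters for low-precision dtypes, where an absolute tolerance alone would reject correct results at large magnitudes.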
Soak harnesses (all under test/soak/, @tag :soak, opt-in):
- memory_test: 10k iterations; MLX memory returns to baseline.
- training_test: 1k training steps; baseline restored after Emily.Memory.clear_cache/0.
- backend_concurrency_test / eval_concurrency_test / stream_concurrency_test: parallel inference under the default worker, the evaluation path, and per-process streams; determinism + no crashes.
- backend_soak_test: broad backend exerciser; allocator drift over a large mixed-op workload.
- quantized_memory_test: quantized-matmul loop, distinct allocator pattern from fp16 inference.
- zero_copy_roundtrip_test: to_binary aliases the MLX buffer rather than copying; tested via allocator-stats deltas.
Risks and mitigations
| Risk | Mitigation |
|---|---|
| MLX op semantics drift from Nx expectations | Property tests explicitly generate edge cases; document intentional divergences |
| Metal driver bugs in specific macOS versions | Pin known-good macOS in CI; test matrix across 14/26 |
| f16/bf16 accumulation differences from EXLA | Tolerance-aware comparisons; document expected divergence |
| Upstream Nx API changes (Nx.Backend / Nx.Defn.Compiler etc.) | Version-pin Nx; coordinate with elixir-nx maintainers |
| MLX upstream API churn on source builds | @mlx_version pin; audit on bump; mix emily.doctor surfaces toolchain mismatches |