All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.7.1 - 2026-06-13
Fixed
- Documentation no longer fails to build over autolink references to the hidden Emily.Native.async_eval/2 and Emily.Native.fast_rope_int/8 NIF stubs in the changelog; both are excluded from ex_doc autolinking.
0.7.0 - 2026-06-13
Added
- Native Expr compiler: on by default under compiler: Emily.Compiler. Lowers a traced Nx.Defn.Expr to a flat IR once and replays the whole forward graph in a single NIF call per invocation, collapsing the per-op BEAM↔worker round-trips a step-evaluated decode loop would otherwise pay. Weights cross the NIF boundary once (captured by the compiled program) and are never re-serialised per call. It is the default, so a bare compiler: Emily.Compiler compiles native:

      Nx.Defn.jit(&forward/1, compiler: Emily.Compiler).(input)

  Coverage is the full Nx primitive set (with Emily.Backend's dtype-coercion and op-composition semantics ported into the lowering), the fused Emily.Fast.* kernels (RMSNorm, LayerNorm, RoPE, scaled dot-product attention and its mask / sink / mask+sink variants), Nx.Block.* including the full LinAlg family (cholesky / solve / qr / eigh / lu / svd / determinant), Nx.Random, and the control flow cond / defn while (with the host loop driven entirely from the worker thread). Anything the IR can't lower yet routes through Nx.Defn.Evaluator under the default native_fallback: :eval (with a one-shot [:emily, :compiler, :fallback] telemetry event), so the native lane is safe as the default on any model. The default is read from config :emily, :native (defaulting to true), so config :emily, native: false opts every defn out of the native lane application-wide, e.g. on a memory-constrained host where the one-shot compile peak is too large; a per-call native: option always wins over the app-env default. native_fallback: :raise fails instead; the conformance suites use this to prove a model lowers fully native.

  End-to-end: DistilBERT (question answering with Nx.Serving), ViT, Whisper (speech_to_text end-to-end including the featurizer STFT, encoder/decoder, and autoregressive decode loop), and Bumblebee Text.generation (greedy and multinomial sampling) all compile fully native under native_fallback: :raise. Bumblebee generation on Qwen3-0.6B measures ~5× the evaluator's decode throughput (~61 vs ~12 tok/s on an M-series Mac), with byte-identical completions. Native training drives Axon end-to-end: a LeNet CNN and a dense MLP train on real MNIST entirely through the single-NIF path (forward, categorical-cross-entropy, backward, Adam) to the same >97% / >96% accuracy as the evaluator.
- Emily.Compiler: :fuse opt-in. Adds mx::compile fusion on top of the replay, fusing elementwise runs (RMSNorm, softmax, SiLU gating, residual adds) the plain replay leaves as separate kernels. For a defn while, the loop body is fused under mx::compile and cached per stream so it cache-hits across iterations rather than recompiling per step. Enable on top of the native generation path:

      Nx.Defn.jit(&forward/1, compiler: Emily.Compiler, native: true, fuse: true)

  On Qwen3-0.6B this lifts greedy decode to ~5.4× the evaluator (~1.1× over the plain native lane), ~68 vs ~62 tok/s; in isolation on a decode-shaped transformer block, fusion measures ~1.5–1.6× over the plain replay. Trade-off: mx::compile reassociates f32 to within a few ULP, so output is not bit-identical to the evaluator. Greedy argmax is robust to that empirically (Qwen3-0.6B token ids matched the evaluator exactly in our run), but the match is empirical, not guaranteed: a near-tie top-2 logit can flip a token. Sampling strategies will diverge from the evaluator under fusion even with a fixed seed.
- Emily.Generation: a model-agnostic decode-loop driver. JIT-compiles a caller-supplied shape-stable per-token forward (fn token, offset, cache, params -> {logits, cache} end) with the native single-NIF compiler and drives the autoregressive loop from Elixir: offset bookkeeping, KV-cache threading, stop conditions, next-token selection (greedy by default), and per-token streaming via :on_token. The forward runs fully native; the loop stays in Elixir, so token streaming and host-side control are preserved. Emily supplies only the mechanism; the model (forward + cache) is the caller's.
- Emily.async_eval/1 (and Emily.Native.async_eval/2) schedule evaluation of one or more lazy graphs without blocking on the GPU, wrapping mlx::core::async_eval. The work is handed to the device's command queue and the call returns as soon as it is enqueued, not when it finishes. Lets a caller keep dispatching the next step's ops while the device computes the current one (e.g. an autoregressive decode loop), blocking only when a value is actually read back on the host via to_binary/1 / eval/1. Pass every output of a step (logits plus all KV-cache buffers) in one call.
- Emily.Native.fast_rope_int/8: RoPE with an integer absolute-position offset (routing to MLX's int-offset rope overload), for incremental decode where the caller tracks position host-side. Complements the existing tensor-offset fast_rope/8. Note: feed the kernel the 4-D {batch, heads, seq, head_dim} layout; in 3-D, MLX 0.31 mis-rotates single-token (seq == 1) inputs.
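The app-wide opt-out and the per-call override described in the Native Expr compiler entry can be sketched as a configuration fragment. The config key and the native: / native_fallback: options are the ones named in this entry; the forward/1 function in the comment is a hypothetical placeholder for the caller's own defn.

```elixir
# config/config.exs
import Config

# Opt every defn out of the native lane application-wide,
# e.g. on a memory-constrained host. The default is true.
config :emily, native: false

# A per-call option still wins over this app-env default.
# forward/1 below is a placeholder, not part of Emily:
#
#   Nx.Defn.jit(&forward/1,
#     compiler: Emily.Compiler,
#     native: true,
#     native_fallback: :raise
#   ).(input)
```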
Fixed
- Dilated window reductions (window_dilations > 1) returned wrong values. window_sum / window_max / window_min / window_product with a dilated kernel silently produced garbage for windows past the first stride positions, on both the eager backend and the native compiler (they share the window-reduce core). A dilated kernel axis gets an as_strided stride > 1, so the sliding-window view aliases fewer physical elements than its logical size; MLX's strided-reduce fast path then read past the aliased buffer. The view is now materialised contiguously before the reduce when any dilation > 1 (the common non-dilated pooling path is unchanged and stays copy-free).
0.6.1 - 2026-05-31
Changed
- Documentation updated for the 0.6.x release: the README installation instructions and the example notebooks now reference {:emily, "~> 0.6"}.
0.6.0 - 2026-05-31
This release is a security-hardening pass over the native (NIF) boundary
and the build/release pipeline: direct Emily.Native calls now validate
their arguments instead of trusting Elixir-side normalization,
precompiled-NIF downloads verify against a checksum pinned in the hex
package (a trust root independent of the GitHub release), and the
per-stream worker is bounded and tears down without blocking a BEAM
scheduler. It is backward compatible, but two behaviour changes matter
for high-concurrency callers: the per-worker async queue is now bounded
(worker_queue_limit, default 8192) and rejects when full, and a stopped
or dropped worker replies {:error, :stopped} to queued callers instead
of running their work.
Added
- Emily.Stream.close/1 stops a stream's worker thread deterministically instead of waiting for garbage collection: queued operations are cancelled (their callers get a RuntimeError), the in-flight op finishes, and the OS thread is joined off the BEAM schedulers.
- config :emily, worker_queue_limit: N (default 8192) bounds the per-worker async queue, and config :emily, await_timeout: ms (default :infinity) sets an optional timeout for awaiting native results.
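As a configuration fragment, the two new keys look roughly like this. The key names and defaults come from the entry above; the values shown (1024 and 30 seconds) are arbitrary illustrations, not recommendations.

```elixir
# config/runtime.exs
import Config

config :emily,
  # Bound the per-worker async queue (default 8192); a full queue rejects new work.
  worker_queue_limit: 1024,
  # Optional timeout in milliseconds when awaiting native results (default :infinity).
  await_timeout: 30_000
```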
Security
- Worker-thread teardown no longer blocks a BEAM scheduler. The resource destructor previously drained the worker's entire queue and joined the OS thread inline, so collecting a busy stream during GC could stall a scheduler. Workers are now joined off-scheduler by a dedicated reaper (itself joined at NIF unload), and on stop the worker cancels its queued tasks, replying {:error, :stopped}, instead of running them.
- The async NIF worker queue is now bounded (worker_queue_limit, reject when full) so a flood of operations can't grow it without limit and pin host/GPU memory, and a stopped or dropped worker now replies {:error, :stopped} to every queued caller instead of leaving it blocked forever. Emily.Native.worker_queue_depth/1 exposes the depth for observability.
- The dev/CI source-build path now refuses to trust an MLX install directory it doesn't own and keeps the build cache 0700, so a shared or attacker-controlled EMILY_CACHE can't plant a libmlx.a that is then statically linked into the NIF. Fixed system tools (getconf, id, sw_vers, plus xcrun / sysctl / ps in build-mlx.sh) resolve from absolute/system paths rather than $PATH, and the MLX-build lock records the holder's process start time so a recycled PID can't be mistaken for the original holder. Build-time only; no runtime change.
- Precompiled NIF downloads are now verified against checksums pinned inside the hex package (native_checksums.txt) rather than a .sha256 sidecar fetched from the same GitHub release as the tarball. Because the package contents are covered by Hex's package hash in the consumer's mix.lock, the trust root no longer lives in the mutable release. The tarball is also extracted with :erl_tar against a strict entry allowlist (libemily.{so,dylib} + mlx.metallib), rejecting symlinks, hardlinks, .. traversal, absolute paths, and unexpected entries, which closes a path-traversal/arbitrary-write vector in the old tar -xzf extraction. New mix emily.checksums task regenerates the pinned file per release.
- Integer arguments crossing the NIF boundary are now range-checked before being narrowed from Elixir's int64 to C++ int. Previously an out-of-range axis, count, or shape entry wrapped silently (e.g. an axis of 2^32 + 3 became 3), dispatching the wrong MLX operation; and unbounded sample counts in random_split / random_categorical could drive huge allocations. Out-of-range values, and negative counts, now raise ArgumentError. Centralized as checked_int / require_count helpers applied across the reduce, shape, sort, random, index, linalg, conv, and fast NIFs.
- Native indexing and window NIFs now validate their vector arguments against the tensor rank before indexing, and reject non-positive strides, dilations, and window dimensions. Previously a direct Emily.Native call with a malformed slice_update start, a short pad/window vector, or a zero window stride could read a C++ vector out of bounds or trigger an integer divide-by-zero (SIGFPE), both of which crash the whole BEAM VM rather than raising in the caller. They now raise ArgumentError.
- Emily.Native.from_binary/3 now validates tensor shapes at the NIF boundary. Dimensions above INT32_MAX are rejected (previously they silently truncated through MLX's int32 ShapeElem), and the element and byte counts are computed with overflow checking. Without this an attacker-chosen shape whose element product wrapped (e.g. [2^21, 2^21, 2^22] → 0) could pass the binary-size check against an undersized (even empty) binary and build an array whose shape outran its allocation, an out-of-bounds read on the next eval / to_binary.
- Emily.Native.conv_general/8 now rejects a non-positive groups argument with ArgumentError instead of crashing the BEAM VM. MLX's convolution checks compute in_channels % groups, so groups <= 0 (or a large value that narrows to zero through the int64 → int conversion) was an integer modulo-by-zero, a SIGFPE that bypassed the NIF's exception path and terminated the entire node. The guard validates the un-narrowed value at the NIF boundary.
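The two integer hazards named above (silent narrowing and a wrapping element count) can be reproduced in plain Elixir. This is an illustrative sketch, not Emily's C++ helpers: the module and function names are hypothetical, a 32-bit int is assumed, and the 64-bit limit on the element count is an assumption made for the demonstration.

```elixir
defmodule NarrowingSketch do
  import Bitwise

  @int_max 0x7FFF_FFFF
  @u64_mask 0xFFFF_FFFF_FFFF_FFFF

  # What an unchecked narrowing keeps: only the low 32 bits of the value.
  def wrap32(n), do: band(n, 0xFFFF_FFFF)

  # Range-check before narrowing; out-of-range values raise instead of wrapping.
  def checked_int(n) when is_integer(n) and n >= -@int_max - 1 and n <= @int_max, do: n

  def checked_int(n) do
    raise ArgumentError, "value #{inspect(n)} does not fit in a 32-bit int"
  end

  # Element count as a wrapping 64-bit multiply would compute it.
  def wrapped_count(shape) do
    Enum.reduce(shape, 1, fn dim, acc -> band(acc * dim, @u64_mask) end)
  end

  # Overflow-checked element count. Elixir integers are arbitrary precision,
  # so the overflow is detected by comparing against the assumed 64-bit limit.
  def checked_count(shape) do
    Enum.reduce(shape, 1, fn dim, acc ->
      dim = checked_int(dim)
      if dim < 0, do: raise(ArgumentError, "negative dimension #{dim}")
      count = acc * dim
      if count > @u64_mask, do: raise(ArgumentError, "element count overflows 64 bits")
      count
    end)
  end
end
```

With these definitions, NarrowingSketch.wrap32(2^32 + 3) is 3 and NarrowingSketch.wrapped_count([2^21, 2^21, 2^22]) is 0, matching the two examples in the entries, while the checked variants raise ArgumentError on the same inputs.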
0.5.1 - 2026-05-23
Fixed
- CHANGELOG.md: corrected the 0.5.0 entry. The published release carried two ### Changed headings and listed three new-functionality items (mix emily.doctor, config :emily, fallback:, and the Emily.Memory public allocator API) under Changed rather than Added. Merged the duplicate Changed sections, moved the new-functionality items to Added, and put items into reverse chronological order. No code change.
0.5.0 - 2026-05-23
Added
- Emily.Quantization.dequantize_defn/1 now supports the nvfp4 microscaled mode in addition to affine, mxfp4, and mxfp8, so the full MLX QuantizationMode enum now runs through the defn-native dequant path. nvfp4 reuses the FP4-E2M1 lane LUT from mxfp4 and the FP8-E4M3 LUT from mxfp8 (consumed against the per-group scale bytes rather than lane codes: the NVIDIA microscaled convention uses finer-grained group_size=16 with FP8-E4M3 scales instead of mxfp4/mxfp8's group_size=32 with FP8-E8M0 scales). Output dtype is bf16 to match QuantizedWeight.to_dense/1, round-trip is bit-identical (max abs diff = 0.0). Emily.Quantization.Transform accepts mode: "nvfp4".
- Emily.Quantization.dequantize_defn/1 now supports the mxfp8 microscaled mode in addition to affine and mxfp4. Each 8-bit lane code decodes through a 256-entry FP8-E4M3 lookup table precomputed via MLX's FromFP8 bit-trick (strip sign, shift the low 7 bits left by 7 to align the E4M3 exponent into f16's exponent field, multiply by 256 for the bias difference, restore sign). Per-group scales reuse the FP8-E8M0 decode from the mxfp4 path. Output dtype is bf16 to match QuantizedWeight.to_dense/1, and the round-trip is bit-identical (max abs diff = 0.0) on realistic data. Emily.Quantization.Transform accepts mode: "mxfp8"; only nvfp4 (which uses an FP8-E4M3 per-group scale instead of FP8-E8M0) remains defn-unsupported.
- Emily.Quantization.dequantize_defn/1 now supports the mxfp4 microscaled mode in addition to affine. Each 4-bit lane code decodes through MLX's FP4-E2M1 lookup table (+0.0, +0.5, +1.0, +1.5, +2.0, +3.0, +4.0, +6.0 and their negatives); each u8 scale byte decodes through 2^(s - 127) (FP8-E8M0). Output dtype is bf16 to match QuantizedWeight.to_dense/1, and the round-trip is bit-identical (max abs diff = 0.0) on realistic scale bytes because every FP4 LUT entry and every E8M0 power-of-two is exact in bf16. Emily.Quantization.Transform gains a :mode option (default "affine", accepts "mxfp4"); mxfp8 and nvfp4 are still defn-unsupported and route through the Native NIF.
- Emily.Quantization.dequantize_defn/1 now supports int3 and int6 weights in addition to int2/int4/int8. The new path reads each lane's two adjacent u32 words as a u64, shifts by the in-word bit offset, and masks, handling the cross-u32 packing MLX uses for bit widths that don't divide 32 cleanly. defn_supported_bits/0 now returns [2, 3, 4, 6, 8]; quantized Axon graphs rewritten via Emily.Quantization.Transform (and Emily.Quantization.Layers.quantized_dense/4) pick the expanded set up automatically. Previously the defn path rejected bits ∈ {3, 6} and callers had to fall back to QuantizedWeight.to_dense/1 (the Native NIF).
- ARCHITECTURE.md: current shape of the library extracted from PLAN.md. Covers the four-layer dispatch model, the worker-thread-per-process-stream concurrency model, the public Emily.Memory allocator API, the telemetry event catalogue, the :debug_bounds_check / :debug_detect_nan_inf compile-time flags, build/packaging notes, the per-layer testing oracle table, and the active risk register. Linked from the README under a new Documentation section and grouped under "Project" in the HexDocs sidebar.
- ROADMAP.md: active and future work, separated from the historical milestone log. Lists deferred-to-post-1.0 items (typed exceptions, GPU interop pointers, source-build doctor probes) and the open in-roadmap MLX capability gaps (sparse / MoE matmuls, FP8 dtype, ThreadLocalStream).
- mix emily.doctor: diagnostic Mix task that verifies the local Emily runtime installation. Checks the host platform (OS, arch, macOS version against the active variant's minimum), the active MLX variant, priv/libemily.so and priv/mlx.metallib, NIF loadability, and a tiny Emily.Backend smoke test that asserts the result didn't silently fall back to Nx.BinaryBackend. Checks short-circuit: when a prerequisite fails, dependent checks report [skip] rather than producing cascading noise. Supports --variant aot|jit for "would this host satisfy :jit?" probes and --help for usage.
- config :emily, fallback: :silent | :warn | :raise: strict fallback modes for development and CI. :silent (the default) preserves today's behaviour; :warn emits the one-shot Logger.warning per {op, input_shapes} pair previously gated by :warn_on_fallback; :raise raises RuntimeError with op, shapes, and dtypes on entry, letting CI fail the build when a hot path unexpectedly routes through Nx.BinaryBackend. An invalid :fallback value raises ArgumentError on the first fallback so typos surface immediately.
- Emily.Memory: public allocator API for long-running serving and training workloads that need to observe and manage MLX memory without reaching into Emily.Native. Exposes stats/0 (active, peak, and cached bytes, also emitting [:emily, :memory, :stats]), reset_peak/0, and clear_cache/0. Documented under the README's Observability section and grouped with Emily.Telemetry in the ExDoc sidebar.
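The mxfp4 decode and the int3/int6 lane extraction described in the dequantize_defn entries can be sketched on plain integers. This is a hedged illustration, not the defn implementation: the module name is hypothetical, the sign is assumed to sit in bit 3 of each 4-bit code, lanes are assumed to be packed from the least significant bit upward, and the scale byte 255 is excluded as an assumption about the E8M0 encoding.

```elixir
defmodule QuantDecodeSketch do
  import Bitwise

  # FP4-E2M1 magnitudes, indexed by the low three bits of a lane code.
  @fp4_magnitudes {0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0}

  # Assumption: bit 3 of the 4-bit code is the sign bit.
  def decode_fp4(code) when code in 0..15 do
    magnitude = elem(@fp4_magnitudes, band(code, 0b0111))
    if band(code, 0b1000) == 0, do: magnitude, else: -magnitude
  end

  # FP8-E8M0 scale byte: a pure power of two, 2^(s - 127).
  def decode_e8m0(s) when s in 0..254, do: :math.pow(2.0, s - 127)

  # Dequantize one group of lane codes against its scale byte.
  def dequantize_group(codes, scale_byte) do
    scale = decode_e8m0(scale_byte)
    Enum.map(codes, fn code -> decode_fp4(code) * scale end)
  end

  # Lane extraction for widths that do not divide 32 (e.g. 3 or 6 bits):
  # view two adjacent u32 words as one u64, shift by the in-word offset, mask.
  def unpack_lane(words, lane, bits) when is_tuple(words) do
    bit_offset = lane * bits
    word = div(bit_offset, 32)
    shift = rem(bit_offset, 32)
    lo = elem(words, word)
    hi = if word + 1 < tuple_size(words), do: elem(words, word + 1), else: 0
    u64 = bor(lo, bsl(hi, 32))
    band(bsr(u64, shift), bsl(1, bits) - 1)
  end
end
```

For example, QuantDecodeSketch.dequantize_group([0b0010, 0b1111, 0b0101], 128) yields [2.0, -12.0, 6.0]: the scale byte 128 decodes to 2.0 and the codes decode to 1.0, -6.0, and 3.0.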
Changed
- PLAN.md slimmed to its milestone-history role. The current-shape sections (architecture diagram, core design decisions, testing philosophy, risks-and-mitigations) moved to ARCHITECTURE.md; goals, non-goals, and deferred-milestone summaries moved to ROADMAP.md. The M0–M27 milestone narratives, the ratified project decisions, and the 2026-04-22 MLX capability audit stay in PLAN.md as the historical record. The stale "narrow with_stream/2 + new/1 + synchronize/1 surface" reference (no synchronize/1 ever shipped) and the planned set_default_stream/1 primary deliverable (removed during the post-M14 fixes) drop out with the prologue rewrite.
- Emily.Native now annotates NIF errors with operation, input shape/dtype, options, and worker context. ArgumentError and RuntimeError raised from async ops get an Emily.Native context: op=… inputs=[…] options=[…] stream=… suffix, so common failures (shape mismatches in matmul, divisibility errors in quantize, mask shape bugs in fast_scaled_dot_product_attention, etc.) are diagnosable from the message alone. The error-formatting path is total: bad context maps degrade to ? markers rather than masking the underlying NIF error.
- The legacy config :emily, :warn_on_fallback, true boolean is soft-deprecated in favour of :fallback. It is still honoured when :fallback is unset (true → :warn); when both are set, :fallback wins.
- Emily.Telemetry.memory_stats/0 now delegates to Emily.Memory.stats/0. Behaviour is unchanged (same event, measurements, and return shape), but new code should prefer the Emily.Memory entry point.
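The precedence between the legacy :warn_on_fallback boolean and the newer :fallback key can be modelled as a small pure function. This is a sketch of the rules as stated in this release, not Emily's own code; the module name and the keyword-list input are hypothetical stand-ins for the application environment.

```elixir
defmodule FallbackModeSketch do
  @modes [:silent, :warn, :raise]

  # env is a keyword list standing in for the :emily application environment.
  def resolve(env) when is_list(env) do
    case Keyword.fetch(env, :fallback) do
      # An explicit, valid :fallback always wins over the legacy key.
      {:ok, mode} when mode in @modes ->
        mode

      # An invalid value surfaces immediately rather than being ignored.
      {:ok, other} ->
        raise ArgumentError, "invalid :fallback value: #{inspect(other)}"

      # :fallback unset: honour the legacy boolean (true maps to :warn).
      :error ->
        if Keyword.get(env, :warn_on_fallback, false) == true, do: :warn, else: :silent
    end
  end
end
```

Under these rules, resolve([]) gives :silent, resolve(warn_on_fallback: true) gives :warn, and resolve(fallback: :raise, warn_on_fallback: true) gives :raise.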
0.4.0 - 2026-05-17
Changed
- Upgraded to Nx 0.12 / Bumblebee 0.7 / Axon 0.8. Nx 0.12 replaces the optional-callback list (lu, svd, qr, cholesky, eigh, solve, take, take_along_axis, fft2, ifft2, cumulative_*, logical_not, all_close) with a single generic Nx.Backend.block/4 dispatch keyed on Nx.Block.* structs. Emily.Backend now routes every previously-native op through block/4, preserving the MLX fast paths without losing the BinaryBackend fallback when an unknown block arrives. Existing Emily.Backend consumers see no behavioural change.
- Migrated Emily.Fast.* from the now-removed Nx.Defn.Expr.optional/3 extension point to Nx.block/4. Each fused kernel (rms_norm, layer_norm, rope, rope_with_freqs, scaled_dot_product_attention with and without mask/sinks) now emits an Emily.Fast.Block.* struct that Emily.Backend.block/4 pattern-matches to the matching mx::fast::* NIF. The composed-defn fallbacks under non-Emily backends are unchanged.
- Bumblebee 0.7 ships Qwen3 first-class, so notebooks/qwen3_quantized.livemd no longer needs the main-ref Bumblebee pin from the 0.6.3 era.
Added
- Nx.rfft/2 and Nx.irfft/2 support. The underlying Native.rfftn / Native.irfftn NIFs were already in place from earlier MLX work; Nx 0.12 surfaces these as backend-block ops so Emily wires them up at no MLX-side cost.
- Smoke tests for three new Bumblebee 0.7 model families on Emily.Backend: NomicBERT (:nomic_embeddings), SmolLM3 (:smollm3), and ModernBERT (:modernbert). All three drive a tiny synthetic spec end-to-end through Axon.predict so they remain offline-friendly; tagged :conformance.
- Runnable Livebooks for each of the three new Bumblebee 0.7 families: notebooks/nomic_embeddings.livemd (NomicBERT embeddings with cosine similarity), notebooks/smollm3_chat.livemd (SmolLM3-3B chat completion with a <think> toggle for hybrid reasoning), and notebooks/modernbert_classification.livemd (ModernBERT NLI fine-tune). All three are published under the HexDocs Notebooks group.
- A [:emily, :block, :fallback] telemetry event fires whenever Emily.Backend.block/4 falls through to the supplied default fun. Surfaces ops we used to handle natively but now land on the composed-defn path, which is useful in soak runs to spot silent regressions after a Bumblebee bump.
Fixed
- mix docs no longer emits autolinker warnings for the Emily.Backend.block/4 and Nx.Defn.Expr.optional/3 references in the Emily.Fast and Emily.Fast.Block moduledocs. The references resolved to @doc false callees (the backend callback is hidden by Nx.Backend, and optional/3 was removed in Nx 0.12); the prose stays, the Mod.fun/arity shape is broken up so the autolinker no longer follows it. Same pattern as the earlier fix in ee32c7c.
Removed
- {:f8_e4m3fn, 8} (introduced in Nx 0.11) is rejected at the backend boundary with the same "no MLX primitive" ArgumentError pattern as {:f, 64}. MLX has no float-8 dtype; cast to :f16 or :bf16.
0.3.5 - 2026-05-03
0.3.4 - 2026-05-03
Fixed
- Nx.LinAlg.svd(tensor, full_matrices?: false) on rank-2 inputs no longer routes through MLX's full-matrices SVD and post-slices. MLX's SVD has no thin switch, so the old path materialised the full m × m U on device and instantly OOM'd Metal for tall matrices like the Qwen3-0.6B embedder kernel (151936 × 1024 → ~92 GB U). The thin case now computes G = MᵀM → eigh → S, V; U = MV / S (or the symmetric MMᵀ route for wide matrices), keeping the decomposition at min(m, n)². See the Emily.Backend moduledoc Divergences section for the numerical caveat (the Gram step squares M's condition number). Refs #84.
- mix docs runs cleanly. The MNIST notebook referenced Axon.Loop's trainer/2 (no such arity); three other inline references resolved to @doc false callees in upstream libraries (Nx.Defn.Expr's optional/3, Bumblebee's rms_norm/2) and triggered autolinker warnings on every doc build. The notebook now uses the correct trainer/3 arity, and the prose references have been reshaped so the autolinker no longer follows them, keeping the build warning-free for future --warnings-as-errors enforcement. Refs #83.
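The Gram-matrix route in the SVD fix can be written out as a short derivation. This is a sketch for the tall case (m ≥ n) with nonzero singular values; the 4-byte element size in the memory estimate is an assumption (f32) that happens to reproduce the ~92 GB figure quoted above.

```latex
\begin{aligned}
M &= U \Sigma V^{\top}, \qquad M \in \mathbb{R}^{m \times n},\ m \ge n \\
G &= M^{\top} M = V \Sigma^{2} V^{\top}
  && \text{(an $n \times n$ symmetric eigenproblem, solved by eigh)} \\
\sigma_i &= \sqrt{\lambda_i(G)}, \qquad S = \operatorname{diag}(\sigma_1, \dots, \sigma_n) \\
U &= M V S^{-1}
  && \text{(thin $U$, shape $m \times n$)} \\
\kappa(G) &= \kappa(M)^{2}
  && \text{(the numerical caveat: the Gram step squares the condition number)} \\[4pt]
\text{full } U &: 151936^{2} \times 4\ \text{bytes} \approx 9.2 \times 10^{10}\ \text{bytes} \approx 92\ \text{GB} \\
\text{Gram } G &: 1024^{2} \times 4\ \text{bytes} \approx 4\ \text{MB}
\end{aligned}
```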
0.3.3 - 2026-05-03
Fixed
- Emily.Compiler now silently drops options it doesn't recognise instead of raising ArgumentError. This matches the behaviour of Nx.Defn.Evaluator and EXLA, and restores compatibility with higher-level libraries that forward caller-supplied options through the JIT compiler, notably Axon.build/2, whose contract states that "all other options are forwarded to the underlying JIT compiler". Hit when running a Bumblebee-built Axon model with Axon.predict(..., global_layer_options: [output_hidden_states: true]) under Emily as the global defn compiler. Refs #81.
0.3.2 - 2026-04-25
0.3.1 - 2026-04-25
Fixed
- Precompiled NIF download no longer times out on the :peer.call/4 default 5s gen_server.call deadline. Consumers installing {:emily, "~> 0.3"} on a cold cache could see :gen_server.call timeouts while fetching the multi-MB tarball; the .sha256 sidecar fit in the window but the main asset did not. The peer RPC now runs with :infinity so httpc's own request timing drives cancellation.
0.3.0 - 2026-04-25
Changed
- Hex consumers now receive a precompiled NIF (libemily.{so,dylib} + mlx.metallib) instead of source. First mix compile downloads the matching emily-nif-<v>-<variant>-<target>.tar.gz (and its .sha256 sidecar) from the emily GitHub release for the pinned version, verifies the tarball against the published SHA256, and extracts into priv/. No cmake / Xcode / C++ toolchain is needed on the consumer side.
- In-repo / CI builds now clone MLX's source via a Mix git dep (:mlx_src) and build libmlx from source; release-mlx.yml is retired.
- Variant selection is unified under the :variant app-config key (:aot | :jit). Contributors flip variants via EMILY_MLX_VARIANT=jit (read by config/config.exs); consumers set config :emily, variant: :jit in their own config/config.exs. The old :mlx_variant key and config/local.exs override are gone.
- macOS default cache location moves from ~/Library/Caches/emily/ to DARWIN_USER_CACHE_DIR (/private/var/folders/<hash>/C/emily), the per-user sandboxed cache root Apple's own sandboxed apps use. Persistent across reboots, lives outside ~/Library/. Linux / Windows still use the XDG convention. Override via EMILY_CACHE. Existing macOS users can rm -rf ~/Library/Caches/emily/ to reclaim the orphaned data after upgrade.
- NIF object files move from the user-level cache to $(MIX_APP_PATH)/obj/ (i.e. _build/<env>/lib/emily/obj/). As a consequence, plain mix clean now correctly removes them via the existing Makefile rule; they were previously left behind because make clean didn't see the cache-dir env vars.
Added
- .github/workflows/release-nif.yml: on bare-semver tag push, builds the precompiled NIF for each (variant × target) cell and uploads tarball + .sha256 sidecar to a draft GitHub release. workflow_dispatch is also wired for out-of-band rebuilds (artefacts go to workflow storage; the release is untouched).
- mix clean.mlx: wipes the MLX install dir(s) under the cache. Plain mix clean deliberately preserves them since rebuilding MLX from source is ~5-7 minutes.
Fixed
- MLX source builds are now atomic. The build script installs into ${PREFIX}.staging and only mvs onto the final path after the artefact sanity checks pass; an EXIT trap wipes the scratch dirs on failure. Previously, an interrupted build (Ctrl-C, killed process, concurrent run) left an empty install dir that subsequent mix compile runs misread as "MLX is already installed", silently skipping the build and bombing out in elixir_make with make: *** No rule to make target '.../mlx.metallib'. The compile-time check now requires both lib/libmlx.a and lib/mlx.metallib to be present before trusting the dir.
- Concurrent invocations of build-mlx.sh against the same install prefix are now serialised via a mkdir-based lock with stale-PID reclaim. ElixirLS uses its own build path (.elixir_ls/build/...) so an LSP-driven mix compile and a CLI mix compile.emily_mlx --force lock on different Mix.Project.with_build_lock keys and freely raced into the same MLX cache dir, clobbering each other's ${PREFIX}.build/ mid-build and surfacing as clang ... Rename failed: ... No such file or directory during Metal-shader compilation.
- CMake's FetchContent sub-build of metal_cpp / json / fmt during configure runs with CMAKE_BUILD_PARALLEL_LEVEL=1, dodging a race in its download → extract → rename → stamp-touch pipeline that surfaced as getcwd: cannot access parent directories followed by cd: <dir>/_deps: No such file or directory. The main MLX build still runs at full NCPU jobs.
- The MLX scratch build dir (${PREFIX}.build) is preserved on configure failure so CMakeError.log survives for diagnostics.
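The stage-then-rename pattern behind the atomic-build fix can be illustrated in Elixir. The real logic lives in the build-mlx.sh shell script; this module is a hypothetical analogue that only mirrors the idea (build into a .staging directory, verify both artefacts, then move into place, and clean up the staging directory on failure). The build function is supplied by the caller, and the locking described in the second entry is not modelled.

```elixir
defmodule StagedInstallSketch do
  # The two artefacts the compile-time check requires before trusting a dir.
  @required ["lib/libmlx.a", "lib/mlx.metallib"]

  # A directory counts as installed only when both artefacts are present.
  def installed?(prefix) do
    Enum.all?(@required, fn rel -> File.regular?(Path.join(prefix, rel)) end)
  end

  # build_fun receives the staging dir and is expected to populate it.
  def install(prefix, build_fun) when is_function(build_fun, 1) do
    staging = prefix <> ".staging"
    File.rm_rf!(staging)
    File.mkdir_p!(staging)

    try do
      build_fun.(staging)

      unless installed?(staging) do
        raise "staged build is missing required artefacts"
      end

      # Only a verified build ever reaches the final path.
      File.rm_rf!(prefix)
      File.rename!(staging, prefix)
      :ok
    after
      # Wipes the scratch dir on failure; a no-op once the rename has succeeded.
      File.rm_rf!(staging)
    end
  end
end
```

An interrupted or failing build_fun leaves no directory at the final prefix, so a later installed?/1 check cannot mistake a half-finished build for a complete one.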
Removed
- config/local.exs override (obsoleted by the env-var plumbing).
- .github/workflows/release-mlx.yml (MLX build is folded into the NIF workflow).
- scripts/build-mlx-prebuilt.sh (superseded by in-tree scripts/build-mlx.sh).
- scripts/smoke-test-package.sh and the tagged smoke-test job in ci.yml (simulated a source-compile consumer, no longer applicable).
See MAINTAINING.md for the updated release flow.
0.2.2 - 2026-04-23
Fixed
- MLX prebuilt download now runs on a peer VM (:peer.start_link/1 with stdio connection) so it is unaffected by Mix's code-path pruning during dep compilation. Previous releases crashed in the tagged smoke-test CI lane with {:error, :nofile} / "module :public_key is not available" on clean caches, because Mix removed the :ssl / :public_key / :asn1 / :inets ebin directories from the parent VM's code path even though the apps were started. The peer node has a fresh code path, so standard httpc + public_key work without further shimming.
0.2.1 - 2026-04-22
Fixed
- mix compile crash on a cold MLX download in a clean consumer project. http_download!/2 in mix.exs called :public_key.cacerts_get/0 right after Application.ensure_all_started(:ssl). The app-start path pulled :public_key in transitively, but the module itself was not guaranteed to be loaded at call time, so the tag-triggered Hex smoke test on CI blew up with UndefinedFunctionError ... module :public_key is not available on 0.2.0. http_download! now force-loads the module via :code.ensure_loaded/1 before touching it. Any checkout with a populated ~/Library/Caches/emily/mlx-<v>-* directory skipped this path, which is why the break only surfaced in the first clean CI run.
0.2.0 - 2026-04-22
Added
- MLX prebuilt-release workflow (.github/workflows/release-mlx.yml). Manual workflow that builds libmlx.a + mlx.metallib + headers from a chosen ml-explore/mlx tag and uploads the tarball to a draft GitHub release tagged mlx-<version> on this repo. Used to produce the prebuilts that Emily's compile step downloads instead of the previous source-build path. To cut a new MLX prebuilt release:
  1. Run the workflow with build_type=no-jit on macos-14 (produces mlx-<v>-macos-arm64-aot.tar.gz).
  2. Run it again with build_type=jit on macos-26 (produces mlx-<v>-macos-arm64-jit.tar.gz).
  3. Copy the two SHA256s from the draft release's .sha256 sidecars into @mlx_checksums in mix.exs.
  4. Un-draft the release so consumers can fetch.

  The heavy lifting sits in scripts/build-mlx-prebuilt.sh, which runs standalone for local debugging: scripts/build-mlx-prebuilt.sh path/to/mlx-src 0.31.2 0.
- Emily.Fast.einsum/2: eager-only wrapper around MLX's path-optimised mx::einsum. Accepts a standard Einstein-summation string and a list of Emily.Backend-backed tensors; MLX picks the contraction order internally. Operands on any other backend raise ArgumentError with a transfer-first message. The helper is a direct-call eager helper (same pattern as Emily.Quantization.quantized_matmul/2) and is intentionally not defn-callable: a fallback via Nx.Defn.Expr's optional/3 would require a full einsum-string parser and is deferred until a user needs cross-backend composability.
Fixed
- Nx.top_k/2 on Emily tensors. The backend's top_k/3 override pattern-matched out as a single %Nx.Tensor{} and returned a single tensor, but the real Nx callback contract takes {out_values, out_indices} and returns a {values, indices} tuple. Any call to Nx.top_k raised FunctionClauseError. Dropped the override so Nx falls back to argsort(:desc) + take_along_axis + slice_along_axis, each of which routes through Emily's backend.
Changed
- MLX prebuilt download replaces the vendored source build. The
vendor/mlxsubmodule and the cmake-from-source path are gone.mix compilenow downloads a SHA256-verifiedlibmlx.a+mlx.metallib+ headers tarball for the pinned@mlx_versionfrom this repo's releases into$EMILY_CACHEand links the NIF against it directly. Consumer prerequisites drop from "Xcode + Metal toolchain + cmake + submodule checkout" to just macOS Apple Silicon. The JIT / no-JIT switch moves from theEMILY_MLX_JITenv var toconfig :emily, mlx_variant: :jit | :no_jitinconfig/config.exs(default:no_jit); variant is read viaConfig.Reader.read!at project load, so a gitignoredconfig/local.exsis the supported per-checkout override. Version bumps are a single-commit change of@mlx_version+@mlx_checksumsinmix.exs, paired with a newmlx-<version>GitHub release produced byrelease-mlx.yml. First MLX pin under the new scheme: 0.31.2. - Microscaled quantization modes on
Emily.QuantizedWeight. The container now carries a:modefield (default"affine") and accepts"mxfp4","mxfp8","nvfp4"— MLX's fullQuantizationModeenum (vendor/mlx/mlx/primitives.h:155).from_dense/2,to_dense/1, andEmily.Quantization.quantized_matmul/2all thread the mode through to MLX; mode-specific{group_size, bits}constraints are validated up front with a clear Emily error before the NIF call. Microscaled modes carry a placeholder biases tensor — MLX'sfp_quantizereturns only(wq, scales), and the Native layer substitutesnilbefore the MLX call.Emily.Quantization.dequantize_defn/1is affine-only (it's a hand-rolled nibble unpacker) and now raisesArgumentErroron non-affine modes, pointing users atto_dense/1. Smoke-tested end-to-end on Metal for all four modes (Apple Silicon, macOS 26). - SDPA attention sinks (
mx::fast::scaled_dot_product_attentionsinksparam).Emily.Fast.scaled_dot_product_attention/4andscaled_dot_product_attention_with_mask/5now accept an optional:sinkskeyword opt — a per-head tensor broadcastable to{1, heads, 1, 1}whose entries participate in the softmax denominator as extra "null destinations" (StreamingLLM). When absent the helpers emit the pre-existing optional-node, soEmily.Bumblebee.FastKernelsand direct callers stay source- and bit-compatible. The defn fallback implements the same semantics in numerically-stable form; equivalence vs. the fused kernel was measured at ~2e-7 max-abs-diff on f32. - MLX JIT build no longer patches vendored MLX. The
`patches/mlx-jit-nax-gate.patch` workaround (and the `maybe_apply_mlx_patches` plumbing in `mix.exs`) has been removed. The JIT build now requires the macOS 26.2+ SDK directly, which ships `<MetalPerformancePrimitives/MetalPerformancePrimitives.h>`; the AOT (default) build is unchanged and still works on older macOS. Upstream discussion: ml-explore/mlx#3426.
- CI matrix split across macOS versions. The
`jit=0` row stays on `macos-14` to keep AOT coverage on older macOS; the `jit=1` row now runs on `macos-26` so the Metal Performance Primitives SDK is available natively.
- Native axis reversal via
`mx::slice` with stride -1. The descending branches of `Nx.sort` and `Nx.argsort` (and `Nx.reverse`) previously built an `arange` index tensor and gathered with `take`. They now call a new `Native.flip/3` NIF that lowers to a single strided slice, saving the index allocation and gather kernel per call.
- Parallel NIF C++ build.
`elixir_make` doesn't pass `-j` by default and `mix.exs` didn't set `:make_args`, so every `.cpp` in `c_src/` compiled serially. `mix.exs` now passes `-j#{System.schedulers_online()}` through, and the vestigial `JOBS` / `MAKE_JOBS` pair in the `Makefile` (computed but never referenced) has been removed. On an 8-core M-series, a clean NIF build drops from ~19 s to ~7 s.
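To make the attention-sinks entry above concrete, here is a minimal plain-Elixir sketch of a softmax over one row of attention scores in which a single sink logit joins the denominator but yields no output weight. The module and function names are hypothetical and are not part of Emily; the fused kernel and the defn fallback work on tensors, not lists.

```elixir
defmodule SinkSoftmaxSketch do
  # Hypothetical helper, not part of Emily: softmax over one row of scores
  # where the sink logit contributes to the denominator only.
  def softmax_with_sink(scores, sink) when is_list(scores) and is_number(sink) do
    # Subtract the largest logit (sink included) for numerical stability.
    peak = Enum.max([sink | scores])
    exps = Enum.map(scores, fn s -> :math.exp(s - peak) end)
    denom = Enum.sum(exps) + :math.exp(sink - peak)
    Enum.map(exps, fn e -> e / denom end)
  end
end

weights = SinkSoftmaxSketch.softmax_with_sink([1.0, 2.0, 3.0], 0.0)
# The weights sum to less than 1; the remainder is the mass absorbed by the sink.
IO.inspect(Enum.sum(weights))
```

With a very negative sink logit the result approaches an ordinary softmax, which is one way to sanity-check a sink implementation against the sink-free path.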
0.1.2 - 2026-04-19
Fixed
- HexDocs source links.
`mix.exs`'s `source_url_pattern` prepended a `v` prefix to the version tag, but the project's release convention (via `mix publisho`) uses bare semver tags. The generated `[source]` links in HexDocs pointed at nonexistent `v<version>` tags. Dropped the prefix so links resolve to the actual tag.
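As a rough illustration of the fix, the sketch below builds an ex_doc `source_url_pattern` from a bare semver tag. The module name and repository URL are placeholders, not the project's actual values; in a real project this keyword list would live in the `docs` section of `mix.exs`.

```elixir
defmodule DocsConfigSketch do
  # Placeholder values for illustration only.
  @version "0.1.2"
  @source_url "https://github.com/example/emily"

  def docs do
    [
      source_ref: @version,
      # Bare semver tag with no "v" prefix, so [source] links hit a real tag.
      source_url_pattern: "#{@source_url}/blob/#{@version}/%{path}#L%{line}"
    ]
  end
end

IO.inspect(DocsConfigSketch.docs()[:source_url_pattern])
```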
0.1.1 - 2026-04-19
Initial release. See the git history for per-milestone detail.
Added
- Nx backend.
`Emily.Backend` implements every required `Nx.Backend` callback against MLX, with transparent fallback to `Nx.BinaryBackend` for ops without a native primitive.
- Defn compiler.
`Emily.Compiler` runs `defn` / `Nx.Serving` / Bumblebee on Emily; pins the result backend and caps partition concurrency so `Nx.Serving` stays compatible.
- Fused transformer kernels.
`Emily.Fast` exposes `mx::fast::rms_norm`, `layer_norm`, `rope`, and scaled-dot-product attention as defn-callable helpers with composed-defn fallbacks for non-Emily backends. `Emily.Bumblebee.FastKernels` rewrites a Bumblebee Axon graph to call the fused kernels in place; declared as an optional dep on `:axon` + `:bumblebee`, elides cleanly if either is absent.
- Affine group-wise quantization.
`Emily.QuantizedWeight` and `Emily.Quantization` wrap MLX `quantize` / `dequantize` / `quantized_matmul` for int2 / int4 / int8 inference. `Emily.Quantization.dequantize_defn/1` provides a defn-native dequantize for use inside Axon forward passes.
- Mixed-precision training.
`Emily.MixedPrecision` ships the bf16 recipe: `cast_params` for the forward pass, f32 master weights, dynamic loss scaling with overflow detection.
- Per-process Metal streams.
`Emily.Stream` lets each BEAM process own its own Metal command queue, enabling concurrent inference on a shared model.
- Zero-copy
`to_binary`. `Nx.to_binary/1` on an Emily tensor returns a BEAM resource binary aliasing the MLX buffer, with no memcpy.
- Native gradient + training primitives.
`gather`, `scatter`, `scatter_add`, `conv`, and the window-reduction family lower directly to MLX so `Nx.Defn.grad` and CNN training stay native.
- Native linalg.
`lu`, `svd`, `qr`, `cholesky`, `eigh`, `solve`, and `triangular_solve` dispatch to `mx::linalg::*` instead of round-tripping through `Nx.BinaryBackend`.
- Telemetry.
`[:emily, :eval, *]`, `[:emily, :to_binary, *]`, `[:emily, :fallback, *]`, and `[:emily, :memory, :stats]` span events; opt-in one-shot fallback warnings via `config :emily, :warn_on_fallback, true`.
- Compile-time debug flags.
`:debug_bounds_check` and `:debug_detect_nan_inf` re-enable runtime assertions on hot paths; default off with zero runtime cost.
- Bumblebee conformance. End-to-end suites for DistilBERT, Qwen3-0.6B (dense and quantized), ViT-base, and Whisper-tiny, pinned against HuggingFace reference values.
- Worker-thread dispatch. Each MLX stream is owned by a
dedicated OS thread. NIFs enqueue work on the worker and return
immediately; the worker posts the result back to the caller via
`enif_send`, and the public wrapper awaits it with `receive`. No BEAM scheduler (regular or dirty) blocks on MLX work, and the per-thread Metal `CommandEncoder` state stays consistent regardless of how the BEAM migrates Elixir processes between schedulers.
- Vendored MLX build. MLX is built from source via cmake from
`vendor/mlx` (git submodule); no prebuilt download. Build cache keyed on the submodule SHA under `~/Library/Caches/emily/`.
- Documentation. Per-module HexDocs, five runnable Livebooks
(`notebooks/distilbert_qa.livemd`, `notebooks/qwen3_quantized.livemd`, `notebooks/mnist_training.livemd`, `notebooks/whisper_transcription.livemd`, `notebooks/fast_kernels.livemd`), and worked Bumblebee examples in the conformance suite.
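The worker-thread dispatch entry above describes an enqueue, return immediately, then await-in-`receive` pattern. The sketch below imitates that message flow in plain Elixir, with a spawned BEAM process standing in for the dedicated OS thread and `send/2` standing in for `enif_send`. All names are hypothetical; this is not Emily's NIF code, only an illustration of why the caller never blocks a scheduler beyond its own `receive`.

```elixir
defmodule WorkerDispatchSketch do
  # Stand-in for the dedicated OS thread that owns an MLX stream; here it is
  # simply a BEAM process running a receive loop.
  def start_worker do
    spawn(fn -> loop() end)
  end

  defp loop do
    receive do
      {:run, caller, ref, fun} ->
        # Post the result back to the caller, as the real worker would via enif_send.
        send(caller, {ref, fun.()})
        loop()
    end
  end

  # Stand-in for the NIF: enqueue the job and return a reference right away.
  def enqueue(worker, fun) do
    ref = make_ref()
    send(worker, {:run, self(), ref, fun})
    ref
  end

  # Stand-in for the public wrapper: wait for the tagged reply in receive.
  def await(ref, timeout \\ 5_000) do
    receive do
      {^ref, result} -> {:ok, result}
    after
      timeout -> {:error, :timeout}
    end
  end
end

worker = WorkerDispatchSketch.start_worker()
ref = WorkerDispatchSketch.enqueue(worker, fn -> Enum.sum(1..10) end)
IO.inspect(WorkerDispatchSketch.await(ref))
```

Tagging each reply with a unique reference lets the wrapper match only its own result, so several in-flight calls from the same process cannot be confused.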