Emily vs EMLX vs EXLA - performance comparison

Copy Markdown View Source

This report compares Emily against two Nx backend baselines:

  • EXLA: XLA host/CPU backend on Apple Silicon.
  • EMLX: older MLX-backed Nx backend on the Metal GPU.
  • Emily: local MLX/Metal backend, reported across eager, native, and fuse lanes. For conclusions, use the best Emily lane for the workload.

Harness: bench/emily_vs_exla.exs. Raw generated numbers: bench/emily_vs_exla_results.md. Large-model GPU-focused addendum: bench/qwen3_4b_emily_vs_emlx.exs.

Performance overview

The main benchmark suite runs five lanes in every tier: exla, emlx, emily-eager, emily-native, and emily-fuse.

Those lanes provide two complementary baselines:

BaselineQuestion answered
EXLA CPUIs MLX/Metal GPU faster than XLA host CPU for this workload?
EMLX GPUIs Emily's compiler/runtime faster than the older MLX-backed Nx stack?

The focused Qwen3-4B script includes an EXLA lane for explicit experiments, but its default run is GPU-only. On this 24 GB M4 Pro, Qwen3-4B bf16 on EXLA-CPU was killed by the OS during compile/run. The completed, canonical three-way Qwen comparison is therefore Qwen3-0.6B in the main suite.

Environment

Run on an Apple M4 Pro MacBook Pro with 24 GB RAM.

ComponentVersion / backend
Elixir / OTP1.19.5 / 28
Nx0.12.1
Emily0.7.0 local checkout
EMLX0.3.1, Metal GPU
EXLA0.12.0, host CPU client

Executive summary

Emily's best lane wins the main model tiers that are GPU-friendly:

TierEXLAEMLXBest EmilyBest lanevs EXLAvs EMLX
DistilBERT QA8.99 ms19.19 ms7.06 msnative1.27x faster2.72x faster
Qwen3-0.6B decode39.84 tok/s11.42 tok/s66.42 tok/sfuse1.67x faster5.82x faster
ViT-base image classification56.19 msERR23.93 msnative2.35x fastern/a
Whisper-tiny transcription88.49 msERR961.68 msnative10.9x slowern/a

The Qwen3-4B addendum shows the qualitative Emily-vs-EMLX difference clearly on the largest practical Bumblebee model for this machine:

LaneQwen3-4B tok/svs EMLX
EMLX7.331.00x
Emily eager8.031.10x
Emily native22.273.04x
Emily fuse23.463.20x

So the headline is: Emily's compiler path is the differentiator. Eager is roughly EMLX-like on Qwen3-4B; native/fuse are about 3.2x faster.

Visual summary

xychart-beta
  title "Qwen3-0.6B decode throughput"
  x-axis ["EXLA", "EMLX", "Eager", "Native", "Fuse"]
  y-axis "tokens/sec" 0 --> 70
  bar [39.84, 11.42, 12.51, 59.96, 66.42]
xychart-beta
  title "Qwen3-4B decode throughput"
  x-axis ["EMLX", "Eager", "Native", "Fuse"]
  y-axis "tokens/sec" 0 --> 25
  bar [7.33, 8.03, 22.27, 23.46]
xychart-beta
  title "Best Emily lane vs EXLA"
  x-axis ["DistilBERT", "Qwen3", "ViT", "Whisper"]
  y-axis "speedup x" 0 --> 3
  bar [1.27, 1.67, 2.35, 0.09]
xychart-beta
  title "Matmul crossover: best Emily / EXLA"
  x-axis ["128", "512", "1024", "2048"]
  y-axis "ratio (lower is faster)" 0 --> 2
  line [1.80, 1.12, 0.31, 0.20]

Tier 1 - op microbenchmarks

The op tier is still a latency/throughput crossover story. EXLA CPU wins the small launch-bound cases, while MLX/Metal wins once the tensors are large enough to amortize dispatch and kernel launch overhead.

Examples from the fresh run:

OpSizeWinnerSignal
add256EXLAbest Emily is 2.28x slower than EXLA
add4096Emilybest Emily is 2.0x faster than EXLA
exp4096Emilybest Emily is 3.2x faster than EXLA
softmax4096Emily fusebest Emily is 2.4x faster than EXLA and 1.29x faster than EMLX
matmul2048Emily/EMLX tieboth MLX lanes are about 5x faster than EXLA

Against EMLX, Emily's best op lane is usually close: sometimes a little faster, sometimes a little slower. The bigger EMLX-vs-Emily separation appears in traced model execution, especially Qwen decode, where Emily native/fuse avoid the op-by-op execution shape.

Tier 2 - DistilBERT QA

DistilBERT is a clean three-way win for Emily native:

Lanems/call
EXLA CPU8.99
EMLX GPU19.19
Emily eager16.45
Emily native7.06
Emily fuse8.50

Native is the right Emily option here. Fuse is not universally better; it helps most when a compiled body is reused, such as decode loops.

Tier 3 - Qwen3-0.6B decode

Qwen3-0.6B is the canonical completed three-way generation benchmark:

Lanetok/s
EXLA CPU39.84
EMLX GPU11.42
Emily eager12.51
Emily native59.96
Emily fuse66.42

This is the clearest main-suite result. Emily eager is near EMLX, while native/fuse jump far ahead. Fuse is the best choice for decode loops.

Tier 4 - ViT-base image classification

ViT-base strongly favors Emily native:

Lanems/call
EXLA CPU56.19
EMLX GPUERR
Emily eager38.78
Emily native23.93
Emily fuse27.50

This tier is GPU-friendly: larger matrix multiplies and enough work per forward for the GPU path to dominate. The EMLX lane did not complete in this harness, so the meaningful comparison here is Emily vs EXLA.

Tier 5 - Whisper-tiny transcription

Whisper-tiny remains Emily's bad case:

Lanems/call
EXLA CPU88.49
EMLX GPUERR
Emily eager1815.80
Emily native961.68
Emily fuse981.89

This is not a coverage win for EXLA; the Emily lanes reported zero fallbacks in the live run. It is a workload-shape problem: Whisper-tiny is made of many small kernels where CPU launch overhead and cache locality beat GPU dispatch. Native cuts eager roughly in half, but still cannot remove the underlying small-kernel cost. Fuse does not help this workload.

Qwen3-4B addendum

The focused Qwen3-4B script was rerun after the three-way reorganization. Its safe default remains GPU-only:

elixir bench/qwen3_4b_emily_vs_emlx.exs

Fresh result:

Lanemean tok/sminmaxvs EMLX
EMLX7.337.317.371.00x
Emily eager8.037.988.081.10x
Emily native22.2722.1622.323.04x
Emily fuse23.4623.3523.583.20x

An explicit EXLA smoke attempt on Qwen3-4B:

EMILY_BENCH_NEW_TOKENS=4 EMILY_BENCH_RUNS=1 EMILY_BENCH_WARMUP=0 \
  EMILY_BENCH_LANES=exla,emlx,emily-fuse \
  elixir bench/qwen3_4b_emily_vs_emlx.exs

loaded Qwen3-4B on EXLA.Backend as :bf16, then the process was killed with exit 137 during the EXLA compile/run. That is why the default 4B addendum is not used as the canonical three-way comparison on this 24 GB machine.

Recommendations

Use emily-fuse for autoregressive decode loops. It is best on Qwen3-0.6B and Qwen3-4B because the loop body is reused.

Use emily-native as the default best Emily lane for single-forward model benchmarks. It wins DistilBERT and ViT here; fuse can be neutral or slower when there is no repeated body to amortize.

Keep EXLA out of the default Qwen3-4B run on 24 GB machines. The EXLA 4B smoke was killed by the OS, while Qwen3-0.6B gives a completed three-way generation comparison.

Investigate Whisper separately with profiler traces. The model lowers, but its small-kernel shape is hostile to the current GPU path.