Emily vs EMLX vs EXLA — raw benchmark results


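The main sections are auto-generated by bench/emily_vs_exla.exs; the Qwen3-4B addendum is appended from bench/qwen3_4b_emily_vs_emlx.exs. Lower is better for us/call (microseconds per call); higher is better for tokens/sec. Ratios compare the best Emily lane against the EXLA and EMLX baselines. Note that the ratio direction differs by tier: Tier 1 reports best Emily / baseline (<1 means Emily is faster), while Tiers 2-5 report a speedup (>1 means Emily is faster).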

Environment

| Field | Value |
|---|---|
| date | 2026-06-13 12:24 |
| host | Apple M4 Pro (BEAM total 116 MB at write) |
| elixir / otp | 1.19.5 / 28 |
| emily | 0.6.1 |
| emlx | 0.3.1 |
| exla | 0.12.0 (host (1 device)) |
| nx | 0.12.1 |
| smoke run? | false |

Tier 1 — op microbenchmarks (mean us/call)

Ratios are best Emily / baseline; <1 means Emily is faster.

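| op | size | exla (CPU) | emlx (GPU) | emily-eager | emily-native | emily-fuse | best-emily/exla | best-emily/emlx |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| add | 256 | 112.0 | 343.6 | 276.9 | 296.7 | 255.2 | 2.28x | 0.74x |
| add | 1024 | 258.7 | 417.4 | 423.6 | 491.8 | 486.3 | 1.64x | 1.02x |
| add | 4096 | 2972.8 | 1536.0 | 1489.3 | 1523.0 | 1545.2 | 0.5x | 0.97x |
| mul | 256 | 107.0 | 324.1 | 247.2 | 228.9 | 220.1 | 2.06x | 0.68x |
| mul | 1024 | 242.1 | 438.4 | 418.0 | 489.5 | 480.2 | 1.73x | 0.95x |
| mul | 4096 | 2764.0 | 1542.5 | 1639.6 | 1461.4 | 1478.5 | 0.53x | 0.95x |
| exp | 256 | 134.2 | 297.1 | 219.1 | 246.0 | 211.1 | 1.57x | 0.71x |
| exp | 1024 | 381.4 | 467.7 | 420.1 | 517.2 | 515.6 | 1.1x | 0.9x |
| exp | 4096 | 3634.1 | 1223.5 | 1231.6 | 1140.6 | 1234.7 | 0.31x | 0.93x |
| sum | 256 | 94.0 | 223.4 | 176.4 | 194.1 | 203.1 | 1.88x | 0.79x |
| sum | 1024 | 203.4 | 351.0 | 331.9 | 315.6 | 355.0 | 1.55x | 0.9x |
| sum | 4096 | 1882.1 | 649.2 | 776.9 | 714.3 | 817.8 | 0.38x | 1.1x |
| softmax | 256 | 164.2 | 416.8 | 283.9 | 246.8 | 295.3 | 1.5x | 0.59x |
| softmax | 1024 | 503.8 | 786.0 | 758.1 | 743.5 | 721.0 | 1.43x | 0.92x |
| softmax | 4096 | 5433.8 | 2891.4 | 2889.6 | 2771.8 | 2244.5 | 0.41x | 0.78x |
| matmul | 128 | 106.5 | 215.1 | 191.8 | 200.8 | 204.8 | 1.8x | 0.89x |
| matmul | 512 | 459.6 | 692.0 | 642.3 | 515.5 | 643.2 | 1.12x | 0.74x |
| matmul | 1024 | 2577.2 | 866.2 | 937.4 | 1028.0 | 801.9 | 0.31x | 0.93x |
| matmul | 2048 | 17864.3 | 3578.4 | 3557.2 | 3593.9 | 3582.5 | 0.2x | 0.99x |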
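The two ratio columns can be reproduced from the lane means. The following is a minimal sketch, assuming the ratio is the fastest (lowest us/call) Emily lane divided by the baseline and rounded to two decimals; the helper name `tier1_ratios` is hypothetical and is not taken from the bench script.

```python
# Hypothetical helper, not part of bench/emily_vs_exla.exs.
# Tier 1 convention: ratio = best Emily mean / baseline mean (us/call),
# so a value below 1 means Emily is faster than that baseline.
def tier1_ratios(exla_us, emlx_us, emily_lanes_us):
    best = min(emily_lanes_us.values())  # lowest latency wins
    return round(best / exla_us, 2), round(best / emlx_us, 2)

# Values copied from the add/256 and matmul/2048 rows.
add_256 = tier1_ratios(
    112.0, 343.6, {"eager": 276.9, "native": 296.7, "fuse": 255.2}
)
matmul_2048 = tier1_ratios(
    17864.3, 3578.4, {"eager": 3557.2, "native": 3593.9, "fuse": 3582.5}
)

print(add_256)      # (2.28, 0.74)
print(matmul_2048)  # (0.2, 0.99)
```

Read this way, the small sizes favour the EXLA CPU lane (ratios above 1), while at 4096 elements or 1024+ matmul dimensions the best Emily lane is several times faster than EXLA and roughly on par with EMLX.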

Tier 2 — DistilBERT QA (mean us/call)

| lane | us/call |
|---|---:|
| exla (CPU) | 8985.0 |
| emlx (GPU) | 19194.3 |
| emily-eager (GPU) | 16452.7 |
| emily-native (GPU) | 7058.7 |
| emily-fuse (GPU) | 8504.3 |

best Emily lane vs EXLA (>1 = Emily faster): 1.27x

best Emily lane vs EMLX (>1 = Emily faster): 2.72x

Tier 3 — Qwen3-0.6B decode (tokens/sec)

| lane | tok/s |
|---|---:|
| exla (CPU) | 39.84 |
| emlx (GPU) | 11.42 |
| emily-eager (GPU) | 12.51 |
| emily-native (GPU) | 59.96 |
| emily-fuse (GPU) | 66.42 |

best Emily lane vs EXLA (>1 = Emily faster): 1.67x

best Emily lane vs EMLX (>1 = Emily faster): 5.82x
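From Tier 2 onward the summary lines use a speedup convention (>1 = Emily faster), which inverts depending on whether the metric is a latency or a throughput. A minimal sketch of that arithmetic, assuming two-decimal rounding; the function name `speedup` is illustrative only and does not come from the bench script.

```python
# Hypothetical helper illustrating the ">1 = Emily faster" convention.
def speedup(best_emily, baseline, higher_is_better):
    if higher_is_better:
        # Throughput (tok/s): more is better, so Emily goes on top.
        return round(best_emily / baseline, 2)
    # Latency (us/call): less is better, so the baseline goes on top.
    return round(baseline / best_emily, 2)

# Tier 2, DistilBERT QA (us/call); best Emily lane is emily-native.
distilbert_vs_exla = speedup(7058.7, 8985.0, higher_is_better=False)
distilbert_vs_emlx = speedup(7058.7, 19194.3, higher_is_better=False)

# Tier 3, Qwen3-0.6B decode (tok/s); best Emily lane is emily-fuse.
qwen_vs_exla = speedup(66.42, 39.84, higher_is_better=True)
qwen_vs_emlx = speedup(66.42, 11.42, higher_is_better=True)

print(distilbert_vs_exla, distilbert_vs_emlx)  # 1.27 2.72
print(qwen_vs_exla, qwen_vs_emlx)              # 1.67 5.82
```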

Tier 4 — ViT-base image classification (mean us/call)

| lane | us/call |
|---|---:|
| exla (CPU) | 56191.3 |
| emlx (GPU) | ERR |
| emily-eager (GPU) | 38780.3 |
| emily-native (GPU) | 23934.3 |
| emily-fuse (GPU) | 27498.0 |

best Emily lane vs EXLA (>1 = Emily faster): 2.35x

best Emily lane vs EMLX (>1 = Emily faster): —

Tier 5 — Whisper-tiny transcription (mean us/call)

| lane | us/call |
|---|---:|
| exla (CPU) | 88488.7 |
| emlx (GPU) | ERR |
| emily-eager (GPU) | 1815796.0 |
| emily-native (GPU) | 961683.7 |
| emily-fuse (GPU) | 981888.0 |

best Emily lane vs EXLA (>1 = Emily faster): 0.09x

best Emily lane vs EMLX (>1 = Emily faster): —
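In Tiers 4 and 5 the EMLX lane reports ERR, so no EMLX comparison can be computed and the summary shows a dash instead. A small sketch of how such rows might be handled, assuming an errored lane is represented as `None`; `latency_speedup` is a hypothetical name, and the real script may represent failures differently.

```python
# Hypothetical helper: latency speedup that tolerates an errored baseline.
def latency_speedup(baseline_us, emily_lanes_us):
    if baseline_us is None:
        # Baseline lane errored (ERR in the table): no ratio to report.
        return None
    best = min(emily_lanes_us)  # lowest us/call among the Emily lanes
    return f"{baseline_us / best:.2f}x"

vit_emily = [38780.3, 23934.3, 27498.0]
whisper_emily = [1815796.0, 961683.7, 981888.0]

print(latency_speedup(56191.3, vit_emily))      # 2.35x  (ViT-base vs EXLA)
print(latency_speedup(None, vit_emily))         # None   (ViT-base vs EMLX)
print(latency_speedup(88488.7, whisper_emily))  # 0.09x  (Whisper-tiny vs EXLA)
```

The 0.09x Whisper-tiny figure means the best Emily lane takes roughly eleven times as long per call as EXLA on CPU in this run.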

Addendum — Qwen3-4B decode (GPU-focused)

Manual addendum from bench/qwen3_4b_emily_vs_emlx.exs, rerun after the main three-way suite. Higher is better. The script has an explicit EXLA lane, but the safe default for Qwen3-4B on this 24 GB M4 Pro is GPU-only.

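| lane | mean tok/s | min | max | speedup vs EMLX |
|---|---:|---:|---:|---:|
| emlx (GPU) | 7.33 | 7.31 | 7.37 | 1.00x |
| emily-eager (GPU) | 8.03 | 7.98 | 8.08 | 1.10x |
| emily-native (GPU) | 22.27 | 22.16 | 22.32 | 3.04x |
| emily-fuse (GPU) | 23.46 | 23.35 | 23.58 | 3.20x |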

Best Emily lane vs EMLX: emily-fuse at 23.46 tok/s, 3.20x faster.

Qwen3-4B EXLA note

An explicit EXLA smoke attempt was run with:

EMILY_BENCH_NEW_TOKENS=4 EMILY_BENCH_RUNS=1 EMILY_BENCH_WARMUP=0 EMILY_BENCH_LANES=exla,emlx,emily-fuse elixir bench/qwen3_4b_emily_vs_emlx.exs

It loaded Qwen/Qwen3-4B on EXLA.Backend as :bf16 in 4.86 s, then the process was killed with exit 137 during the EXLA compile/run. No Qwen3-4B EXLA throughput number is reported; bench/emily_vs_exla.exs uses Qwen3-0.6B for the canonical completed three-way generation comparison.
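For readers unfamiliar with the exit status: shells conventionally report a process terminated by signal N as 128 + N, so 137 corresponds to signal 9. The sketch below decodes that on a POSIX system using Python's standard `signal` module; the helper name is illustrative. It shows only that the process was killed with SIGKILL, not why, although memory pressure on a 24 GB machine is a plausible (unconfirmed) cause.

```python
import signal

# Hypothetical helper: map a shell-style exit status to a signal name.
def signal_from_exit_status(status):
    if status <= 128:
        # Ordinary exit code, not a signal termination.
        return None
    # Statuses above 128 encode 128 + signal number (POSIX shell convention).
    return signal.Signals(status - 128).name

print(signal_from_exit_status(137))  # SIGKILL
```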