Main sections auto-generated by bench/emily_vs_exla.exs; Qwen3-4B addendum
appended from bench/qwen3_4b_emily_vs_emlx.exs. Lower is better for us/call;
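higher is better for tokens/sec. Ratios compare the best Emily lane to the EXLA
and EMLX baselines. Mind the direction: Tier 1 reports time ratios (<1 means
Emily is faster), while Tiers 2 to 5 and the addendum report speedups (>1 means
Emily is faster).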
Environment
| Field | Value |
|---|---|
| date | 2026-06-13 12:24 |
| host | Apple M4 Pro (BEAM total 116 MB at write) |
| elixir / otp | 1.19.5 / 28 |
| emily | 0.6.1 |
| emlx | 0.3.1 |
| exla | 0.12.0 (host (1 device)) |
| nx | 0.12.1 |
| smoke run? | false |
Tier 1 — op microbenchmarks (mean us/call)
Ratio columns are the fastest Emily lane's time divided by the baseline's time; <1 means Emily is faster.
| op | size | exla (CPU) | emlx (GPU) | emily-eager | emily-native | emily-fuse | best-emily/exla | best-emily/emlx |
|---|---|---|---|---|---|---|---|---|
| add | 256 | 112.0 | 343.6 | 276.9 | 296.7 | 255.2 | 2.28x | 0.74x |
| add | 1024 | 258.7 | 417.4 | 423.6 | 491.8 | 486.3 | 1.64x | 1.02x |
| add | 4096 | 2972.8 | 1536.0 | 1489.3 | 1523.0 | 1545.2 | 0.50x | 0.97x |
| mul | 256 | 107.0 | 324.1 | 247.2 | 228.9 | 220.1 | 2.06x | 0.68x |
| mul | 1024 | 242.1 | 438.4 | 418.0 | 489.5 | 480.2 | 1.73x | 0.95x |
| mul | 4096 | 2764.0 | 1542.5 | 1639.6 | 1461.4 | 1478.5 | 0.53x | 0.95x |
| exp | 256 | 134.2 | 297.1 | 219.1 | 246.0 | 211.1 | 1.57x | 0.71x |
| exp | 1024 | 381.4 | 467.7 | 420.1 | 517.2 | 515.6 | 1.10x | 0.90x |
| exp | 4096 | 3634.1 | 1223.5 | 1231.6 | 1140.6 | 1234.7 | 0.31x | 0.93x |
| sum | 256 | 94.0 | 223.4 | 176.4 | 194.1 | 203.1 | 1.88x | 0.79x |
| sum | 1024 | 203.4 | 351.0 | 331.9 | 315.6 | 355.0 | 1.55x | 0.90x |
| sum | 4096 | 1882.1 | 649.2 | 776.9 | 714.3 | 817.8 | 0.38x | 1.10x |
| softmax | 256 | 164.2 | 416.8 | 283.9 | 246.8 | 295.3 | 1.50x | 0.59x |
| softmax | 1024 | 503.8 | 786.0 | 758.1 | 743.5 | 721.0 | 1.43x | 0.92x |
| softmax | 4096 | 5433.8 | 2891.4 | 2889.6 | 2771.8 | 2244.5 | 0.41x | 0.78x |
| matmul | 128 | 106.5 | 215.1 | 191.8 | 200.8 | 204.8 | 1.80x | 0.89x |
| matmul | 512 | 459.6 | 692.0 | 642.3 | 515.5 | 643.2 | 1.12x | 0.74x |
| matmul | 1024 | 2577.2 | 866.2 | 937.4 | 1028.0 | 801.9 | 0.31x | 0.93x |
| matmul | 2048 | 17864.3 | 3578.4 | 3557.2 | 3593.9 | 3582.5 | 0.20x | 0.99x |
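
The two ratio columns can be re-derived from the lane columns. The following is a minimal Python sketch, independent of the Elixir bench scripts: the `rows` literal copies a subset of the table above, and the helper name `ratios` is hypothetical. It picks the fastest Emily lane per row, divides by each baseline, and records the smallest listed size at which Emily beats EXLA for each op.

```python
# Rows copied from the Tier 1 table: (op, size, exla, emlx, eager, native, fuse),
# all in mean us/call. Only a subset is reproduced, ordered by size per op.
rows = [
    ("add", 256, 112.0, 343.6, 276.9, 296.7, 255.2),
    ("add", 1024, 258.7, 417.4, 423.6, 491.8, 486.3),
    ("add", 4096, 2972.8, 1536.0, 1489.3, 1523.0, 1545.2),
    ("matmul", 128, 106.5, 215.1, 191.8, 200.8, 204.8),
    ("matmul", 512, 459.6, 692.0, 642.3, 515.5, 643.2),
    ("matmul", 1024, 2577.2, 866.2, 937.4, 1028.0, 801.9),
    ("matmul", 2048, 17864.3, 3578.4, 3557.2, 3593.9, 3582.5),
]


def ratios(exla, emlx, *emily_lanes):
    """Return (best-emily/exla, best-emily/emlx); <1 means Emily is faster."""
    best = min(emily_lanes)  # lower us/call is better
    return best / exla, best / emlx


crossover = {}  # op -> smallest listed size where Emily beats EXLA
for op, size, exla, emlx, *emily in rows:
    vs_exla, vs_emlx = ratios(exla, emlx, *emily)
    print(f"{op:7s} {size:5d}  {vs_exla:.2f}x  {vs_emlx:.2f}x")
    if vs_exla < 1 and op not in crossover:
        crossover[op] = size

print(crossover)
```

Because the inputs are the already-rounded table values, a recomputed ratio can differ from the generated column in the last digit. For the rows shown, the crossover against EXLA falls at 4096 for `add` and at 1024 for `matmul`.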
Tier 2 — DistilBERT QA (mean us/call)
| lane | us/call |
|---|---|
| exla (CPU) | 8985.0 |
| emlx (GPU) | 19194.3 |
| emily-eager (GPU) | 16452.7 |
| emily-native (GPU) | 7058.7 |
| emily-fuse (GPU) | 8504.3 |
best Emily lane vs EXLA (>1 = Emily faster): 1.27x
best Emily lane vs EMLX (>1 = Emily faster): 2.72x
Tier 3 — Qwen3-0.6B decode (tokens/sec)
| lane | tok/s |
|---|---|
| exla (CPU) | 39.84 |
| emlx (GPU) | 11.42 |
| emily-eager (GPU) | 12.51 |
| emily-native (GPU) | 59.96 |
| emily-fuse (GPU) | 66.42 |
best Emily lane vs EXLA (>1 = Emily faster): 1.67x
best Emily lane vs EMLX (>1 = Emily faster): 5.82x
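
From Tier 2 onward the reported figure is a speedup, so >1 means Emily is faster. For latency tiers that is baseline time over the best Emily time; for throughput tiers it is the best Emily tok/s over baseline tok/s. A minimal Python sketch using the Tier 2 and Tier 3 numbers above (the helper names are hypothetical, not taken from the bench scripts):

```python
def latency_speedup(baseline_us, emily_us):
    """Speedup for us/call lanes: lower is better, so use the minimum Emily lane."""
    return baseline_us / min(emily_us.values())


def throughput_speedup(baseline_tps, emily_tps):
    """Speedup for tok/s lanes: higher is better, so use the maximum Emily lane."""
    return max(emily_tps.values()) / baseline_tps


# Tier 2, DistilBERT QA (mean us/call)
distilbert = {"emily-eager": 16452.7, "emily-native": 7058.7, "emily-fuse": 8504.3}
# Tier 3, Qwen3-0.6B decode (tokens/sec)
qwen_06b = {"emily-eager": 12.51, "emily-native": 59.96, "emily-fuse": 66.42}

print(f"DistilBERT vs EXLA: {latency_speedup(8985.0, distilbert):.2f}x")   # 1.27x
print(f"DistilBERT vs EMLX: {latency_speedup(19194.3, distilbert):.2f}x")  # 2.72x
print(f"Qwen3-0.6B vs EXLA: {throughput_speedup(39.84, qwen_06b):.2f}x")   # 1.67x
print(f"Qwen3-0.6B vs EMLX: {throughput_speedup(11.42, qwen_06b):.2f}x")   # 5.82x
```

Keeping the two helpers separate makes the direction explicit, which matters because the winning lane differs between the two tiers (emily-native for DistilBERT, emily-fuse for Qwen3-0.6B).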
Tier 4 — ViT-base image classification (mean us/call)
| lane | us/call |
|---|---|
| exla (CPU) | 56191.3 |
| emlx (GPU) | ERR |
| emily-eager (GPU) | 38780.3 |
| emily-native (GPU) | 23934.3 |
| emily-fuse (GPU) | 27498.0 |
best Emily lane vs EXLA (>1 = Emily faster): 2.35x
best Emily lane vs EMLX (>1 = Emily faster): —
Tier 5 — Whisper-tiny transcription (mean us/call)
| lane | us/call |
|---|---|
| exla (CPU) | 88488.7 |
| emlx (GPU) | ERR |
| emily-eager (GPU) | 1815796.0 |
| emily-native (GPU) | 961683.7 |
| emily-fuse (GPU) | 981888.0 |
best Emily lane vs EXLA (>1 = Emily faster): 0.09x
best Emily lane vs EMLX (>1 = Emily faster): —
Addendum — Qwen3-4B decode (GPU-focused)
Manual addendum from bench/qwen3_4b_emily_vs_emlx.exs, rerun after the main
three-way suite. Higher tok/s is better. The script has an explicit EXLA lane, but
on this 24 GB M4 Pro the safe default for Qwen3-4B is to run the GPU lanes only
(see the Qwen3-4B EXLA note below).
| lane | mean tok/s | min | max | speedup vs EMLX |
|---|---|---|---|---|
| emlx (GPU) | 7.33 | 7.31 | 7.37 | 1.00x |
| emily-eager (GPU) | 8.03 | 7.98 | 8.08 | 1.10x |
| emily-native (GPU) | 22.27 | 22.16 | 22.32 | 3.04x |
| emily-fuse (GPU) | 23.46 | 23.35 | 23.58 | 3.20x |
Best Emily lane vs EMLX: emily-fuse at 23.46 tok/s, 3.20x faster.
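
The min and max columns give a rough sense of run-to-run variation. A minimal Python sketch, with a hypothetical `relative_spread` helper, that expresses (max - min) as a fraction of the mean and recomputes the speedup column from the table values:

```python
# (mean, min, max) tok/s per lane, copied from the addendum table.
lanes = {
    "emlx": (7.33, 7.31, 7.37),
    "emily-eager": (8.03, 7.98, 8.08),
    "emily-native": (22.27, 22.16, 22.32),
    "emily-fuse": (23.46, 23.35, 23.58),
}


def relative_spread(mean, lo, hi):
    """Range of observed runs as a fraction of the mean."""
    return (hi - lo) / mean


baseline_mean = lanes["emlx"][0]
for name, (mean, lo, hi) in lanes.items():
    spread = relative_spread(mean, lo, hi)
    speedup = mean / baseline_mean  # higher tok/s is better
    print(f"{name:13s} spread {spread:.1%}  speedup vs EMLX {speedup:.2f}x")
```

With these figures every lane's spread is under 2% of its mean, so even the 1.10x gap between emily-eager and EMLX exceeds the observed variation. The number of runs behind min and max is not stated here, so this is a sanity check and not a significance test.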
Qwen3-4B EXLA note
An explicit EXLA smoke attempt was run with:
EMILY_BENCH_NEW_TOKENS=4 EMILY_BENCH_RUNS=1 EMILY_BENCH_WARMUP=0 EMILY_BENCH_LANES=exla,emlx,emily-fuse elixir bench/qwen3_4b_emily_vs_emlx.exs
It loaded Qwen/Qwen3-4B on EXLA.Backend as :bf16 in 4.86 s, then the
process was killed with exit 137 (SIGKILL) during the EXLA compile/run. No
Qwen3-4B EXLA throughput number is therefore reported; the canonical completed
three-way generation comparison is the Qwen3-0.6B run in
bench/emily_vs_exla.exs (Tier 3).
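
Exit status 137 follows the POSIX shell convention of 128 plus the signal number. A minimal Python sketch that decodes such a status; the helper `describe_exit` is hypothetical, is not part of the bench scripts, and assumes a POSIX host:

```python
import signal


def describe_exit(status):
    """Map a shell exit status to a signal name when it encodes one (status > 128)."""
    if status > 128:
        return signal.Signals(status - 128).name
    return f"exit {status}"


print(describe_exit(137))  # SIGKILL
```

A SIGKILL during compile on a 24 GB machine is consistent with the process being killed under memory pressure, though the exit status alone does not establish the cause.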