FastDecimal benchmarks


Every benchmark in this directory is reproducible. Each script is self-contained — just mix run bench/<name>.exs.

The single-command experience: mix bench runs summary.exs, which produces one consolidated table covering 22 scenarios. mix bench.all runs the entire suite end-to-end.

Files

| File | What it measures | When to run |
| --- | --- | --- |
| summary.exs | Headline table: every op × {medium, large}, geometric-mean speedup. The bench you'd cite. | Quick "is FastDecimal still ahead?" check after a change |
| arithmetic.exs | Per-op times vs decimal and the Compat shim at 3 value sizes (full benchee output) | Detailed view of add/sub/mult/div/etc. with deviation |
| division.exs | div / div_int / div_rem / rem × sizes × precisions (5, 10, 28, 50) | Verifying division performance, picking a precision |
| rounding.exs | round/3 with all 7 modes × 5 input shapes | Comparing rounding-mode overhead |
| sqrt.exs | sqrt/2 at precisions 5 / 10 / 28 / 50 / 100 / 200 | Tuning sqrt precision for your use case |
| conversion.exs | to_string (all formats) + to_integer + to_float + cast from various input types | Format-conversion path tuning |
| special_values.exs | NaN/Inf op overhead vs finite arithmetic; predicate cost | Confirming the is_integer guard isn't hurting the fast path |
| batch.exs | sum/1 / product/1 over 10/100/1k/10k elements: pure Elixir vs decimal reduce | Verifying batch wins |
| realistic.exs | 5 fintech-style workloads: invoice total, tax+discount, FX conversion, aggregate stats, CSV ingestion | The "would this actually help my app?" answer |
| parse.exs | Two parse strategies (char-walk vs split+binary_to_integer) across 9 inputs | Choosing the public parse/1 implementation |
| representation.exs | %FastDecimal{} struct vs raw {coef, exp} tuple vs raw integers | Quantifying the struct wrapper cost |
| profile.exs | Time + bytes allocated + reductions per op (richer than benchee alone) | Investigating why an op is fast or slow |
| disasm.exs | Dumps BEAM bytecode for hot-path functions; counts put_map_assoc vs put_map_exact module-wide | Inspecting what BEAMAsm actually JITs from |

Run individual ones with mix run bench/<name>.exs. Pipe to tee if you want a record.

Methodology

The headline mix bench (= bench/summary.exs) uses a custom harness in bench/support.exs; a simplified sketch of one sample follows the list below:

  • 7 independent samples per scenario per implementation. Each sample runs in a fresh process (resetting BEAM state — register cache, ETS access patterns, scheduler binding) and includes 1,000 warmup iterations to let the JIT specialize the call site before measurement starts.
  • 200,000 iterations per sample, timed with :erlang.monotonic_time(:nanosecond) (true nanosecond resolution; :timer.tc is microsecond).
  • A forced :erlang.garbage_collect() before each sample to start from a clean heap.
  • Process priority set to :high to bias the macOS scheduler toward performance cores when the system isn't loaded.
  • We report the median with the p25–p75 interquartile range (IQR) per scenario. The median is the trustworthy headline number — it survives outliers from GC pauses, scheduler steals, and core swaps; the IQR bounds the run-to-run spread.
  • A row is marked stable when even FastDecimal's p75 vs Decimal's p25 (the pessimistic IQR-edge ratio) clears 2×.
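
Putting those bullets together, here is a minimal sketch of one timed sample in that style. The names (BenchSketch, sample/1) are illustrative, not the actual bench/support.exs API:

```elixir
# Illustrative sketch of one sample, following the methodology above.
# Hypothetical names and structure — not the real bench/support.exs code.
defmodule BenchSketch do
  @warmup 1_000
  @iters 200_000

  # Runs one sample of `fun` in a fresh process and returns ns/op.
  def sample(fun) do
    parent = self()

    spawn(fn ->
      Process.flag(:priority, :high)               # bias toward performance cores
      Enum.each(1..@warmup, fn _ -> fun.() end)    # let the JIT specialize the call site
      :erlang.garbage_collect()                    # start from a clean heap

      t0 = :erlang.monotonic_time(:nanosecond)
      Enum.each(1..@iters, fn _ -> fun.() end)
      t1 = :erlang.monotonic_time(:nanosecond)

      send(parent, {:ns_per_op, (t1 - t0) / @iters})
    end)

    receive do
      {:ns_per_op, ns} -> ns
    end
  end
end
```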

What we report vs what we don't

  • Speedup column shows three numbers: median (pessimistic_ratio – optimistic_ratio). The middle number is what you'd see most days; the outer numbers bound what you might see on a noisy/quiet run.
  • No mean reported. Means on sub-µs ops are dominated by 99th-percentile outliers from GC; medians are robust.
  • profile.exs shows allocation estimates approximated as heap_size_delta + (gc_count × heap_size_before_gc). This is a heuristic, not a precise measurement (a sketch of it follows this list). Use it as a relative signal between implementations, not an absolute byte count. The claim that FastDecimal produces less garbage holds across every comparison we ran, but treat the exact byte counts with a grain of salt.
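
One way that heuristic can be approximated — a hypothetical helper, not the actual profile.exs code. Heap sizes come back in words, so multiply by the word size:

```elixir
# Hypothetical sketch of the heap_size_delta + gc_count × heap_size heuristic.
# Treat the result as a relative signal, exactly as the note above warns.
defmodule AllocSketch do
  def bytes_per_op(fun, iters \\ 10_000) do
    word = :erlang.system_info(:wordsize)

    :erlang.garbage_collect()
    {:heap_size, h0} = :erlang.process_info(self(), :heap_size)
    {:garbage_collection, gc0} = :erlang.process_info(self(), :garbage_collection)

    Enum.each(1..iters, fn _ -> fun.() end)

    {:heap_size, h1} = :erlang.process_info(self(), :heap_size)
    {:garbage_collection, gc1} = :erlang.process_info(self(), :garbage_collection)

    gcs = gc1[:minor_gcs] - gc0[:minor_gcs]

    # heap growth plus an estimate of what each minor GC reclaimed
    words = h1 - h0 + gcs * h0
    words * word / iters
  end
end
```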

Reproducibility check

Across multiple consecutive runs on macOS arm64 / OTP 26 / BEAMAsm JIT, the geometric-mean speedup has landed in the range 9.67× – 10.01×. Each scenario's specific median value drifts 5–10% per run because of macOS scheduler noise and GC interactions; the ratios are stable. The headline number is best read as "approximately 10×, varies by ±3% across runs."

JIT vs non-JIT comparison

Same hardware, same Elixir, only the Erlang differs (asdf 26.0.2 with emu_flavor=jit vs 26.2.4 with emu_flavor=emu):

| Emulator | Geomean | Faster scenarios | Stable ≥2× |
| --- | --- | --- | --- |
| BEAMAsm JIT (26.0.2) | 9.87× | 21/22 | 19/22 |
| Threaded-code interpreter (26.2.4) | 7.71× | 21/22 | 18/22 |

JIT helps FastDecimal proportionally more than decimal (tighter hot paths = more inlining benefit), so the speedup ratio is bigger on JIT. But the lib still wins by ~8× on non-JIT BEAM.

Noise sources we acknowledge

  • macOS efficiency cores. Apple Silicon schedules across heterogeneous cores (P + E). Long enough measurements average over both. Short ones don't.
  • GC pauses. Some scenarios trigger a GC inside the measurement window. With 7 samples × 200k iters this washes out except for the smallest workloads.
  • JIT detection. :erlang.system_info(:emu_flavor) is the right field for this (:jit vs :emu), as shown below. :emu_type is :opt for both JIT and non-JIT builds — we confused these initially, and the early bench output incorrectly labeled the machine as "no JIT".
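
The check from that last bullet, at an iex prompt:

```elixir
:erlang.system_info(:emu_flavor)  # :jit on BEAMAsm builds, :emu on the interpreter
:erlang.system_info(:emu_type)    # :opt on both — can't distinguish JIT from non-JIT
```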

Reading benchee output (other bench files)

Benchee reports five numbers per scenario:

  • ips — iterations per second (higher = faster)
  • average — mean ns/op
  • deviation — relative std-dev as %
  • median — middle value, the most trustworthy single number on noisy hardware
  • 99th % — 99th-percentile latency

For sub-µs operations on macOS, trust the median, not the mean. Deviation in the thousands of percent is the BEAM/Mac scheduler interaction, not the code under test.

profile.exs adds three more numbers per op:

  • bytes/op — heap bytes allocated (heuristic — see Methodology above)
  • reductions/op — BEAM's virtual instruction counter. A useful "how much work" proxy; lower = less work
  • GCs — minor garbage collections triggered in the measurement loop

Design decisions per op (with evidence)

Every choice below was made because we measured. Where the obvious "fast" technique lost, we kept the slower-looking code that actually won.

add / sub / mult

  • Public API: struct-based (%FastDecimal{coef, exp}).
  • Measured: a tuple {coef, exp} is 5–9 % faster on mult and at parity on add (representation.exs). The cost is small enough that we pay it for the Inspect protocol and pattern-matching ergonomics, and the hot paths still stay heap-light (a shape sketch follows this list).
  • No explicit small-int guards. BEAM's JIT already specializes integer math based on whether values fit in the 60-bit immediate range. Tested an explicit when c in -2^60..2^60 guard — added ~3 % overhead with no measurable upside.
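
For reference, a rough shape of a struct-based add under the {coef, exp} representation above — a sketch, not the library's actual source, with no NaN/Inf handling:

```elixir
# Minimal sketch of struct-based decimal addition: align exponents to the
# smaller one, then add coefficients. Hypothetical module for illustration.
defmodule FastDecimalSketch do
  defstruct coef: 0, exp: 0

  def add(%__MODULE__{coef: c1, exp: e1}, %__MODULE__{coef: c2, exp: e2}) do
    cond do
      e1 == e2 -> %__MODULE__{coef: c1 + c2, exp: e1}
      e1 > e2  -> %__MODULE__{coef: c1 * pow10(e1 - e2) + c2, exp: e2}
      true     -> %__MODULE__{coef: c1 + c2 * pow10(e2 - e1), exp: e1}
    end
  end

  defp pow10(n), do: Integer.pow(10, n)
end
```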

div

  • Pure Elixir: integer shift-and-divide with explicit rounding modes (a minimal sketch follows this list).
  • Allocates more than decimal (~91 B vs ~8 B) because we materialize the shifted dividend as an intermediate bignum. We accept that — wall time is 5–9× faster than decimal (profile.exs).
  • A Rust NIF wrapping a native decimal library is faster on div at precision 28 (~2.5× our pure-Elixir path, measured during prototyping). Not adopted because:
    1. It would force a binary dependency on every user
    2. It only wins on this one op
    3. Pure Elixir is still 5× faster than decimal here
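
Back to the pure-Elixir path: the shift-and-divide idea in miniature, restricted to positive operands and round-half-up. The real code handles signs and every rounding mode:

```elixir
# Sketch only: shift the dividend by `precision` extra decimal digits
# (this is the intermediate bignum mentioned above), divide, round, adjust exp.
defmodule DivSketch do
  def divide(%{coef: c1, exp: e1}, %{coef: c2, exp: e2}, precision)
      when c1 >= 0 and c2 > 0 do
    shifted = c1 * Integer.pow(10, precision)
    q = div(shifted, c2)
    r = rem(shifted, c2)
    q = if 2 * r >= c2, do: q + 1, else: q      # round half up
    %{coef: q, exp: e1 - e2 - precision}
  end
end
```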

compare

  • At the BEAM floor: 4 reductions, 0 bytes allocated (profile.exs).
  • No optimization possible — it's pattern match + one integer compare + return an atom.

parse

  • Public API uses the char-by-char walker (Parser.parse_walk/1); the gist of the strategy is sketched after this list.
  • Measured: walker is 1.4–3× faster than the split-and-binary_to_integer approach on every input shorter than ~25 digits (parse.exs).
  • Split wins ONLY on 30+ digit pure-integer strings, where :erlang.binary_to_integer/1's native C loop overtakes BEAM's pattern dispatch.
  • Allocates ~186 B per call (sub-binary refs from the recursive head/tail pattern). We accept that — wall time is 3–5× faster than decimal, and decimal allocates 3× more anyway.
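
The gist of the char-walk strategy, for orientation — digits and a single decimal point only, no sign/exponent/error handling, and not the actual Parser.parse_walk/1 code:

```elixir
defmodule ParseWalkSketch do
  # Each byte updates coef/exp directly instead of splitting the binary
  # and calling :erlang.binary_to_integer/1 on the pieces.
  def walk(<<d, rest::binary>>, coef, exp, :int) when d in ?0..?9,
    do: walk(rest, coef * 10 + (d - ?0), exp, :int)

  def walk(<<?., rest::binary>>, coef, exp, :int),
    do: walk(rest, coef, exp, :frac)

  def walk(<<d, rest::binary>>, coef, exp, :frac) when d in ?0..?9,
    do: walk(rest, coef * 10 + (d - ?0), exp - 1, :frac)

  def walk(<<>>, coef, exp, _state), do: {coef, exp}
end

# ParseWalkSketch.walk("12.34", 0, 0, :int)  #=> {1234, -2}
```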

to_string

  • Public API uses IO.iodata_to_binary([sign, int_part, ?., frac_part]) (shown in miniature after this list).
  • Measured: a bit-syntax <<sign::binary, int_part::binary, ?., frac_part::binary>> alternative was 20 % slower despite allocating less. iodata_to_binary is a NIF that pre-computes total size and allocates once; that beats stepwise binary appends.
  • Allocates ~266 B (7× more than decimal's ~36 B) for parity wall time. Tried direct bit-syntax to reduce alloc; that path lost on time. Picking time > alloc here.
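
In miniature, the iodata approach described above (positive value, illustrative parts):

```elixir
sign = ""
int_part = "12"
frac_part = "34"
# Builds the result in a single sized allocation, per the note above.
IO.iodata_to_binary([sign, int_part, ?., frac_part])  #=> "12.34"
```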

Map construction op: put_map_assoc vs put_map_exact

  • Kept the %__MODULE__{coef: ..., exp: ...} literal-struct form for all hot paths.
  • Tried: rewriting to %{a | coef: c1 + c2} (Elixir's "strict update") so the BEAM compiler emits put_map_exact (updates an existing map's keys) instead of put_map_assoc (allocates a fresh map). The bytecode change happened cleanly — module-wide counts flipped from 20:1 put_map_assoc:put_map_exact to 7:14, verifiable in bench/disasm.exs.

  • Measured: wall time unchanged (medians of 42 ns both before and after), and the head-to-head benchee comparison actually showed put_map_exact 1 % slower, within noise (both forms are shown side by side after this list).
  • Conclusion: BEAMAsm has equally tight implementations of both ops for tiny structs. The literal-struct form is also more readable — every field value is right there at the call site — so the code stays.
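
For reference, the two forms side by side in a self-contained sketch — a hypothetical module whose fields match the struct described earlier; both clauses assume equal exponents for brevity:

```elixir
defmodule MapFormSketch do
  defstruct coef: 0, exp: 0

  # Literal-struct form — the put_map_assoc path per the disasm.exs counts
  # above. This is the form the hot paths keep.
  def add_literal(%__MODULE__{coef: c1, exp: e}, %__MODULE__{coef: c2, exp: e}) do
    %__MODULE__{coef: c1 + c2, exp: e}
  end

  # Strict-update form — the put_map_exact path that was tried and dropped
  # (no wall-time win, less readable at the call site).
  def add_update(%__MODULE__{coef: c1, exp: e} = a, %__MODULE__{coef: c2, exp: e}) do
    %{a | coef: c1 + c2}
  end
end
```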

sum/1 / product/1

  • Pure-Elixir reduce via tail-recursive helpers (sketched after this list).
  • Measured during prototyping: a NIF that received the entire list across the FFI boundary was 6–7× SLOWER at every list size tested (10 / 100 / 1k / 10k elements).
  • Why: Rustler's Vec<Tuple> decode marshals each list element across BEAM/Rust (~50–100 ns each). At BEAM's ~11 ns per pure-Elixir add, the marshalling cost can't be recovered.
  • Pure-Elixir batch is 26–30× faster than decimal reduce — and crucially, no Rust toolchain needed at install time. The NIF prototype was removed before v1.0; this lesson is preserved.
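
A minimal shape of that pure-Elixir batch reduce — same-exponent inputs only here; the real sum/1 aligns exponents:

```elixir
defmodule BatchSketch do
  # Tail-recursive accumulation over the coefficients; one result map is
  # built at the very end rather than one per element.
  def sum([%{coef: c, exp: e} | rest]), do: do_sum(rest, c, e)

  defp do_sum([], acc, exp), do: %{coef: acc, exp: exp}
  defp do_sum([%{coef: c, exp: exp} | rest], acc, exp), do: do_sum(rest, acc + c, exp)
end
```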

Reproducibility notes

  • Hardware matters. All numbers in the project README are from an M-series Mac on Erlang/OTP 26 with BEAMAsm enabled. Linux on x86 will differ — typically lower absolute numbers but similar ratios.
  • GC noise. If you see a single scenario with 99th-%ile 100× the median, that's a GC cycle landing inside the measurement window. Re-run to confirm the median.
  • Warmup matters. Each Benchee scenario uses warmup: 0.5 to let the JIT specialize and load caches, and time: 2 for the actual measurement window. Don't shorten these — short runs amplify noise. (A config sketch follows this list.)
  • Benchee vs profile.exs. profile.exs uses a tight :timer.tc loop without Benchee's per-iteration overhead, so its ns/op numbers are typically lower than arithmetic.exs's. Both are correct, measuring slightly different things.
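
For reference, the shape of a Benchee scenario with those settings. FastDecimal.new/1 and FastDecimal.add/2 are assumed here purely for illustration; check the library's actual API:

```elixir
a  = FastDecimal.new("123.45")   # constructor name assumed for this example
b  = FastDecimal.new("67.89")
da = Decimal.new("123.45")
db = Decimal.new("67.89")

Benchee.run(
  %{
    "FastDecimal.add" => fn -> FastDecimal.add(a, b) end,
    "Decimal.add"     => fn -> Decimal.add(da, db) end
  },
  warmup: 0.5,  # seconds of JIT specialization and cache loading
  time: 2       # seconds of measurement
)
```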

When you find a regression

If a change makes any op slower, the contract is: include the bench output in the PR description showing it. If a change makes an op faster, same — show the before/after numbers. The point of having all the benches in-tree is so this conversation is short.