The current priority is numerical compatibility and predictable edge-case behavior.

Run the Python-reference comparison benchmark with:

cd reference/python
uv run python benchmark.py --mode quick

The benchmark generates deterministic sample data, times the pinned Python references, and then runs Statwise against the same data via reference/elixir/benchmark.exs. Results are reported in microseconds per operation, taking the fastest timed trial across several full repeat batches to reduce scheduler noise. Use --trials N to override the default trial count, --json-output benchmark_baseline_quick.json to refresh the tracked baseline, or --baseline benchmark_baseline_quick.json --fail-ratio 2.0 to check for regressions against that baseline.
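
As a rough illustration of the reporting and regression rules described above, the sketch below times a callable in batches, keeps the fastest batch, and flags operations that exceed a baseline by a given ratio. It is not the code in benchmark.py: the function names, the ops_per_batch parameter, the name-to-microseconds dictionary layout, and the strict greater-than comparison are all assumptions made for this example.

```python
import time


def best_us_per_op(fn, ops_per_batch=1000, trials=5):
    """Return the fastest batch time, in microseconds per operation.

    Hypothetical helper: each trial times one full batch of calls, and
    only the fastest trial is kept to damp scheduler noise.
    """
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        for _ in range(ops_per_batch):
            fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best / ops_per_batch * 1e6


def regressions(current, baseline, fail_ratio=2.0):
    """Return the names whose current time exceeds fail_ratio x baseline.

    Both arguments are assumed to map operation names to microseconds
    per operation; names missing from the baseline are ignored.
    """
    return sorted(
        name
        for name, us in current.items()
        if name in baseline and us > fail_ratio * baseline[name]
    )


if __name__ == "__main__":
    us = best_us_per_op(lambda: sum(range(100)), ops_per_batch=200, trials=3)
    print(f"sum(range(100)): {us:.3f} us/op")
    print(regressions({"mean": 5.0, "var": 1.0}, {"mean": 2.0, "var": 1.0}))
```

Taking the minimum rather than the mean is a common choice for microbenchmarks, because noise from the scheduler can only make a trial slower, never faster.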

Implemented now:

  • Descriptive statistics accept one-dimensional Nx tensors.
  • Tensor-native descriptive reductions are available with backend: :tensor; by default, tensor input is still normalized and routed through the scalar implementation, which benchmarks faster on Nx.BinaryBackend for the current workloads.
  • List-backed descriptive statistics use direct Elixir reductions to avoid building Nx tensors for scalar results.
  • Ranking and Mann-Whitney U use ordinary Elixir control flow because tie grouping and exact distribution logic are easier to audit this way.
  • Mann-Whitney U computes rank sums and tie correction from one sorted pass, and caches exact U distributions by sample-size pair.
  • T-tests use scalar formulas after one-dimensional input normalization.
  • T-tests reuse single-pass sample summaries instead of recomputing mean/variance/standard error through repeated normalization.
  • Student's t quantiles stop bisection after double-precision convergence instead of running a fixed long iteration count.
  • Dataframe-style test APIs extract columns first, then reuse the same raw sample implementations.
  • Dataframe-style test APIs support input: :tensor, including Explorer.Series.to_tensor/2 when Explorer is loaded by the caller.
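
The one-sorted-pass idea behind the Mann-Whitney U item can be sketched in Python, the language of the reference benchmark. This is an illustrative reconstruction, not Statwise's Elixir code: the function names are hypothetical, and the tie term is assumed to be the usual sum of t^3 - t over tie groups that feeds the normal-approximation variance.

```python
def rank_sum_and_ties(xs, ys):
    """Return (midrank sum of xs, sum of t**3 - t over tie groups).

    Hypothetical helper: both quantities come from a single walk over
    the pooled, sorted values.
    """
    # Tag each value with its sample (0 for xs, 1 for ys), then sort once.
    tagged = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    n = len(tagged)
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        # Find the end of the run of equal values that starts at i.
        j = i
        while j < n and tagged[j][0] == tagged[i][0]:
            j += 1
        t = j - i
        # The run occupies ranks i+1 .. j, which share their average.
        midrank = (i + 1 + j) / 2
        x_count = sum(1 for k in range(i, j) if tagged[k][1] == 0)
        rank_sum_x += midrank * x_count
        tie_term += t**3 - t
        i = j
    return rank_sum_x, tie_term


def mann_whitney_u(xs, ys):
    """Return (U statistic for xs, tie term) from the shared pass."""
    n1 = len(xs)
    r1, tie_term = rank_sum_and_ties(xs, ys)
    return r1 - n1 * (n1 + 1) / 2, tie_term


if __name__ == "__main__":
    print(mann_whitney_u([1, 2, 2], [2, 3]))
```

Because each tie group is visited exactly once, the midranks and the tie correction fall out of the same loop without a second ranking pass.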
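
For the single-pass sample summaries mentioned for t-tests, a minimal sketch is shown below. It assumes Welford's online update for the mean and variance and uses the unequal-variance (Welch) statistic as the example consumer; the source names neither, and Summary, summarize, and welch_t are hypothetical names.

```python
import math
from typing import NamedTuple


class Summary(NamedTuple):
    n: int
    mean: float
    variance: float  # sample variance, n - 1 denominator


def summarize(xs):
    """Compute n, mean, and sample variance in one pass (Welford update).

    Requires at least two observations.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Summary(n, mean, m2 / (n - 1))


def welch_t(a, b):
    """t statistic computed from two precomputed summaries."""
    se = math.sqrt(a.variance / a.n + b.variance / b.n)
    return (a.mean - b.mean) / se


if __name__ == "__main__":
    a = summarize([1.0, 2.0, 3.0, 4.0])
    b = summarize([2.0, 3.0, 4.0, 5.0])
    print(a, b, welch_t(a, b))
```

Once a summary exists, the statistic, the standard error, and any degrees-of-freedom formula can all read from it without touching the raw samples again.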
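
The convergence-based stopping rule for quantile bisection can be illustrated as follows. This is a sketch under stated assumptions, not the library's routine: the bracket, the max_iter cap, and the function names are hypothetical, and the closed-form Student's t CDF for 2 degrees of freedom is used only to keep the example self-contained (a general t CDF needs the regularized incomplete beta function).

```python
import math


def t_cdf_df2(t):
    """Student's t CDF for 2 degrees of freedom (closed form)."""
    return 0.5 + t / (2.0 * math.sqrt(2.0 + t * t))


def bisect_quantile(cdf, p, lo=-1e6, hi=1e6, max_iter=2000):
    """Invert a monotone increasing cdf by bisection.

    The loop ends once the midpoint is no longer strictly inside the
    bracket, meaning lo and hi cannot be separated further in double
    precision. max_iter is only a safety cap.
    """
    steps = 0
    mid = lo + (hi - lo) / 2
    while steps < max_iter and lo < mid < hi:
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
        mid = lo + (hi - lo) / 2
        steps += 1
    return mid, steps


if __name__ == "__main__":
    q, steps = bisect_quantile(t_cdf_df2, 0.975)
    print(q, steps)
```

A fixed long iteration count would keep evaluating the CDF after the bracket has collapsed to adjacent doubles; the midpoint test stops as soon as further halving cannot change the answer.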

Before optimizing:

  • Benchmark list input versus tensor input.
  • Benchmark dataframe column extraction overhead separately from test computation.
  • Identify hot paths with representative sample sizes.
  • Preserve fixture compatibility before and after optimization.
  • Prefer Nx.Defn only when the algorithm maps cleanly to tensor operations.
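
One way to make the fixture-compatibility item checkable before and after an optimization is a tolerance comparison against stored reference values. The sketch below is purely illustrative: the fixture layout (a flat mapping from a name to a float), the tolerances, and the rule that NaN matches NaN are assumptions, not the project's actual fixture format.

```python
import math


def fixture_mismatches(results, fixtures, rel_tol=1e-12, abs_tol=1e-15):
    """Return the fixture names whose result is missing or out of tolerance.

    Hypothetical helper: NaN is treated as matching NaN so that edge
    cases stored in fixtures stay comparable.
    """
    bad = []
    for name, expected in fixtures.items():
        actual = results.get(name)
        if actual is None:
            bad.append(name)
        elif math.isnan(expected):
            if not math.isnan(actual):
                bad.append(name)
        elif not math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol):
            bad.append(name)
    return bad


if __name__ == "__main__":
    fixtures = {"mean": 2.5, "variance_of_one_sample": float("nan")}
    results = {"mean": 2.5, "variance_of_one_sample": float("nan")}
    print(fixture_mismatches(results, fixtures))
```

Running the same comparison on both sides of a change turns "preserve compatibility" into a pass or fail signal that sits alongside the timing numbers.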

Candidate future work:

  • Batched descriptive statistics along a given axis.
  • Batched one-sample and independent t-tests.
  • Faster ranking for very large Mann-Whitney samples.
  • Faster Student's t CDF approximations for t-test p-values.
  • Optional EXLA benchmarks to determine when backend: :tensor becomes a net win.