0.6.1 - 2026-06-11
Changed
- Rustler 0.38.0 is now used for source builds, and the rustler_precompiled requirement now targets 0.9.0. This keeps the NIF packaging stack current; the public Elixir API is unchanged.
- Development-only Hex dependencies and compatible Rust lockfile dependencies were refreshed to their latest patch releases.
0.6.0 - 2026-06-10
Fixes every finding from the 2026-06 multi-agent Rust/STT NIF audit
(label stt-rust-audit): silent text loss, faster-whisper parity gaps
in the mel preprocessor and word alignment, and late or raising error
paths. Word timings now track the faster-whisper reference within one
encoder frame (20 ms).
Changed
- ct2rs 0.9.18 → 0.9.19, bumping the vendored CTranslate2 from 4.7.1 to 4.7.2. (#34)
- The daily security workflow audits the NIF crate's Rust dependency
tree with cargo-audit, alongside the existing
mix deps.audit. (#34)
Fixed
- Word text from :word_timestamps is decoded through the tokenizer's byte-level BPE decoder. Non-ASCII words used to come back as mojibake ("schön" surfaced as "schÃ¶n"), and codepoints split across tokens glued into one giant word; both now match faster-whisper, including per-codepoint word splitting for spaceless languages (zh, ja, th, lo, my, yue). (#19)
- The last word of every 30 s chunk ends at the alignment's EOT boundary instead of a fabricated 20 ms duration. (#23)
- Fallback segment ends (unclosed timestamp pair, or with_timestamps: false) are bounded by the chunk's real audio length; a 3 s clip no longer reports a segment ending at 30 s. (#24)
- PCM containing NaN or infinity is rejected as :invalid_request instead of silently transcribing the corrupted region as silence. Amplitudes that overflow the mel power are rejected the same way. (#20)
- The 2 GiB mel-buffer cap is enforced from input sizes before the PCM copy and the mel chunks are allocated, not after. (#21)
- WhisperCt2.available_devices/0 runs on a dirty scheduler. On CUDA builds its first call initialises the NVIDIA driver, which used to stall a normal BEAM scheduler for the whole driver init. (#22)
- WhisperCt2.load_model/2 fails at load when the tokenizer lacks <|startofprev|>, instead of degrading at inference time once :initial_prompt is used. (#31)
- The word-timestamp alignment prompt no longer carries an explicit <|notimestamps|>. CTranslate2 appends it internally, so the decoder used to see the token doubled, perturbing the cross-attention that word timings derive from, relative to faster-whisper. (#25)
- Text generated without an opening timestamp (a :prefix echo, a fine-tune opening with text, or text between lone timestamps) is kept as its own segment instead of silently discarded. (#26)
- Log-mel normalisation floors against the whole audio's maximum, as faster-whisper does, instead of per 30 s window; a near-silent window of a longer audio is no longer normalised against its own max. (#27)
- Reflect padding for audio shorter than 200 samples reads the zero-padded region like the reference instead of duplicating the last sample into the entire leading pad. (#28)
- WhisperCt2.load_model/2 validates preprocessor_config.json: a zero numeric field or a mis-shaped mel_filters matrix fails as :load_error naming the offending field, instead of an opaque :nif_panic at the first transcribe. (#29)
- Integer options that overflow the NIF's fixed-width types (u32/i32) are rejected as :invalid_request instead of raising ArgumentError at the NIF boundary. (#30)
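
For readers unfamiliar with the byte-level BPE issue behind #19, the following is a minimal sketch in Rust. It is not the crate's actual code: it assumes the GPT-2 style byte-to-character table that Whisper-family tokenizers use, and the function names byte_to_char_table and decode_pieces are hypothetical. It shows why token pieces must be mapped back to raw bytes and decoded as UTF-8 together, rather than read as text piece by piece.

```rust
use std::collections::HashMap;

/// Build the GPT-2 style table mapping each byte (0..=255) to a printable char.
/// Printable Latin-1 bytes map to themselves; all others map to U+0100 onwards.
fn byte_to_char_table() -> Vec<char> {
    let mut table = Vec::with_capacity(256);
    let mut next_unused = 256u32;
    for b in 0u32..256 {
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        if printable {
            table.push(char::from_u32(b).expect("valid scalar value"));
        } else {
            table.push(char::from_u32(next_unused).expect("valid scalar value"));
            next_unused += 1;
        }
    }
    table
}

/// Decode a run of token pieces (hypothetical helper): map every char back to
/// its byte, concatenate the bytes, then interpret the whole run as UTF-8.
fn decode_pieces(pieces: &[&str]) -> String {
    let inverse: HashMap<char, u8> = byte_to_char_table()
        .into_iter()
        .enumerate()
        .map(|(byte, ch)| (ch, byte as u8))
        .collect();
    let bytes: Vec<u8> = pieces
        .iter()
        .flat_map(|piece| piece.chars())
        // Chars outside the table are skipped in this sketch.
        .filter_map(|ch| inverse.get(&ch).copied())
        .collect();
    // Incomplete UTF-8 sequences become U+FFFD instead of panicking.
    String::from_utf8_lossy(&bytes).into_owned()
}
```

Under these assumptions, the pieces "Ġsch", "Ã¶", "n" decode to " schön", while a codepoint split across two pieces ("Ã" and "¶") only decodes correctly when both pieces are passed in the same call.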
0.5.0 - 2026-05-20
Initial public release. Native Elixir Whisper speech-to-text backed by
CTranslate2 through a Rustler NIF over ct2rs::sys::Whisper. No Python.
Features
- WhisperCt2.load_model/2 loads a CTranslate2-converted Whisper model directory and returns a %WhisperCt2.Model{} with resolved :device and :compute_type.
- WhisperCt2.transcribe/3 accepts {:pcm_f32, binary} (mono, 16 kHz, little-endian f32) and returns a %WhisperCt2.Transcription{} whose :segments carry absolute start/end times, :no_speech_prob, :avg_logprob, the underlying token IDs, and optional per-word timing.
- WhisperCt2.transcribe_batch/3 stacks every chunk of every input into one encoder forward pass - a large speedup for diarization-driven workflows with many short turns.
- :initial_prompt and :prefix bias decoding; :word_timestamps adds a batched DTW alignment pass attaching %WhisperCt2.Word{} entries; :with_timestamps toggles <|t_..|> segment timestamps for plain-text fine-tunes.
- English-only checkpoints (*.en) use the [<|startoftranscript|>] prompt; multilingual checkpoints use [sot, lang, transcribe].
- WhisperCt2.Pcm.slice/4 carves sub-windows out of an already-decoded f32 buffer with loud bounds checking.
- WhisperCt2.available_devices/0 reports CPU/CUDA device counts and the build's CUDA-support flag.
- Structured %WhisperCt2.Error{} taxonomy: :invalid_request, :load_error, :inference_error, :runtime_error, :nif_panic, :native_error.
Backends
- Precompiled NIF artefacts via rustler_precompiled for aarch64-apple-darwin (Accelerate), x86_64-unknown-linux-gnu (oneDNN, optional mkl variant), and aarch64-unknown-linux-gnu (oneDNN). CUDA is loaded lazily via cuda-dynamic on every Linux artefact, so one binary runs on CPU-only and CUDA hosts alike.
- Opt into a source build with WHISPER_CT2_BUILD=1, or pick the MKL artefact on x86_64 Linux with WHISPER_CT2_VARIANT=mkl.