Native Elixir bindings for whisper.cpp.
A thin wrapper around the whisper-rs crate that calls whisper.cpp's
C API through a Rustler NIF. No whisper-cli subprocess, no Python, no
temporary files. Provides structured per-segment results,
:initial_prompt biasing, word-level timestamps, and CUDA / ROCm
(hipBLAS) / Metal / CPU backends.
Quickstart
{:ok, model} = WhisperCpp.load_model("models/ggml-large-v3.bin")
{:ok, %WhisperCpp.Transcription{text: text, segments: segs}} =
WhisperCpp.transcribe(model, {:pcm_f32, samples}, language: "en")
IO.puts(text)
for s <- segs, do: IO.puts("[#{s.start}-#{s.end}] #{s.text}")
Audio contract
transcribe/3 accepts exactly one input shape:
{:pcm_f32, binary()}
where binary is little-endian IEEE-754 f32 samples, mono, 16 kHz,
normalised to [-1.0, 1.0]. Decode audio file formats (WAV, MP3,
FLAC, M4A, Opus, ...) upstream with ffmpeg or similar:
ffmpeg -i input.mp3 -f f32le -ac 1 -ar 16000 - | …
Use transcribe_slice/4 to transcribe a [start_s, end_s) window of an
already-decoded master PCM buffer; the returned segment / word times
are shifted back into the original audio timeline.
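As a rough illustration of the audio contract, the sketch below prepares a conforming binary from Elixir. The module PcmSketch and its functions are hypothetical helpers, not part of WhisperCpp. decode_with_ffmpeg/1 assumes an ffmpeg executable on PATH and runs the same conversion as the command above, capturing stdout; from_s16le/1 assumes its input is already mono at 16 kHz.

```elixir
defmodule PcmSketch do
  # Hypothetical helpers, not part of WhisperCpp.
  @sample_rate 16_000

  # Runs ffmpeg (assumed to be on PATH) and returns mono 16 kHz f32le PCM
  # read from its stdout.
  def decode_with_ffmpeg(path) do
    args = ["-v", "error", "-nostdin", "-i", path, "-f", "f32le", "-ac", "1", "-ar", "16000", "-"]

    case System.cmd("ffmpeg", args) do
      {pcm, 0} -> {:ok, pcm}
      {_out, status} -> {:error, {:ffmpeg_exit, status}}
    end
  end

  # Converts signed 16-bit little-endian PCM to f32le by dividing each
  # sample by 32768, which maps it into [-1.0, 1.0).
  def from_s16le(pcm) when is_binary(pcm) and rem(byte_size(pcm), 2) == 0 do
    for <<s::little-signed-integer-size(16) <- pcm>>, into: <<>> do
      <<(s / 32768)::little-float-size(32)>>
    end
  end

  # Duration in seconds of a mono f32 buffer sampled at 16 kHz.
  def duration_s(f32) when is_binary(f32) and rem(byte_size(f32), 4) == 0 do
    div(byte_size(f32), 4) / @sample_rate
  end
end
```

The binary returned by either helper would then be wrapped as {:pcm_f32, pcm} before being passed to transcribe/3.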
Summary
Types
audio() - Audio input accepted by transcribe/3.
load_opt() - Options accepted by load_model/2.
transcribe_opt() - Options accepted by transcribe/3 / transcribe_slice/4.
Functions
available_devices/0 - Reports the runtime backends compiled into this NIF artefact.
load_model/2 - Loads a GGUF or GGML whisper.cpp model file.
transcribe/3 - Transcribes audio using model.
transcribe_slice/4 - Transcribes a [start_s, end_s) slice of samples and shifts the
returned segment/word timestamps to absolute seconds in the original
audio.
Types
@type audio() :: {:pcm_f32, binary()}
Audio input accepted by transcribe/3.
@type load_opt() :: {:device, WhisperCpp.Model.device() | :auto} | {:use_gpu, boolean()}
Options accepted by load_model/2.
@type transcribe_opt() ::
  {:language, String.t() | nil}
  | {:translate, boolean()}
  | {:initial_prompt, String.t() | nil}
  | {:word_timestamps, boolean()}
  | {:beam_size, pos_integer()}
  | {:best_of, pos_integer()}
  | {:temperature, float()}
  | {:n_threads, pos_integer()}
  | {:n_max_text_ctx, non_neg_integer()}
  | {:offset_ms, non_neg_integer()}
  | {:duration_ms, non_neg_integer()}
  | {:no_speech_thold, float()}
  | {:logprob_thold, float()}
  | {:suppress_blank, boolean()}
  | {:suppress_non_speech_tokens, boolean()}
  | {:single_segment, boolean()}
  | {:print_progress, boolean()}
  | {:abort_handle, WhisperCpp.AbortHandle.t() | nil}
  | {:progress_pid, pid() | nil}
Options accepted by transcribe/3 / transcribe_slice/4.
Functions
@spec available_devices() :: {:ok, %{backends: [atom()], gpu_supported: boolean()}} | {:error, WhisperCpp.Error.t()}
Reports the runtime backends compiled into this NIF artefact.
Returns {:ok, %{backends: [...], gpu_supported: bool}}. The
backends list reflects compile-time cargo features (e.g.
[:cpu, :cuda] on a WHISPER_CPP_VARIANT=cuda build).
Build the artefact from source with a specific backend via:
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=cuda mix compile # NVIDIA
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=hipblas mix compile # AMD ROCm
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=vulkan mix compile # cross-vendor
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=metal mix compile # Apple Silicon
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=coreml mix compile # Apple ANE
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=intel-sycl mix compile # Intel Arc/Xe
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=openblas mix compile # CPU + OpenBLAS
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=openmp mix compile # CPU + OpenMP
Pick one accelerator per build; the backend is baked into the artefact.
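Because the backend is fixed at build time, it can be useful to check at startup that the deployed artefact contains the backend you expect. The sketch below is one way to do that; BackendCheck and ensure_backend/2 are hypothetical names, and the function only inspects a value of the shape documented for available_devices/0.

```elixir
defmodule BackendCheck do
  # Hypothetical helper, not part of WhisperCpp.
  # Takes the result of available_devices/0 and the backend atom you expect.
  def ensure_backend({:ok, %{backends: backends}}, wanted) when is_atom(wanted) do
    if wanted in backends do
      :ok
    else
      {:error, {:backend_missing, wanted, backends}}
    end
  end

  # Pass through any error reported by available_devices/0 itself.
  def ensure_backend({:error, _reason} = error, _wanted), do: error
end
```

A typical call would be BackendCheck.ensure_backend(WhisperCpp.available_devices(), :cuda), made once when the application starts.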
@spec load_model(Path.t(), [load_opt()]) :: {:ok, WhisperCpp.Model.t()} | {:error, WhisperCpp.Error.t()}
Loads a GGUF or GGML whisper.cpp model file.
Pass a path to a .bin (legacy GGML) or .gguf file. Download official
weights from https://huggingface.co/ggerganov/whisper.cpp.
Options
:device - one of :cpu, :cuda, :hipblas, :vulkan, :metal, :coreml, :intel_sycl, or :auto (default). :auto picks the GPU backend when the artefact was built with one; otherwise CPU. Requesting a backend that was not compiled in returns {:error, %WhisperCpp.Error{reason: :invalid_request}}.
:use_gpu - shortcut: false forces device: :cpu. Default true.
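The :invalid_request error makes it possible to request a specific backend and fall back to CPU when it is missing. The following is a minimal sketch under stated assumptions: LoadSketch is a hypothetical module, and the loader argument stands for any 2-arity function with the same contract as load_model/2, which keeps the example independent of the NIF.

```elixir
defmodule LoadSketch do
  # Hypothetical helper, not part of WhisperCpp.
  # `loader` is assumed to behave like load_model/2, for example
  # &WhisperCpp.load_model/2.
  def load_with_cpu_fallback(loader, path, device) when is_function(loader, 2) do
    case loader.(path, device: device) do
      {:ok, model} ->
        {:ok, model, device}

      # The requested backend was not compiled in: retry once on CPU.
      {:error, %{reason: :invalid_request}} when device != :cpu ->
        with {:ok, model} <- loader.(path, device: :cpu) do
          {:ok, model, :cpu}
        end

      {:error, _reason} = error ->
        error
    end
  end
end
```

Returning the device that was actually used lets the caller log when a fallback happened. With the default device: :auto this wrapper is unnecessary, since :auto already selects CPU when no GPU backend was built.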
@spec transcribe(WhisperCpp.Model.t(), audio(), [transcribe_opt()]) :: {:ok, WhisperCpp.Transcription.t()} | {:error, WhisperCpp.Error.t()}
Transcribes audio using model.
Returns {:ok, %WhisperCpp.Transcription{}} whose :segments carry
absolute start/end times, no_speech_prob, avg_logprob, the
underlying text tokens, and (when :word_timestamps is set) per-word
timing.
Options
:language - ISO code ("en"). nil (default) auto-detects on multilingual models; auto-detect on monolingual models always returns "en".
:translate - translate to English instead of transcribing.
:initial_prompt - free-text context prepended via <|startofprev|> to bias decoding (max ~224 tokens).
:word_timestamps - attach per-word timing. Default false.
:beam_size - beam-search width. Default 5.
:best_of - greedy candidates kept when beam_size <= 1.
:temperature - sampling temperature (0.0 = greedy/beam).
:n_threads - intra-op threads. Default 4.
:n_max_text_ctx - cap decoder context tokens.
:offset_ms, :duration_ms - clip the audio window.
:no_speech_thold - silence detection threshold. Default 0.6.
:logprob_thold - reject segments with avg_logprob below this.
:suppress_blank, :suppress_non_speech_tokens - decoder suppressions.
:single_segment - force a single segment for the whole audio.
:print_progress - whisper.cpp progress to stderr.
:abort_handle - %WhisperCpp.AbortHandle{} whose abort/1 cancels in-flight inference. The call returns {:ok, partial_transcription} with whatever segments completed before the abort took effect.
:progress_pid - pid that receives {:whisper_progress, percent} messages (0..100) as work advances; duplicate percentages are coalesced.
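The :progress_pid messages can be consumed with an ordinary receive loop. The sketch below is a hypothetical helper (ProgressSketch is not part of WhisperCpp) that relies only on the documented {:whisper_progress, percent} message shape. It assumes that a final 100 is delivered on successful completion, which the text above does not guarantee, so it also gives up after a timeout.

```elixir
defmodule ProgressSketch do
  # Hypothetical helper, not part of WhisperCpp.
  # Consumes {:whisper_progress, percent} messages from the caller's mailbox
  # and passes each percentage to `on_percent`. Returns :done once 100 is
  # seen, or :timeout if no message arrives within `timeout_ms`.
  def await_progress(on_percent, timeout_ms \\ 5_000) when is_function(on_percent, 1) do
    receive do
      {:whisper_progress, 100} ->
        on_percent.(100)
        :done

      {:whisper_progress, percent} when percent in 0..99 ->
        on_percent.(percent)
        await_progress(on_percent, timeout_ms)
    after
      timeout_ms -> :timeout
    end
  end
end
```

Since transcribe/3 blocks its caller, one option is to run it inside a Task, passing the parent's pid as :progress_pid, and call await_progress/2 in the parent while the task works.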
@spec transcribe_slice(
  WhisperCpp.Model.t(),
  binary(),
  {number(), number()},
  [transcribe_opt()]
) :: {:ok, WhisperCpp.Transcription.t()} | {:error, WhisperCpp.Error.t()}
Transcribes a [start_s, end_s) slice of samples and shifts the
returned segment/word timestamps to absolute seconds in the original
audio.
Slices the f32 PCM buffer, runs whisper.cpp on the slice, and rewrites
local segment times back into the absolute timeline. Returns
{:ok, %Transcription{}} with absolute timings, or
{:error, Error.t()}. Slices shorter than 0.3 s return an empty
transcription (whisper.cpp pads short inputs and hallucinates into the
padding).
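The arithmetic behind this can be sketched in plain Elixir. The module below is an illustration of the behaviour described above, not the library's implementation: SliceSketch and its functions are hypothetical, segments are modelled as plain maps with :start and :end in seconds, and rounding window bounds to the nearest sample is an assumption.

```elixir
defmodule SliceSketch do
  # Hypothetical illustration, not part of WhisperCpp.
  @sample_rate 16_000
  @bytes_per_sample 4
  @min_slice_s 0.3

  # Cuts the [start_s, end_s) window out of a mono 16 kHz f32 buffer,
  # clamping both bounds to the end of the buffer.
  def slice(samples, {start_s, end_s})
      when is_binary(samples) and start_s >= 0 and end_s >= start_s do
    total = div(byte_size(samples), @bytes_per_sample)
    first = min(round(start_s * @sample_rate), total)
    last = min(round(end_s * @sample_rate), total)
    binary_part(samples, first * @bytes_per_sample, (last - first) * @bytes_per_sample)
  end

  # True when the window is shorter than the 0.3 s minimum mentioned above.
  def too_short?({start_s, end_s}), do: end_s - start_s < @min_slice_s

  # Shifts slice-local segment times back into the original timeline.
  def shift(segments, start_s) do
    Enum.map(segments, fn seg ->
      seg
      |> Map.update!(:start, &(&1 + start_s))
      |> Map.update!(:end, &(&1 + start_s))
    end)
  end
end
```

For example, a segment reported at 0.0 to 1.0 s inside a slice that starts at 10 s ends up at 10.0 to 11.0 s after shifting.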