OmnivoiceEx (omnivoice_ex v0.2.1)

Elixir wrapper for OmniVoice — a unified speech generation model from K2-FSA.

Voice Cloning · Voice Design (instruction-based) · Multilingual · 24kHz output.

Features

  • 🎤 Voice Cloning — Clone any voice from a reference audio clip
  • 🎨 Voice Design — Describe a voice in natural language and generate it
  • 🌍 Multilingual — Supports multiple languages with automatic detection
  • ⚡ Fast Inference — Optimized for GPU (CUDA/MPS) and CPU
  • 🔊 24kHz WAV Output — Studio-quality audio

Protocol

OmnivoiceEx uses MessagePack over binary-framed Erlang Ports. Audio is transmitted as WAV bytes inside msgpack — no base64 overhead.
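As a rough illustration of what "binary-framed" can mean, the sketch below prefixes each payload with its length as a 4-byte big-endian integer. This is the layout Erlang ports use when opened with the `{:packet, 4}` option; whether OmnivoiceEx uses exactly this framing is an assumption, and `FramingSketch` is a hypothetical module that is not part of the library. MessagePack has a native binary type, which is why WAV bytes can travel inside a message without base64 encoding.

```elixir
defmodule FramingSketch do
  @moduledoc "Hypothetical helper: 4-byte big-endian length-prefixed frames."

  # Prefix the payload with its size as an unsigned 32-bit big-endian integer.
  def encode(payload) when is_binary(payload) do
    <<byte_size(payload)::unsigned-big-32, payload::binary>>
  end

  # Split one complete frame off the front of a buffer.
  # Returns {:ok, payload, rest}, or :incomplete if more bytes are needed.
  def decode(<<len::unsigned-big-32, payload::binary-size(len), rest::binary>>) do
    {:ok, payload, rest}
  end

  def decode(buffer) when is_binary(buffer), do: :incomplete
end
```

In this sketch the payload would be the msgpack-encoded message; the framing layer itself never inspects it.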

Quick Start

{:ok, pid} = OmnivoiceEx.start_link(device: "cuda")
:ok = OmnivoiceEx.await_ready(pid)
{:ok, audio} = OmnivoiceEx.generate(pid, "Hello, world!")
:ok = OmnivoiceEx.save(audio, "output.wav")

Voice Design (instruction-based)

{:ok, audio} = OmnivoiceEx.generate(pid, "Welcome to our service!",
  instruct: "A warm, professional female broadcaster"
)

Voice Cloning

{:ok, audio} = OmnivoiceEx.generate(pid, "This is my voice clone!",
  ref_audio: "/path/to/reference.wav",
  ref_text: "The transcript of the reference audio"
)

Requirements

  • Python ≥ 3.10 with the omnivoice, msgpack, numpy, and soundfile pip packages
  • CUDA GPU, Apple Silicon (MPS), or CPU
  • Elixir ≥ 1.14

Installation

# mix.exs
{:omnivoice_ex, "~> 0.2.1"}

# Install Python deps
mix omnivoice_ex.setup

Summary

Functions

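await_ready(server, timeout \\ 120_000)

Waits for the model to finish loading. Returns :ok when ready.

generate(server, text, opts \\ [])

Generates speech audio from text. Returns {:ok, audio_wav}.

info(server)

Returns runtime model information.

save(audio, path)

Saves audio binary to a WAV file.

start_link(opts)

Starts an OmniVoice model server.

stop(server)

Gracefully stops the server and Python bridge.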

Types

audio()

@type audio() :: binary()

generate_opt()

@type generate_opt() ::
  {:ref_audio, String.t()}
  | {:ref_text, String.t()}
  | {:instruct, String.t()}
  | {:language, String.t()}
  | {:duration, float()}
  | {:speed, float()}
  | {:num_step, pos_integer()}
  | {:guidance_scale, float()}
  | {:seed, non_neg_integer()}
  | {:position_temperature, float()}
  | {:class_temperature, float()}

Functions

await_ready(server, timeout \\ 120_000)

@spec await_ready(GenServer.server(), timeout()) :: :ok | {:error, term()}

Waits for the model to finish loading. Returns :ok when ready.
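Model loading can take a while, so a caller may want to combine startup and the readiness wait into one step that returns an error instead of raising a match error. The following is a minimal sketch using only the documented return shapes; `BootSketch` and the five-minute timeout are illustrative choices, not part of the library.

```elixir
defmodule BootSketch do
  # Starts a server, then blocks until the model has loaded or the timeout
  # (in milliseconds) elapses. Any {:error, reason} is returned unchanged.
  def boot(opts, timeout \\ 300_000) do
    with {:ok, pid} <- OmnivoiceEx.start_link(opts),
         :ok <- OmnivoiceEx.await_ready(pid, timeout) do
      {:ok, pid}
    end
  end
end
```

A call such as `BootSketch.boot(device: "cpu")` then yields either `{:ok, pid}` or the first error encountered.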

generate(server, text, opts \\ [])

@spec generate(GenServer.server(), String.t(), [generate_opt()]) ::
  {:ok, audio()} | {:error, term()}

Generates speech audio from text. Returns {:ok, audio_wav}.

Options

  • :ref_audio — Path to reference audio for voice cloning
  • :ref_text — Transcript of reference audio (improves clone quality)
  • :instruct — Voice instruction for design (e.g. "A warm female broadcaster")
  • :language — OmniVoice language ID (e.g. "zh", "en", "ja", "ko", "yue"). Auto-detected from text if omitted. For mixed-language content, set this explicitly to avoid unstable detection.

    Common IDs: zh (Chinese), en (English), ja (Japanese), ko (Korean), yue (Cantonese), fr (French), de (German), es (Spanish), ru (Russian), pt (Portuguese), it (Italian), th (Thai), vi (Vietnamese), hi (Hindi), ar (Arabic), nl (Dutch), pl (Polish), sv (Swedish), tr (Turkish).

    Full list of 646 languages: OmniVoice docs/languages.md

  • :duration — Target duration in seconds
  • :speed — Playback speed factor
  • :num_step — Diffusion steps (higher = better quality, slower). Default: 32
  • :guidance_scale — Classifier-free guidance. Default: 2.0
  • :seed — Random seed for reproducible generation. Default: 42.
  • :position_temperature — Mask-position selection temperature. 0 = greedy (deterministic). Default: 0.0.
  • :class_temperature — Token sampling temperature. 0 = greedy (deterministic). Default: 0.0.

Examples

# Basic
{:ok, audio} = OmnivoiceEx.generate(pid, "Hello!")
:ok = OmnivoiceEx.save(audio, "out.wav")

# Voice Design
{:ok, audio} = OmnivoiceEx.generate(pid,
  "Welcome to the show!",
  instruct: "A deep, authoritative male narrator"
)

# Voice Cloning
{:ok, audio} = OmnivoiceEx.generate(pid, "Hello in my voice!",
  ref_audio: "/path/to/ref.wav",
  ref_text: "This is my reference transcript"
)

# Quality tuning
{:ok, audio} = OmnivoiceEx.generate(pid, "High quality speech.",
  num_step: 64, guidance_scale: 3.0
)
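The :language, :seed, and temperature options can be combined when repeatable output matters. The sketch below sets the language explicitly, fixes the seed, keeps both temperatures at 0 (greedy), and compares two runs. `ReproducibleSketch` is a hypothetical module and the seed value is arbitrary; bit-identical output across runs or devices is an expectation to verify, not something the documentation guarantees.

```elixir
defmodule ReproducibleSketch do
  # Documented options with illustrative values: explicit language,
  # fixed seed, and greedy (0.0) temperatures.
  @opts [
    language: "en",
    seed: 1234,
    position_temperature: 0.0,
    class_temperature: 0.0
  ]

  # Generates the same text twice and reports whether the WAV bytes match.
  def generate_twice(server, text) do
    with {:ok, first} <- OmnivoiceEx.generate(server, text, @opts),
         {:ok, second} <- OmnivoiceEx.generate(server, text, @opts) do
      {:ok, first == second}
    end
  end
end
```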

generate(server, text, opts, timeout)

@spec generate(GenServer.server(), String.t(), [generate_opt()], timeout()) ::
  {:ok, audio()} | {:error, term()}

See OmnivoiceEx.Server.generate/4.
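The four-argument form takes an explicit timeout, which presumably bounds how long the caller waits for a single generation. A minimal sketch for longer or higher-quality requests follows; `LongFormSketch` and the five-minute value are assumptions made for illustration, and only the documented `{:ok, audio}` and `{:error, reason}` results are handled.

```elixir
defmodule LongFormSketch do
  # Timeout in milliseconds for one generate call; the value is illustrative.
  @timeout 300_000

  # Generates speech with more diffusion steps, then writes it to `path`.
  def synthesize(server, text, path) do
    case OmnivoiceEx.generate(server, text, [num_step: 64], @timeout) do
      {:ok, audio} -> OmnivoiceEx.save(audio, path)
      {:error, reason} -> {:error, reason}
    end
  end
end
```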

info(server)

@spec info(GenServer.server()) :: map()

Returns runtime model information.

save(audio, path)

@spec save(audio(), Path.t()) :: :ok | {:error, term()}

Saves audio binary to a WAV file.
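Because audio() is a plain binary of WAV bytes, it can be inspected with binary pattern matching before it is saved. The sketch below reads the sample rate from the header, for example to confirm the advertised 24 kHz output. It assumes a canonical layout in which the "fmt " chunk immediately follows the RIFF/WAVE header; files with a different chunk order fall through to the error clause. `WavHeaderSketch` is a hypothetical helper, not part of OmnivoiceEx.

```elixir
defmodule WavHeaderSketch do
  # Extracts the sample rate (Hz) from a WAV binary whose "fmt " chunk
  # comes first. All multi-byte header fields are little-endian.
  def sample_rate(
        <<"RIFF", _riff_size::little-32, "WAVE", "fmt ", _fmt_size::little-32,
          _format::little-16, _channels::little-16, rate::little-32,
          _rest::binary>>
      ) do
    {:ok, rate}
  end

  # Anything that does not match the assumed layout.
  def sample_rate(_other), do: {:error, :unrecognized_header}
end
```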

start_link(opts)

Starts an OmniVoice model server.

Options

  • :model — HuggingFace model ID. Default: "k2-fsa/OmniVoice"
  • :device — "cuda", "cpu", or "mps". Default: "cuda"
  • :dtype — "float16", "float32", or "bfloat16". Default: "float16"
  • :name — Optional GenServer name
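The options above can be combined to run a named server that other processes reach without holding a pid. This sketch starts the default model on CPU with float32 and then generates through the registered name. `NamedServerSketch` and the example name `MyApp.TTS` are hypothetical; float32 is chosen here as a conservative pairing with CPU, since the source does not say whether float16 works on that device.

```elixir
defmodule NamedServerSketch do
  # Starts the default model on CPU in full precision under `name`.
  def start_cpu(name) do
    OmnivoiceEx.start_link(
      model: "k2-fsa/OmniVoice",
      device: "cpu",
      dtype: "float32",
      name: name
    )
  end

  # Waits for the named server to be ready, then generates speech.
  def say(name, text) do
    with :ok <- OmnivoiceEx.await_ready(name) do
      OmnivoiceEx.generate(name, text)
    end
  end
end
```

For example, `NamedServerSketch.start_cpu(MyApp.TTS)` followed by `NamedServerSketch.say(MyApp.TTS, "Hello!")`.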

stop(server)

@spec stop(GenServer.server()) :: :ok

Gracefully stops the server and Python bridge.