LLM management for Candil.
Provides a unified interface for running, managing and querying large language
models — whether they run locally via llama.cpp or remotely via an API
provider such as OpenAI, Anthropic or Ollama.
Concepts
Engine
An engine is a local llama-server binary that serves one model at a time
over an OpenAI-compatible HTTP API. You can use a pre-existing binary on the
machine or let Candil download the official precompiled release from the
llama.cpp releases page.
Provider
A provider is a remote HTTP API (OpenAI, Anthropic, Ollama, or any OpenAI-compatible endpoint). Ollama is treated as a remote provider because it manages its own process and model storage independently.
Model
A model is either:
- Local: a .gguf file on disk, associated with an engine.
- Remote: a model name / ID offered by a provider.
Lifecycle — local model
# 1. Define engine and model
engine = %Candil.Engine{
  alias: :llama_server,
  binary_dir: "/usr/local/bin",
  use_precompiled: true,
  precompiled_version: :latest,
  start_args: ["--host", "127.0.0.1"]
}

model = %Candil.Model{
  alias: :llama3,
  type: :local,
  model_dir: "/models",
  filename: "llama-3-8b-q4_k_m.gguf",
  download_url: "https://huggingface.co/.../llama-3-8b-q4_k_m.gguf",
  context_size: 8192,
  engine: :llama_server,
  usage: [:chat, :completion],
  model_args: ["--n-gpu-layers", "35"]
}

# 2. (Optional) download binary and model
:ok = Candil.download_engine(engine)
{:ok, _} = Candil.download_model(model)

# 3. Start engine serving the model
{:ok, _pid} = Candil.start_engine(engine, model)

# 4. Run inference
{:ok, response} = Candil.chat(:llama3, [
  %{role: "user", content: "Hello!"}
])

# 5. Stop engine
:ok = Candil.stop_engine(:llama3)

Lifecycle — remote model
provider = %Candil.Provider{
  alias: :openai,
  type: :openai,
  base_url: "https://api.openai.com",
  api_key: System.get_env("OPENAI_API_KEY")
}

model = %Candil.Model{
  alias: :gpt4o,
  type: :remote,
  name: "gpt-4o",
  context_size: 128_000,
  provider: :openai,
  usage: [:chat, :completion, :embeddings]
}

{:ok, response} = Candil.chat(model, provider, [
  %{role: "user", content: "Hello!"}
])
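Because `System.get_env/1` returns `nil` when the variable is unset, a provider built as above could end up with a `nil` API key. A minimal sketch of failing early, assuming a hypothetical `ProviderConfig` helper that is not part of Candil:

```elixir
defmodule ProviderConfig do
  @moduledoc "Hypothetical helper for illustration; not part of Candil."

  # Reads an API key from the environment and rejects missing or empty values.
  def fetch_api_key(var \\ "OPENAI_API_KEY") do
    case System.fetch_env(var) do
      {:ok, ""} -> {:error, {:empty_env, var}}
      {:ok, key} -> {:ok, key}
      :error -> {:error, {:missing_env, var}}
    end
  end
end
```

The returned key can then be placed in the `api_key` field of the provider struct.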
Functions
@spec chat(atom(), [Candil.Inference.message()], keyword()) :: {:ok, Candil.Inference.response()} | {:error, any()}
Runs a chat completion against a local model (identified by alias).
The engine must already be running via start_engine/2.
Options
- :temperature - sampling temperature (default: 0.7)
- :max_tokens - maximum tokens to generate (default: 512)
- :stop - list of stop sequences
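A call that overrides these options might look like the sketch below. The `ChatExample` module and its `ask/3` function are hypothetical. The chat function is passed in so the example does not assume a running engine; in practice it would be `&Candil.chat/3`.

```elixir
defmodule ChatExample do
  @moduledoc "Hypothetical wrapper for illustration; not part of Candil."

  # Option values chosen for illustration; they override the documented defaults.
  @opts [temperature: 0.2, max_tokens: 256, stop: ["\n\n"]]

  # chat_fun is expected to behave like Candil.chat/3:
  # (model_alias, messages, opts) -> {:ok, response} | {:error, reason}
  def ask(chat_fun, model_alias, prompt) do
    messages = [%{role: "user", content: prompt}]

    case chat_fun.(model_alias, messages, @opts) do
      {:ok, response} -> {:ok, response}
      {:error, reason} -> {:error, reason}
    end
  end
end

# With a running engine: ChatExample.ask(&Candil.chat/3, :llama3, "Hello!")
```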
@spec chat(
  Candil.Model.t(),
  Candil.Provider.t(),
  [Candil.Inference.message()],
  keyword()
) :: {:ok, Candil.Inference.response()} | {:error, any()}
Runs a chat completion against a remote model via a provider.
Options
Same as chat/3.
@spec download_engine(Candil.Engine.t()) :: :ok | {:error, binary()}
Downloads the appropriate precompiled llama.cpp binary for this engine.
Detects the current OS, architecture and GPU automatically. Does nothing
if use_precompiled is false.
@spec download_model(Candil.Model.t()) :: {:ok, binary()} | {:error, binary()}
Downloads a local model file to model.model_dir.
Does nothing if model.type is :remote.
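To avoid re-downloading a file that is already on disk, the target path can be checked first. This sketch assumes the file is stored at `model_dir` joined with `filename`, which is not stated explicitly here; `ModelPath` is a hypothetical helper.

```elixir
defmodule ModelPath do
  @moduledoc "Hypothetical helper for illustration; not part of Candil."

  # Assumed location of a local model file: model_dir/filename.
  def local_path(%{model_dir: dir, filename: filename}) do
    Path.join(dir, filename)
  end

  # True when the assumed model file already exists on disk.
  def downloaded?(model), do: model |> local_path() |> File.exists?()
end
```

Because the functions match on map keys, they accept a `%Candil.Model{}` struct as well as a plain map with the same fields.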
Runs an embeddings request against a local model.
The engine must be running and the model must have :embeddings in its
usage list.
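The usage requirement can be checked before making a request. A minimal sketch, assuming a hypothetical `UsageCheck` helper that works on any map or struct with a `usage` list:

```elixir
defmodule UsageCheck do
  @moduledoc "Hypothetical helper for illustration; not part of Candil."

  # Returns :ok when the model declares the capability, else an error tuple.
  def ensure(%{usage: usage}, capability) when is_list(usage) do
    if capability in usage, do: :ok, else: {:error, {:unsupported, capability}}
  end
end
```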
@spec embed(Candil.Model.t(), Candil.Provider.t(), [binary()], keyword()) :: {:ok, [[float()]]} | {:error, any()}
Runs an embeddings request against a remote model.
Returns true if an engine serving the given model alias is running and
responding to health checks.
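An engine may need some time to load a model after it starts, so callers may want to poll this check. The sketch below takes the check as a zero-arity function, because the name of the health-check function is not shown here; `EngineWait` and its default values are hypothetical.

```elixir
defmodule EngineWait do
  @moduledoc "Hypothetical polling helper for illustration; not part of Candil."

  # Calls check_fun up to `attempts` times, sleeping delay_ms after each failure.
  def wait_until(check_fun, attempts \\ 10, delay_ms \\ 200)

  def wait_until(_check_fun, 0, _delay_ms), do: {:error, :timeout}

  def wait_until(check_fun, attempts, delay_ms) when attempts > 0 do
    if check_fun.() do
      :ok
    else
      Process.sleep(delay_ms)
      wait_until(check_fun, attempts - 1, delay_ms)
    end
  end
end
```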
@spec start_engine(Candil.Engine.t(), Candil.Model.t()) :: {:ok, pid()} | {:error, binary()}
Starts a local llama-server engine loaded with model.
Returns {:ok, pid} where pid is the Candil.Engine.Server process.
The server process is registered under the model alias in
Candil.Registry.
@spec stop_engine(atom()) :: :ok | {:error, :not_running}
Stops a running engine identified by the model alias.
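To make sure an engine is stopped even when the work in between raises, the start and stop calls can be paired with `try/after`. In this sketch `EngineSession` is hypothetical and the three steps are passed in as functions; with Candil they would wrap `start_engine/2`, the inference call, and `stop_engine/1`.

```elixir
defmodule EngineSession do
  @moduledoc "Hypothetical wrapper for illustration; not part of Candil."

  # start_fun: -> {:ok, pid} | {:error, reason}
  # stop_fun:  called once after work_fun returns or raises
  # work_fun:  the work to run while the engine is up
  def run(start_fun, stop_fun, work_fun) do
    case start_fun.() do
      {:ok, _pid} ->
        try do
          {:ok, work_fun.()}
        after
          stop_fun.()
        end

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```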
@spec stream(
  atom(),
  [Candil.Inference.message()],
  Candil.Stream.stream_callback(),
  keyword()
) :: :ok | {:error, any()}
Streams a chat completion from a local engine.
The engine must be running. Calls callback for each token chunk.
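A callback that accumulates chunks into the full text could be sketched as follows. The shape of `Candil.Stream.stream_callback()` is not shown here, so this assumes a one-argument function that receives each chunk as a binary. `StreamCollector` is hypothetical, and the streaming function is passed in so the example runs without an engine.

```elixir
defmodule StreamCollector do
  @moduledoc "Hypothetical helper for illustration; not part of Candil."

  # stream_fun receives the callback and must return :ok or {:error, reason},
  # e.g. fn cb -> Candil.stream(:llama3, messages, cb) end
  def collect(stream_fun) do
    {:ok, acc} = Agent.start_link(fn -> [] end)

    # Assumption: each chunk arrives as a binary. Chunks are prepended, then reversed.
    callback = fn chunk when is_binary(chunk) -> Agent.update(acc, &[chunk | &1]) end

    result =
      case stream_fun.(callback) do
        :ok -> {:ok, acc |> Agent.get(& &1) |> Enum.reverse() |> Enum.join()}
        {:error, reason} -> {:error, reason}
      end

    Agent.stop(acc)
    result
  end
end
```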
@spec stream(
  Candil.Model.t(),
  Candil.Provider.t(),
  [Candil.Inference.message()],
  Candil.Stream.stream_callback(),
  keyword()
) :: :ok | {:error, any()}
Streams a chat completion from a remote provider.