Candil.Llm (Candil v1.0.0)


LLM management for Candil.

Provides a unified interface for running, managing and querying large language models — whether they run locally via llama.cpp or remotely via an API provider such as OpenAI, Anthropic or Ollama.

Concepts

Engine

An engine is a local llama-server binary that serves one model at a time over an OpenAI-compatible HTTP API. You can use a binary already installed on the machine or let Candil download the official precompiled release from the llama.cpp releases page.

Provider

A provider is a remote HTTP API (OpenAI, Anthropic, Ollama, or any OpenAI-compatible endpoint). Ollama is treated as a remote provider because it manages its own process and model storage independently.

Model

A model is either:

  • Local — a .gguf file on disk, associated with an engine.
  • Remote — a model name / ID offered by a provider.

Lifecycle — local model

# 1. Define engine and model
engine = %Candil.Engine{
  alias: :llama_server,
  binary_dir: "/usr/local/bin",
  use_precompiled: true,
  precompiled_version: :latest,
  start_args: ["--host", "127.0.0.1"]
}

model = %Candil.Model{
  alias: :llama3,
  type: :local,
  model_dir: "/models",
  filename: "llama-3-8b-q4_k_m.gguf",
  download_url: "https://huggingface.co/.../llama-3-8b-q4_k_m.gguf",
  context_size: 8192,
  engine: :llama_server,
  usage: [:chat, :completion],
  model_args: ["--n-gpu-layers", "35"]
}

# 2. (Optional) download binary and model
:ok = Candil.download_engine(engine)
{:ok, _} = Candil.download_model(model)

# 3. Start engine serving the model
{:ok, _pid} = Candil.start_engine(engine, model)

# 4. Run inference
{:ok, response} = Candil.chat(:llama3, [
  %{role: "user", content: "Hello!"}
])

# 5. Stop engine
:ok = Candil.stop_engine(:llama3)

Lifecycle — remote model

provider = %Candil.Provider{
  alias: :openai,
  type: :openai,
  base_url: "https://api.openai.com",
  api_key: System.get_env("OPENAI_API_KEY")
}

model = %Candil.Model{
  alias: :gpt4o,
  type: :remote,
  name: "gpt-4o",
  context_size: 128_000,
  provider: :openai,
  usage: [:chat, :completion, :embeddings]
}

# chat/4 takes the model struct (not its alias) and an explicit opts list
{:ok, response} = Candil.chat(model, provider, [
  %{role: "user", content: "Hello!"}
], [])

Summary

Functions

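chat(model_alias, messages, opts \\ [])

Runs a chat completion against a local model (identified by alias).

chat(model, provider, messages, opts)

Runs a chat completion against a remote model via a provider.

download_engine(engine)

Downloads the appropriate precompiled llama.cpp binary for this engine.

download_model(model)

Downloads a local model file to model.model_dir.

embed(model_alias, texts, opts \\ [])

Runs an embeddings request against a local model.

embed(model, provider, texts, opts)

Runs an embeddings request against a remote model.

engine_healthy?(model_alias)

Returns true if an engine serving the given model alias is running and responding to health checks.

start_engine(engine, model)

Starts a local llama-server engine loaded with model.

stop_engine(model_alias)

Stops a running engine identified by the model alias.

stream(model_alias, messages, callback, opts \\ [])

Streams a chat completion from a local engine.

stream(model, provider, messages, callback, opts)

Streams a chat completion from a remote provider.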

Functions

chat(model_alias, messages, opts \\ [])

@spec chat(atom(), [Candil.Inference.message()], keyword()) ::
  {:ok, Candil.Inference.response()} | {:error, any()}

Runs a chat completion against a local model (identified by alias).

The engine must already be running via start_engine/2.

Options

  • :temperature — sampling temperature (default: 0.7)
  • :max_tokens — maximum tokens to generate (default: 512)
  • :stop — list of stop sequences
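
As a rough sketch of how the documented defaults interact with caller-supplied options, the snippet below merges a keyword list of overrides over the defaults listed above. `ChatOptsSketch` and `resolve/1` are hypothetical names used only for illustration; how Candil merges options internally is not stated in this reference.

```elixir
defmodule ChatOptsSketch do
  # Defaults as documented for chat/3. :stop has no documented default,
  # so it is only present when the caller supplies it.
  @defaults [temperature: 0.7, max_tokens: 512]

  # Keyword.merge/2 lets values from the second list win on duplicate keys,
  # so caller options override the defaults.
  def resolve(opts) when is_list(opts) do
    Keyword.merge(@defaults, opts)
  end
end

opts = ChatOptsSketch.resolve(temperature: 0.2, stop: ["\n\n"])
# Keyword.get(opts, :temperature) is 0.2; :max_tokens keeps its default of 512.
```

The resulting list has the same shape as the `opts` argument of chat/3, for example `Candil.chat(:llama3, messages, temperature: 0.2, stop: ["\n\n"])`.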

chat(model, provider, messages, opts)

@spec chat(
  Candil.Model.t(),
  Candil.Provider.t(),
  [Candil.Inference.message()],
  keyword()
) ::
  {:ok, Candil.Inference.response()} | {:error, any()}

Runs a chat completion against a remote model via a provider.

Options

Same as chat/3.
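
Remote calls can fail transiently, and the spec above returns `{:error, any()}` in that case. The following is a minimal sketch of a caller-side retry wrapper around any function with that return shape. `RetrySketch` and `with_retry/3` are hypothetical helpers, not part of Candil, and nothing here implies that Candil retries on its own.

```elixir
defmodule RetrySketch do
  # Calls `fun` until it returns {:ok, _} or `attempts` calls have been made.
  # The last {:error, _} is returned unchanged when attempts run out.
  def with_retry(fun, attempts, delay_ms \\ 0)
      when is_function(fun, 0) and attempts > 0 do
    case fun.() do
      {:ok, _} = ok ->
        ok

      {:error, _} = err when attempts == 1 ->
        err

      {:error, _} ->
        Process.sleep(delay_ms)
        with_retry(fun, attempts - 1, delay_ms)
    end
  end
end

# Stand-in for: fn -> Candil.chat(model, provider, messages, []) end
always_failing = fn -> {:error, :timeout} end
{:error, :timeout} = RetrySketch.with_retry(always_failing, 3)
```

A fixed delay keeps the sketch short; a production wrapper would more likely back off exponentially and retry only on errors known to be transient.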

download_engine(engine)

@spec download_engine(Candil.Engine.t()) :: :ok | {:error, binary()}

Downloads the appropriate precompiled llama.cpp binary for this engine.

Detects the current OS, architecture and GPU automatically. Does nothing if use_precompiled is false.
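
To illustrate what OS and architecture detection can look like on the BEAM, here is a small sketch using `:os.type/0` and `:erlang.system_info/1`. `PlatformSketch` is a hypothetical module; this is not Candil's detection code, the returned atoms are made up for the example, and GPU detection is not covered.

```elixir
defmodule PlatformSketch do
  # Maps the Erlang OS family/name tuple to a coarse platform atom.
  def os do
    case :os.type() do
      {:unix, :darwin} -> :macos
      {:unix, :linux} -> :linux
      {:win32, _} -> :windows
      _ -> :other
    end
  end

  # Derives the CPU architecture from the system architecture string,
  # e.g. "x86_64-pc-linux-gnu" or "aarch64-apple-darwin...".
  # On Windows the string is typically "win32", which maps to :unknown here.
  def arch do
    triple = :erlang.system_info(:system_architecture) |> List.to_string()

    cond do
      String.starts_with?(triple, ["x86_64", "amd64"]) -> :x86_64
      String.starts_with?(triple, ["aarch64", "arm64"]) -> :arm64
      true -> :unknown
    end
  end
end

{PlatformSketch.os(), PlatformSketch.arch()}
```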

download_model(model)

@spec download_model(Candil.Model.t()) :: {:ok, binary()} | {:error, binary()}

Downloads a local model file to model.model_dir.

Does nothing if model.type is :remote.

embed(model_alias, texts, opts \\ [])

@spec embed(atom(), [binary()], keyword()) :: {:ok, [[float()]]} | {:error, any()}

Runs an embeddings request against a local model.

The engine must be running and the model must have :embeddings in its usage list.
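
The `{:ok, [[float()]]}` result holds one vector per input text. A common next step is to compare two vectors by cosine similarity, sketched below in plain Elixir. `EmbeddingSketch` is a hypothetical helper, and the two-dimensional vectors are stand-ins for real embeddings, which are much longer.

```elixir
defmodule EmbeddingSketch do
  # Cosine similarity between two equal-length vectors.
  # Raises ArithmeticError if either vector has zero length (norm 0.0).
  def cosine(a, b) when length(a) == length(b) do
    dot =
      a
      |> Enum.zip(b)
      |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)

    dot / (norm(a) * norm(b))
  end

  # Euclidean norm of a vector.
  defp norm(v) do
    v |> Enum.reduce(0.0, fn x, acc -> acc + x * x end) |> :math.sqrt()
  end
end

# Stand-in for the result of an embed call with two input texts.
{:ok, [v1, v2]} = {:ok, [[1.0, 0.0], [1.0, 1.0]]}
similarity = EmbeddingSketch.cosine(v1, v2)
# similarity is 1 / sqrt(2), roughly 0.7071
```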

embed(model, provider, texts, opts)

@spec embed(Candil.Model.t(), Candil.Provider.t(), [binary()], keyword()) ::
  {:ok, [[float()]]} | {:error, any()}

Runs an embeddings request against a remote model.

engine_healthy?(model_alias)

@spec engine_healthy?(atom()) :: boolean()

Returns true if an engine serving the given model alias is running and responding to health checks.
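
Because the check returns a plain boolean, a caller that wants to wait for a freshly started engine has to poll. The sketch below polls any zero-arity boolean function a bounded number of times. `HealthSketch` and `wait_until/3` are hypothetical helpers, not part of Candil.

```elixir
defmodule HealthSketch do
  # Polls `check` until it returns true or `retries` attempts have been used,
  # sleeping `interval_ms` between attempts.
  def wait_until(check, retries, interval_ms) when is_function(check, 0) do
    cond do
      check.() ->
        :ok

      retries <= 1 ->
        {:error, :timeout}

      true ->
        Process.sleep(interval_ms)
        wait_until(check, retries - 1, interval_ms)
    end
  end
end

# With a running system this would presumably be used as:
#   HealthSketch.wait_until(fn -> Candil.engine_healthy?(:llama3) end, 30, 1_000)
# Here a stand-in check keeps the example self-contained.
:ok = HealthSketch.wait_until(fn -> true end, 3, 10)
```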

start_engine(engine, model)

@spec start_engine(Candil.Engine.t(), Candil.Model.t()) ::
  {:ok, pid()} | {:error, binary()}

Starts a local llama-server engine loaded with model.

Returns {:ok, pid} where pid is the Candil.Engine.Server process. The server process is registered under the model alias in Candil.Registry.

stop_engine(model_alias)

@spec stop_engine(atom()) :: :ok | {:error, :not_running}

Stops a running engine identified by the model alias.

stream(model_alias, messages, callback, opts \\ [])

@spec stream(
  atom(),
  [Candil.Inference.message()],
  Candil.Stream.stream_callback(),
  keyword()
) ::
  :ok | {:error, any()}

Streams a chat completion from a local engine.

The engine must be running. Calls callback for each token chunk.
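
One way to use a per-chunk callback is to accumulate the chunks into the full response text. The sketch below assumes the callback is a one-argument function that receives each chunk as a binary; the actual shape of `Candil.Stream.stream_callback()` is not shown in this reference, so treat that as an assumption. `StreamSketch` is a hypothetical helper.

```elixir
defmodule StreamSketch do
  # Returns {callback, collect}. `callback` stores each chunk in an Agent;
  # `collect` returns the concatenated text and stops the Agent.
  def collector do
    {:ok, agent} = Agent.start_link(fn -> [] end)

    callback = fn chunk when is_binary(chunk) ->
      # Prepend for O(1) insertion; order is restored in `collect`.
      Agent.update(agent, fn chunks -> [chunk | chunks] end)
    end

    collect = fn ->
      text = Agent.get(agent, fn chunks -> chunks |> Enum.reverse() |> Enum.join() end)
      Agent.stop(agent)
      text
    end

    {callback, collect}
  end
end

{callback, collect} = StreamSketch.collector()
# Simulated chunks; with Candil, `callback` would be passed as the third
# argument of stream/4 instead.
Enum.each(["Hel", "lo", "!"], callback)
full_text = collect.()
# full_text is "Hello!"
```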

stream(model, provider, messages, callback, opts)

Streams a chat completion from a remote provider.