HuggingfaceClient.Inference.TGI (huggingface_client v0.1.0)

Client for HuggingFace Text Generation Inference (TGI) servers.

TGI is a high-performance inference server for deploying large language models. This client works with both:

  • Self-hosted TGI servers (docker run ghcr.io/huggingface/text-generation-inference)
  • HuggingFace Inference Endpoints powered by TGI

See: https://huggingface.co/docs/text-generation-inference

Quick start

# Connect to a local TGI server
client = HuggingfaceClient.Inference.TGI.new("http://localhost:8080")

# Or an Inference Endpoint
client = HuggingfaceClient.Inference.TGI.new(
  "https://xxx.aws.endpoints.huggingface.cloud",
  token: "hf_..."
)

# Generate text
{:ok, resp} = HuggingfaceClient.Inference.TGI.generate(client,
  inputs: "What is deep learning?",
  max_new_tokens: 200
)
IO.puts(resp["generated_text"])

# Chat completion (OpenAI-compatible)
{:ok, resp} = HuggingfaceClient.Inference.TGI.chat_completion(client,
  messages: [%{"role" => "user", "content" => "Hello!"}],
  max_tokens: 100
)

# Streaming
HuggingfaceClient.Inference.TGI.generate_stream(client,
  inputs: "Tell me a story about",
  max_new_tokens: 500
)
|> Enum.each(fn token -> IO.write(token["token"]["text"]) end)

Summary

Functions

chat_completion(client, opts) - OpenAI-compatible chat completion via TGI.

decode(client, opts) - Decodes token IDs back to text.

embed(client, opts) - Generates embeddings (for TEI-compatible endpoints).

generate(client, opts) - Generates text from an input prompt.

generate_batch(client, opts) - Batch text generation.

generate_stream(client, opts) - Streams text generation token by token.

health(client) - Checks server health. Returns :ok if healthy.

info(client) - Gets information about the running TGI server (model, config, etc.).

new(base_url, opts \\ []) - Creates a new TGI client.

tokenize(client, opts) - Tokenizes text and returns token count and IDs.

Types

t()

@type t() :: %HuggingfaceClient.Inference.TGI{
  base_url: String.t(),
  timeout: pos_integer(),
  token: String.t() | nil
}

Functions

chat_completion(client, opts)

@spec chat_completion(
  t(),
  keyword()
) :: {:ok, map()} | {:error, Exception.t()}

OpenAI-compatible chat completion via TGI.

Options

  • :messages — list of message maps with "role" and "content" (required)
  • :max_tokens — maximum tokens to generate
  • :temperature — sampling temperature
  • :top_p — nucleus sampling
  • :stop — stop sequences
  • :stream — if true, returns a streaming response (default: false)

Example

{:ok, resp} = HuggingfaceClient.Inference.TGI.chat_completion(client,
  messages: [
    %{"role" => "system",  "content" => "You are a helpful assistant."},
    %{"role" => "user",    "content" => "What is 2+2?"}
  ],
  max_tokens: 100
)
IO.puts(resp["choices"] |> hd() |> get_in(["message", "content"]))

decode(client, opts)

@spec decode(
  t(),
  keyword()
) :: {:ok, map()} | {:error, Exception.t()}

Decodes token IDs back to text.

Example

{:ok, result} = HuggingfaceClient.Inference.TGI.decode(client, ids: [1, 2, 3, 4])
IO.puts(result["decoded_text"])

embed(client, opts)

@spec embed(
  t(),
  keyword()
) :: {:ok, [[float()]]} | {:error, Exception.t()}

Generates embeddings (for TEI-compatible endpoints).

Example

{:ok, [embedding]} = HuggingfaceClient.Inference.TGI.embed(client,
  inputs: "Hello, world!"
)
IO.puts("Embedding dimensions: #{length(embedding)}")

generate(client, opts)

@spec generate(
  t(),
  keyword()
) :: {:ok, map()} | {:error, Exception.t()}

Generates text from an input prompt.

Options

  • :inputs — input text prompt (required)
  • :max_new_tokens — maximum number of tokens to generate (default: 20)
  • :temperature — sampling temperature (0.0 = greedy)
  • :top_p — nucleus sampling probability
  • :top_k — top-k sampling
  • :repetition_penalty — penalty applied to repeated tokens (values > 1.0 discourage repetition)
  • :stop — list of stop sequences
  • :seed — random seed for reproducibility
  • :do_sample — if false, use greedy decoding
  • :return_full_text — if true, include input in response
  • :best_of — generate N samples, return best (increases latency)
  • :watermark — add a watermark to the output

Example

{:ok, resp} = HuggingfaceClient.Inference.TGI.generate(client,
  inputs: "What is the capital of France?",
  max_new_tokens: 50,
  temperature: 0.7
)
IO.puts(resp["generated_text"])
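
Sampling can be made reproducible by fixing a seed. A sketch that combines only the options documented above:

{:ok, resp} = HuggingfaceClient.Inference.TGI.generate(client,
  inputs: "Write a haiku about autumn:",
  max_new_tokens: 60,
  do_sample: true,
  temperature: 0.8,
  seed: 42,
  stop: ["\n\n"]
)
IO.puts(resp["generated_text"])  # same output for the same seed and parameters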

generate_batch(client, opts)

@spec generate_batch(
  t(),
  keyword()
) :: {:ok, [map()]} | {:error, Exception.t()}

Batch text generation.

Options

  • :inputs — list of input prompts (required)
  • Same generation parameters as generate/2

Example

{:ok, results} = HuggingfaceClient.Inference.TGI.generate_batch(client,
  inputs: ["Hello world!", "What is AI?"],
  max_new_tokens: 100
)
Enum.each(results, fn r -> IO.puts(r["generated_text"]) end)

generate_stream(client, opts)

@spec generate_stream(
  t(),
  keyword()
) :: Enumerable.t()

Streams text generation token by token.

Returns an enumerable of token maps, each containing:

  • "token" — map with "id", "text", "logprob", "special"
  • "generated_text" — full text so far (only on last token)
  • "details" — generation details (only on last token)

Example

HuggingfaceClient.Inference.TGI.generate_stream(client,
  inputs: "Once upon a time",
  max_new_tokens: 200
)
|> Enum.each(fn token ->
  IO.write(token["token"]["text"])
end)
IO.puts("")  # newline at end

health(client)

@spec health(t()) :: :ok | {:error, Exception.t()}

Checks server health. Returns :ok if healthy.
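
Example

A typical readiness check before sending requests; the error value is an exception, per the spec above:

case HuggingfaceClient.Inference.TGI.health(client) do
  :ok -> IO.puts("TGI server is ready")
  {:error, error} -> IO.puts("TGI server unavailable: #{Exception.message(error)}")
end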

info(client)

@spec info(t()) :: {:ok, map()} | {:error, Exception.t()}

Gets information about the running TGI server (model, config, etc.).

Example

{:ok, info} = HuggingfaceClient.Inference.TGI.info(client)
IO.puts("Model: #{info["model_id"]}")
IO.puts("Max tokens: #{info["max_total_tokens"]}")

new(base_url, opts \\ [])

@spec new(
  String.t(),
  keyword()
) :: t()

Creates a new TGI client.

Parameters

  • base_url — URL of the TGI server (e.g. "http://localhost:8080")

Options

  • :token — Bearer token for authentication
  • :timeout — request timeout in milliseconds (default: 60_000)

Example

# Local server
client = HuggingfaceClient.Inference.TGI.new("http://localhost:8080")

# Inference endpoint with auth
client = HuggingfaceClient.Inference.TGI.new(
  "https://my-endpoint.aws.endpoints.huggingface.cloud",
  token: "hf_..."
)
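
In application code, the token is usually read from the environment rather than hard-coded. A small sketch that also raises the request timeout:

client = HuggingfaceClient.Inference.TGI.new(
  "https://my-endpoint.aws.endpoints.huggingface.cloud",
  token: System.fetch_env!("HF_TOKEN"),
  timeout: 120_000
)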

tokenize(client, opts)

@spec tokenize(
  t(),
  keyword()
) :: {:ok, map()} | {:error, Exception.t()}

Tokenizes text and returns token count and IDs.

Example

{:ok, result} = HuggingfaceClient.Inference.TGI.tokenize(client, inputs: "Hello, world!")
IO.puts("Token count: #{length(result["tokens"])}")