Client for HuggingFace Text Generation Inference (TGI) servers.
TGI is a high-performance inference server for deploying large language models. This client works with both:
- Self-hosted TGI servers (docker run ghcr.io/huggingface/text-generation-inference)
- HuggingFace Inference Endpoints powered by TGI
See: https://huggingface.co/docs/text-generation-inference
Quick start
# Connect to a local TGI server
client = HuggingfaceClient.Inference.TGI.new("http://localhost:8080")
# Or an Inference Endpoint
client = HuggingfaceClient.Inference.TGI.new(
  "https://xxx.aws.endpoints.huggingface.cloud",
  token: "hf_..."
)
# Generate text
{:ok, resp} = HuggingfaceClient.Inference.TGI.generate(client,
  inputs: "What is deep learning?",
  max_new_tokens: 200
)
IO.puts(resp["generated_text"])
# Chat completion (OpenAI-compatible)
{:ok, resp} = HuggingfaceClient.Inference.TGI.chat_completion(client,
  messages: [%{"role" => "user", "content" => "Hello!"}],
  max_tokens: 100
)
# Streaming
HuggingfaceClient.Inference.TGI.generate_stream(client,
  inputs: "Tell me a story about",
  max_new_tokens: 500
)
|> Enum.each(fn token -> IO.write(token["token"]["text"]) end)
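All request functions return {:ok, result} or {:error, exception}. A minimal error-handling sketch (the error term is an exception struct, per the specs below, so Exception.message/1 applies):
# Handle failures explicitly
case HuggingfaceClient.Inference.TGI.generate(client,
       inputs: "Hi",
       max_new_tokens: 10
     ) do
  {:ok, resp} -> IO.puts(resp["generated_text"])
  {:error, err} -> IO.puts("Request failed: #{Exception.message(err)}")
end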
Summary
Functions
OpenAI-compatible chat completion via TGI.
Decodes token IDs back to text.
Generates embeddings (for TEI-compatible endpoints).
Generates text from an input prompt.
Batch text generation.
Streams text generation token by token.
Checks server health. Returns :ok if healthy.
Gets information about the running TGI server (model, config, etc.).
Creates a new TGI client.
Tokenizes text and returns token count and IDs.
Types
@type t() :: %HuggingfaceClient.Inference.TGI{
        base_url: String.t(),
        timeout: pos_integer(),
        token: String.t() | nil
      }
Functions
@spec chat_completion(t(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
OpenAI-compatible chat completion via TGI.
Options
- :messages - list of message maps with "role" and "content" (required)
- :max_tokens - maximum tokens to generate
- :temperature - sampling temperature
- :top_p - nucleus sampling
- :stop - stop sequences
- :stream - if true, returns streaming response (default: false)
Example
{:ok, resp} = HuggingfaceClient.Inference.TGI.chat_completion(client,
  messages: [
    %{"role" => "system", "content" => "You are a helpful assistant."},
    %{"role" => "user", "content" => "What is 2+2?"}
  ],
  max_tokens: 100
)
IO.puts(resp["choices"] |> hd() |> get_in(["message", "content"]))
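The endpoint keeps no conversation state, so multi-turn chats are built by re-sending the full message history. A sketch under that assumption, using the OpenAI-compatible message shape from above:
# First turn
history = [%{"role" => "user", "content" => "What is 2+2?"}]
{:ok, resp} = HuggingfaceClient.Inference.TGI.chat_completion(client,
  messages: history,
  max_tokens: 50
)
reply = resp["choices"] |> hd() |> get_in(["message", "content"])
# Second turn: append the assistant reply and the follow-up question
history = history ++ [
  %{"role" => "assistant", "content" => reply},
  %{"role" => "user", "content" => "And multiplied by 3?"}
]
{:ok, resp2} = HuggingfaceClient.Inference.TGI.chat_completion(client,
  messages: history,
  max_tokens: 50
)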
@spec decode(t(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
Decodes token IDs back to text.
Example
{:ok, result} = HuggingfaceClient.Inference.TGI.decode(client, ids: [1, 2, 3, 4])
IO.puts(result["decoded_text"])
@spec embed(t(), keyword()) :: {:ok, [[float()]]} | {:error, Exception.t()}
Generates embeddings (for TEI-compatible endpoints).
Example
{:ok, [embedding]} = HuggingfaceClient.Inference.TGI.embed(client,
  inputs: "Hello, world!"
)
IO.puts("Embedding dimensions: #{length(embedding)}")
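Since the return value is a list of float vectors (per the {:ok, [[float()]]} spec), similarity between two texts can be computed client-side. A sketch, assuming each single-input call returns exactly one vector:
{:ok, [a]} = HuggingfaceClient.Inference.TGI.embed(client, inputs: "cat")
{:ok, [b]} = HuggingfaceClient.Inference.TGI.embed(client, inputs: "kitten")
# Cosine similarity of the two embedding vectors
dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
norm = fn v -> v |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt() end
IO.puts("Cosine similarity: #{dot / (norm.(a) * norm.(b))}")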
@spec generate(t(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
Generates text from an input prompt.
Options
- :inputs - input text prompt (required)
- :max_new_tokens - maximum number of tokens to generate (default: 20)
- :temperature - sampling temperature (0.0 = greedy)
- :top_p - nucleus sampling probability
- :top_k - top-k sampling
- :repetition_penalty - penalize repeated tokens (> 1.0 to penalize)
- :stop - list of stop sequences
- :seed - random seed for reproducibility
- :do_sample - if false, use greedy decoding
- :return_full_text - if true, include input in response
- :best_of - generate N samples, return best (increases latency)
- :watermark - add a watermark to the output
Example
{:ok, resp} = HuggingfaceClient.Inference.TGI.generate(client,
  inputs: "What is the capital of France?",
  max_new_tokens: 50,
  temperature: 0.7
)
IO.puts(resp["generated_text"])
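The sampling options compose. A sketch exercising several of them at once, with :seed pinned for reproducible sampled output (option names as documented above):
{:ok, resp} = HuggingfaceClient.Inference.TGI.generate(client,
  inputs: "Q: What is TGI?\nA:",
  max_new_tokens: 100,
  temperature: 0.8,
  top_p: 0.95,
  repetition_penalty: 1.1,
  stop: ["\nQ:"],
  seed: 42
)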
@spec generate_batch(t(), keyword()) :: {:ok, [map()]} | {:error, Exception.t()}
Batch text generation.
Options
- :inputs - list of input prompts (required)
- Same generation parameters as generate/2
Example
{:ok, results} = HuggingfaceClient.Inference.TGI.generate_batch(client,
  inputs: ["Hello world!", "What is AI?"],
  max_new_tokens: 100
)
Enum.each(results, fn r -> IO.puts(r["generated_text"]) end)
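To pair each prompt with its completion, zip inputs and results. This sketch assumes results come back in input order, which the docs above do not state explicitly:
prompts = ["Hello world!", "What is AI?"]
{:ok, results} = HuggingfaceClient.Inference.TGI.generate_batch(client,
  inputs: prompts,
  max_new_tokens: 50
)
prompts
|> Enum.zip(results)
|> Enum.each(fn {prompt, r} -> IO.puts("#{prompt} -> #{r["generated_text"]}") end)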
@spec generate_stream(t(), keyword()) :: Enumerable.t()
Streams text generation token by token.
Returns an enumerable of token maps, each containing:
"token"— map with"id","text","logprob","special""generated_text"— full text so far (only on last token)"details"— generation details (only on last token)
Example
HuggingfaceClient.Inference.TGI.generate_stream(client,
  inputs: "Once upon a time",
  max_new_tokens: 200
)
|> Enum.each(fn token ->
  IO.write(token["token"]["text"])
end)
IO.puts("") # newline at end
@spec health(t()) :: :ok | {:error, Exception.t()}
Checks server health. Returns :ok if healthy.
@spec info(t()) :: {:ok, map()} | {:error, Exception.t()}
Gets information about the running TGI server (model, config, etc.).
Example
{:ok, info} = HuggingfaceClient.Inference.TGI.info(client)
IO.puts("Model: #{info["model_id"]}")
IO.puts("Max tokens: #{info["max_total_tokens"]}")
@spec new(String.t(), keyword()) :: t()
Creates a new TGI client.
Parameters
- base_url - URL of the TGI server (e.g. "http://localhost:8080")
Options
- :token - Bearer token for authentication
- :timeout - request timeout in milliseconds (default: 60_000)
Example
# Local server
client = HuggingfaceClient.Inference.TGI.new("http://localhost:8080")
# Inference endpoint with auth
client = HuggingfaceClient.Inference.TGI.new(
  "https://my-endpoint.aws.endpoints.huggingface.cloud",
  token: "hf_..."
)
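Long generations can exceed the default 60-second timeout; it can be raised per client via the :timeout option documented above:
# Allow up to 5 minutes per request
client = HuggingfaceClient.Inference.TGI.new("http://localhost:8080", timeout: 300_000)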
@spec tokenize(t(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
Tokenizes text and returns token count and IDs.
Example
{:ok, result} = HuggingfaceClient.Inference.TGI.tokenize(client, inputs: "Hello, world!")
IO.puts("Token count: #{length(result["tokens"])}")