ReqLLM.Providers.VLLM (ReqLLM v1.14.0)


vLLM provider – self-hosted OpenAI-compatible Chat Completions API.

Implementation

Uses the built-in OpenAI-style encoding/decoding defaults. vLLM serves an OpenAI-compatible API, so no custom request or response handling is needed.

Self-Hosted Configuration

vLLM is a self-hosted inference server. Users must:

  1. Deploy a vLLM service (via pip install, Docker, or other methods)
  2. Choose the model to serve and, optionally, the name the API exposes it under (e.g., --served-model-name my-model); that name is what follows vllm: in the model spec
  3. Set the base_url to point to their vLLM instance

A single vLLM server usually serves one model, so deployments with several models typically run one instance per model, often on different ports. Use the :base_url option per request, or configure model entries with their specific URLs.
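A minimal sketch of the per-model approach: keep a table of instance URLs and turn it into the :base_url option for each request. The module name VLLMSketch.Routing, the host my-server, and the port assignments are hypothetical; only the base_url option itself comes from this page.

```elixir
defmodule VLLMSketch.Routing do
  # Hypothetical table: one vLLM instance per served model, each on its own port.
  @instances %{
    "llama-3" => "http://my-server:8001/v1",
    "mistral-7b" => "http://my-server:8002/v1"
  }

  @doc "Returns the per-request options for a served model name."
  @spec opts_for(String.t()) :: {:ok, keyword()} | {:error, :unknown_model}
  def opts_for(model) do
    case Map.fetch(@instances, model) do
      {:ok, url} -> {:ok, [base_url: url]}
      :error -> {:error, :unknown_model}
    end
  end
end
```

A caller would then pass the result straight through, e.g. {:ok, opts} = VLLMSketch.Routing.opts_for("llama-3") followed by ReqLLM.generate_text("vllm:llama-3", "Hello!", opts).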

Authentication

By default, this provider reads the API key from the OPENAI_API_KEY environment variable. A value must be present, but vLLM typically does not check it unless the server was started with an API key (--api-key). Set any non-empty value if authentication is not configured on your vLLM server.

Configuration

# Add to .env file (automatically loaded)
OPENAI_API_KEY=any-value-for-vllm
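The rule above (the variable must be present and non-empty, while its value is usually ignored) can be sketched as a small check that yields a Bearer header. VLLMSketch.Auth and auth_header/0 are hypothetical names for illustration; the provider's own attach/3 performs the real Bearer setup, and this sketch does not claim to reproduce its code.

```elixir
defmodule VLLMSketch.Auth do
  # The variable name documented for this provider.
  @env_key "OPENAI_API_KEY"

  # Hypothetical check: the key must exist and be non-empty, even though
  # an unauthenticated vLLM server will not validate its value.
  @spec auth_header() :: {:ok, {String.t(), String.t()}} | {:error, String.t()}
  def auth_header do
    case System.get_env(@env_key) do
      key when is_binary(key) and key != "" ->
        {:ok, {"authorization", "Bearer " <> key}}

      _missing_or_empty ->
        {:error, "#{@env_key} must be set to a non-empty value"}
    end
  end
end
```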

Examples

# Basic usage with default localhost
ReqLLM.generate_text("vllm:my-local-model", "Hello!")

# With custom base_url for a specific vLLM instance
ReqLLM.generate_text("vllm:llama-3", "Hello!",
  base_url: "http://my-server:8001/v1"
)

# Streaming
{:ok, response} = ReqLLM.stream_text("vllm:mistral-7b", "Tell me a story")
ReqLLM.StreamResponse.tokens(response)
|> Stream.each(&IO.write/1)
|> Stream.run()

Summary

Functions

Default implementation of attach/3.

Default implementation of attach_stream/4.

Default implementation of build_body/1.

Default implementation of decode_response/1.

Default implementation of decode_stream_event/2.

Default implementation of encode_body/1.

Default implementation of extract_usage/2.

Default implementation of prepare_request/4.

Default implementation of translate_options/3.

Functions

attach(request, model_input, user_opts)

Default implementation of attach/3.

Sets up Bearer token authentication and standard pipeline steps.

attach_stream(model, context, opts, finch_name)

Default implementation of attach_stream/4.

Builds a complete streaming request in the OpenAI-compatible format.

base_url()

build_body(request)

Default implementation of build_body/1.

Builds the request body in the OpenAI-compatible format for chat and embedding operations.
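As a rough illustration of what an OpenAI-compatible chat body looks like, here is a sketch that assembles one as a plain map. VLLMSketch.Body is hypothetical, and the subset of optional keys (temperature, max_tokens, stream) is an assumption chosen for brevity; the field names follow the OpenAI Chat Completions schema, with model carrying the served model name.

```elixir
defmodule VLLMSketch.Body do
  # Optional parameters this sketch copies into the body (an assumed subset).
  @optional_keys [:temperature, :max_tokens, :stream]

  # Hypothetical: builds an OpenAI-style Chat Completions body as a map.
  # `model` is the served model name, i.e. the part after "vllm:".
  @spec chat_body(String.t(), [{String.t(), String.t()}], keyword()) :: map()
  def chat_body(model, messages, opts \\ []) do
    base = %{
      "model" => model,
      "messages" =>
        Enum.map(messages, fn {role, content} -> %{"role" => role, "content" => content} end)
    }

    # Copy only the whitelisted options, converting atom keys to JSON-style strings.
    opts
    |> Keyword.take(@optional_keys)
    |> Enum.reduce(base, fn {key, value}, acc -> Map.put(acc, Atom.to_string(key), value) end)
  end
end
```

Keeping the body as a plain map leaves JSON encoding to a separate step, which mirrors the split between build_body/1 and encode_body/1 on this page.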

decode_response(request_response)

Default implementation of decode_response/1.

Handles success/error responses with standard ReqLLM.Response creation.

decode_stream_event(event, model)

Default implementation of decode_stream_event/2.

Decodes server-sent events (SSE) in the OpenAI-compatible format.
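A sketch of the decoding idea, assuming each event's data has already been JSON-decoded into a map, or is the literal "[DONE]" terminator that OpenAI-compatible servers send. VLLMSketch.Stream, delta_text/1, and collect/1 are hypothetical names; the real decode_stream_event/2 works on ReqLLM's own event and chunk structures, which this sketch does not model.

```elixir
defmodule VLLMSketch.Stream do
  # Hypothetical: classifies one decoded SSE payload.
  # OpenAI-compatible servers, vLLM included, end the stream with "[DONE]".
  @spec delta_text(String.t() | map()) :: {:text, String.t()} | :done | :skip
  def delta_text("[DONE]"), do: :done

  def delta_text(%{"choices" => [%{"delta" => %{"content" => text}} | _]}) when is_binary(text),
    do: {:text, text}

  # Role-only deltas, null content, or other events carry no text.
  def delta_text(_other), do: :skip

  # Joins the text deltas in order, stopping at the terminator.
  @spec collect([String.t() | map()]) :: String.t()
  def collect(events) do
    events
    |> Enum.reduce_while([], fn event, acc ->
      case delta_text(event) do
        {:text, text} -> {:cont, [text | acc]}
        :skip -> {:cont, acc}
        :done -> {:halt, acc}
      end
    end)
    |> Enum.reverse()
    |> Enum.join()
  end
end
```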

default_base_url()

default_env_key()

Callback implementation for ReqLLM.Provider.default_env_key/0.

encode_body(request)

Default implementation of encode_body/1.

Encodes the request body in the OpenAI-compatible format for chat and embedding operations.

extract_usage(body, model)

Default implementation of extract_usage/2.

Extracts usage data from the standard usage field of the response body.
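A minimal sketch of reading the OpenAI-style usage object (prompt_tokens, completion_tokens, total_tokens) from a decoded body. VLLMSketch.Usage and the output keys input_tokens, output_tokens, and total_tokens are hypothetical choices for this example, not necessarily the shape extract_usage/2 returns.

```elixir
defmodule VLLMSketch.Usage do
  # Hypothetical: normalizes the OpenAI-style "usage" object of a decoded body.
  # Falls back to prompt + completion when total_tokens is absent.
  @spec extract(map()) :: {:ok, map()} | {:error, :no_usage}
  def extract(%{"usage" => %{"prompt_tokens" => input, "completion_tokens" => output} = usage})
      when is_integer(input) and is_integer(output) do
    {:ok,
     %{
       input_tokens: input,
       output_tokens: output,
       total_tokens: Map.get(usage, "total_tokens", input + output)
     }}
  end

  # Bodies without a usage object (e.g. intermediate streaming chunks) yield an error.
  def extract(_body), do: {:error, :no_usage}
end
```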

prepare_request(operation, model_spec, input, opts)

Default implementation of prepare_request/4.

Handles :chat, :object, and :embedding operations using OpenAI-compatible patterns.
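The operation-to-endpoint mapping can be sketched as follows, assuming the standard OpenAI-compatible paths /chat/completions and /embeddings under base_url, and assuming :object (structured output) reuses chat completions. VLLMSketch.Endpoints is hypothetical, and prepare_request/4 itself builds a full request rather than just a URL.

```elixir
defmodule VLLMSketch.Endpoints do
  # Hypothetical: maps a ReqLLM operation to the OpenAI-compatible path
  # appended to base_url. :object is assumed to reuse chat completions.
  @spec path_for(atom()) :: {:ok, String.t()} | {:error, {:unsupported_operation, atom()}}
  def path_for(op) when op in [:chat, :object], do: {:ok, "/chat/completions"}
  def path_for(:embedding), do: {:ok, "/embeddings"}
  def path_for(op), do: {:error, {:unsupported_operation, op}}

  # Joins base_url and path, tolerating a trailing slash on base_url.
  @spec url_for(String.t(), atom()) :: {:ok, String.t()} | {:error, term()}
  def url_for(base_url, op) do
    with {:ok, path} <- path_for(op), do: {:ok, String.trim_trailing(base_url, "/") <> path}
  end
end
```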

provider_extended_generation_schema()

provider_id()

provider_schema()

supported_provider_options()

translate_options(operation, model, opts)

Default implementation of translate_options/3.

Pass-through implementation that returns options unchanged.