ExLLM User Guide

This comprehensive guide covers all features and capabilities of the ExLLM library.

Table of Contents

  1. Installation and Setup
  2. Configuration
  3. Basic Usage
  4. Providers
  5. Chat Completions
  6. Streaming
  7. Session Management
  8. Context Management
  9. Function Calling
  10. Vision and Multimodal
  11. Embeddings
  12. Structured Outputs
  13. Cost Tracking
  14. Error Handling and Retries
  15. Caching
  16. Response Caching
  17. Model Discovery
  18. Provider Capabilities
  19. Logging
  20. Testing with Mock Adapter
  21. Advanced Topics

Installation and Setup

Adding to Your Project

Add ExLLM to your mix.exs dependencies:

def deps do
  [
    {:ex_llm, "~> 0.4.1"},
    # Included dependencies (automatically installed with ex_llm):
    # - {:instructor, "~> 0.1.0"} - For structured outputs
    # - {:bumblebee, "~> 0.5"} - For local model inference
    # - {:nx, "~> 0.7"} - For numerical computing
    
    # Optional hardware acceleration (choose one):
    # {:exla, "~> 0.7"}  # For CUDA/ROCm GPUs
    # {:emlx, github: "elixir-nx/emlx", branch: "main"}  # For Apple Silicon
  ]
end

Run mix deps.get to install the dependencies.

Dependencies

  • Req: HTTP client (automatically included)
  • Jason: JSON parser (automatically included)
  • Instructor: Structured outputs with schema validation (automatically included)
  • Bumblebee: Local model inference (automatically included)
  • Nx: Numerical computing (automatically included)
  • EXLA: CUDA/ROCm GPU acceleration (optional)
  • EMLX: Apple Silicon Metal acceleration (optional)

Configuration

ExLLM supports multiple configuration methods to suit different use cases.

Environment Variables

The simplest way to configure ExLLM:

# OpenAI
export OPENAI_API_KEY="sk-..."
export OPENAI_API_BASE="https://api.openai.com/v1"  # Optional custom endpoint

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

# Google Gemini
export GOOGLE_API_KEY="..."
# or
export GEMINI_API_KEY="..."

# Groq
export GROQ_API_KEY="gsk_..."

# OpenRouter
export OPENROUTER_API_KEY="sk-or-..."

# X.AI
export XAI_API_KEY="xai-..."

# Mistral AI
export MISTRAL_API_KEY="..."

# Perplexity
export PERPLEXITY_API_KEY="pplx-..."

# AWS Bedrock
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-east-1"

# Ollama
export OLLAMA_API_BASE="http://localhost:11434"

# LM Studio
export LMSTUDIO_API_BASE="http://localhost:1234"

Static Configuration

For more control, use static configuration:

config = %{
  openai: %{
    api_key: "sk-...",
    api_base: "https://api.openai.com/v1",
    default_model: "gpt-4o"
  },
  anthropic: %{
    api_key: "sk-ant-...",
    default_model: "claude-3-5-sonnet-20241022"
  }
}

{:ok, provider} = ExLLM.ConfigProvider.Static.start_link(config)

# Use with config_provider option
{:ok, response} = ExLLM.chat(:openai, messages, config_provider: provider)

Custom Configuration Provider

Implement your own configuration provider:

defmodule MyApp.ConfigProvider do
  @behaviour ExLLM.ConfigProvider
  
  def get([:openai, :api_key]), do: fetch_from_vault("openai_key")
  def get([:anthropic, :api_key]), do: fetch_from_vault("anthropic_key")
  def get(_path), do: nil
  
  def get_all() do
    %{
      openai: %{api_key: fetch_from_vault("openai_key")},
      anthropic: %{api_key: fetch_from_vault("anthropic_key")}
    }
  end
end

# Use it
{:ok, response} = ExLLM.chat(:openai, messages, 
  config_provider: MyApp.ConfigProvider
)

Basic Usage

Simple Chat

messages = [
  %{role: "user", content: "Hello, how are you?"}
]

{:ok, response} = ExLLM.chat(:openai, messages)
IO.puts(response.content)

Provider/Model Syntax

# Use provider/model string syntax
{:ok, response} = ExLLM.chat("anthropic/claude-3-haiku-20240307", messages)

# Equivalent to
{:ok, response} = ExLLM.chat(:anthropic, messages, 
  model: "claude-3-haiku-20240307"
)

Response Structure

%ExLLM.Types.LLMResponse{
  content: "I'm doing well, thank you!",
  model: "gpt-4o",
  finish_reason: "stop",
  usage: %{
    input_tokens: 12,
    output_tokens: 8,
    total_tokens: 20
  },
  cost: %{
    input_cost: 0.00006,
    output_cost: 0.00016,
    total_cost: 0.00022,
    currency: "USD"
  }
}

Providers

Supported Providers

ExLLM supports these providers out of the box:

  • :openai - OpenAI GPT models
  • :anthropic - Anthropic Claude models
  • :gemini - Google Gemini models
  • :groq - Groq fast inference
  • :mistral - Mistral AI models
  • :perplexity - Perplexity search-enhanced models
  • :ollama - Local models via Ollama
  • :lmstudio - Local models via LM Studio
  • :bedrock - AWS Bedrock
  • :openrouter - OpenRouter (300+ models)
  • :xai - X.AI Grok models
  • :bumblebee - Local models via Bumblebee/NX
  • :mock - Mock adapter for testing

Checking Provider Configuration

# Check if a provider is configured
if ExLLM.configured?(:openai) do
  {:ok, response} = ExLLM.chat(:openai, messages)
end

# Get default model for a provider
model = ExLLM.default_model(:anthropic)
# => "claude-3-5-sonnet-20241022"

# List available models
{:ok, models} = ExLLM.list_models(:openai)
for model <- models do
  IO.puts("#{model.id}: #{model.context_window} tokens")
end

Chat Completions

Basic Options

{:ok, response} = ExLLM.chat(:openai, messages,
  model: "gpt-4o",           # Specific model
  temperature: 0.7,          # 0.0-1.0, higher = more creative
  max_tokens: 1000,          # Max response length
  top_p: 0.9,                # Nucleus sampling
  frequency_penalty: 0.5,    # Reduce repetition
  presence_penalty: 0.5,     # Encourage new topics
  stop: ["\n\n", "END"],     # Stop sequences
  seed: 12345,               # Reproducible outputs
  timeout: 60_000            # Request timeout in ms (default: provider-specific)
)

Timeout Configuration

Different providers have different timeout requirements. ExLLM allows you to configure timeouts per request:

# Ollama with function calling (can be slow)
{:ok, response} = ExLLM.chat(:ollama, messages,
  functions: functions,
  timeout: 300_000  # 5 minutes
)

# Quick requests with shorter timeout
{:ok, response} = ExLLM.chat(:openai, messages,
  timeout: 30_000   # 30 seconds
)

Default timeouts:

  • Ollama: 120,000ms (2 minutes) - Local models can be slower
  • Other providers: Use their HTTP client defaults (typically 30-60 seconds)
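
If you call several providers from one code path, it can help to centralize this timeout policy in one place. A minimal sketch, assuming a wrapper module of your own (the module name, provider list, and values are illustrative, not part of ExLLM):

defmodule MyApp.LLM do
  # Longer timeouts for local providers; adjust to your hardware
  @timeouts %{ollama: 300_000, lmstudio: 300_000}

  def chat(provider, messages, opts \\ []) do
    timeout = Map.get(@timeouts, provider, 60_000)
    ExLLM.chat(provider, messages, Keyword.put_new(opts, :timeout, timeout))
  end
end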

System Messages

messages = [
  %{role: "system", content: "You are a helpful coding assistant."},
  %{role: "user", content: "How do I read a file in Elixir?"}
]

{:ok, response} = ExLLM.chat(:openai, messages)

Multi-turn Conversations

conversation = [
  %{role: "user", content: "What's the capital of France?"},
  %{role: "assistant", content: "The capital of France is Paris."},
  %{role: "user", content: "What's the population?"}
]

{:ok, response} = ExLLM.chat(:openai, conversation)

Streaming

Basic Streaming

{:ok, stream} = ExLLM.stream_chat(:openai, messages)

for chunk <- stream do
  case chunk do
    %{content: content} when content != nil ->
      IO.write(content)
      
    %{finish_reason: reason} when reason != nil ->
      IO.puts("\nFinished: #{reason}")
      
    _ ->
      # Other chunk types (role, etc.)
      :ok
  end
end

Streaming with Callback

{:ok, stream} = ExLLM.stream_chat(:openai, messages,
  on_chunk: fn chunk ->
    if chunk.content, do: IO.write(chunk.content)
  end
)

# Consume the stream
Enum.to_list(stream)

Collecting Streamed Response

{:ok, stream} = ExLLM.stream_chat(:openai, messages)

# Collect all chunks into a single response
full_content = 
  stream
  |> Enum.map(& &1.content)
  |> Enum.reject(&is_nil/1)
  |> Enum.join("")

Stream Recovery

Enable automatic stream recovery for interrupted streams:

{:ok, stream} = ExLLM.stream_chat(:openai, messages,
  stream_recovery: true,
  recovery_strategy: :exact  # :exact, :paragraph, or :summarize
)

# If stream is interrupted, you can resume
{:ok, resumed_stream} = ExLLM.resume_stream(recovery_id)

Session Management

Sessions provide stateful conversation management with automatic token tracking.

Creating and Using Sessions

# Create a new session
session = ExLLM.new_session(:openai, name: "Customer Support")

# Chat with session (automatically manages message history)
{:ok, {response, session}} = ExLLM.chat_with_session(
  session,
  "What's the weather like?"
)

# Continue the conversation
{:ok, {response2, session}} = ExLLM.chat_with_session(
  session,
  "What should I wear?"
)

# Check token usage
total_tokens = ExLLM.session_token_usage(session)
IO.puts("Total tokens used: #{total_tokens}")

Managing Session Messages

# Add messages manually
session = ExLLM.add_session_message(session, "user", "Hello!")
session = ExLLM.add_session_message(session, "assistant", "Hi there!")

# Get message history
messages = ExLLM.get_session_messages(session)
recent_10 = ExLLM.get_session_messages(session, 10)

# Clear messages but keep session metadata
session = ExLLM.clear_session(session)

Persisting Sessions

# Save session to JSON
{:ok, json} = ExLLM.save_session(session)
File.write!("session.json", json)

# Load session from JSON
{:ok, json} = File.read("session.json")
{:ok, restored_session} = ExLLM.load_session(json)

Session with Context

# Create session with default context
session = ExLLM.new_session(:openai,
  name: "Tech Support",
  context: %{
    temperature: 0.3,
    system_message: "You are a technical support agent."
  }
)

# Context is automatically applied to all chats
{:ok, {response, session}} = ExLLM.chat_with_session(session, "Help!")

Context Management

Automatically manage conversation context to fit within model limits.

Context Window Validation

# Check if messages fit in context window
case ExLLM.validate_context(messages, provider: :openai, model: "gpt-4") do
  {:ok, token_count} ->
    IO.puts("Messages use #{token_count} tokens")
    
  {:error, reason} ->
    IO.puts("Messages too large: #{reason}")
end

# Get context window size for a model
window_size = ExLLM.context_window_size(:anthropic, "claude-3-opus-20240229")
# => 200000

Automatic Message Truncation

# Prepare messages to fit in context window
truncated = ExLLM.prepare_messages(long_conversation,
  provider: :openai,
  model: "gpt-4",
  max_tokens: 4000,           # Reserve tokens for response
  strategy: :sliding_window,   # or :smart
  preserve_messages: 5         # Always keep last 5 messages
)

Truncation Strategies

  1. :sliding_window - Keep most recent messages
  2. :smart - Preserve system messages and recent context

# Smart truncation preserves important context
{:ok, response} = ExLLM.chat(:openai, very_long_conversation,
  strategy: :smart,
  preserve_messages: 10
)

Context Statistics

stats = ExLLM.context_stats(messages)
# => %{
#   message_count: 20,
#   total_tokens: 1500,
#   by_role: %{"user" => 10, "assistant" => 9, "system" => 1},
#   avg_tokens_per_message: 75
# }

Function Calling

Enable AI models to call functions/tools in your application.

Basic Function Calling

# Define available functions
functions = [
  %{
    name: "get_weather",
    description: "Get current weather for a location",
    parameters: %{
      type: "object",
      properties: %{
        location: %{
          type: "string",
          description: "City and state, e.g. San Francisco, CA"
        },
        unit: %{
          type: "string",
          enum: ["celsius", "fahrenheit"],
          description: "Temperature unit"
        }
      },
      required: ["location"]
    }
  }
]

# Let the AI decide when to call functions
{:ok, response} = ExLLM.chat(:openai, 
  [%{role: "user", content: "What's the weather in NYC?"}],
  functions: functions,
  function_call: "auto"  # or "none" or %{name: "get_weather"}
)

Handling Function Calls

# Parse function calls from response
case ExLLM.parse_function_calls(response, :openai) do
  {:ok, [function_call | _]} ->
    # AI wants to call a function
    IO.inspect(function_call)
    # => %ExLLM.FunctionCalling.FunctionCall{
    #      name: "get_weather",
    #      arguments: %{"location" => "New York, NY"}
    #    }
    
    # Execute the function
    result = get_weather_impl(function_call.arguments["location"])
    
    # Format result for conversation
    function_message = ExLLM.format_function_result(
      %ExLLM.FunctionCalling.FunctionResult{
        name: "get_weather",
        result: result
      },
      :openai
    )
    
    # Continue the conversation with the assistant's message that requested
    # the call (`response_message`) followed by the function result
    messages = messages ++ [response_message, function_message]
    {:ok, final_response} = ExLLM.chat(:openai, messages)
    
  {:ok, []} ->
    # No function call, regular response
    IO.puts(response.content)
end

Function Execution

# Define functions with handlers
functions_with_handlers = [
  %{
    name: "calculate",
    description: "Perform mathematical calculations",
    parameters: %{
      type: "object",
      properties: %{
        expression: %{type: "string"}
      },
      required: ["expression"]
    },
    handler: fn args ->
      # Your implementation (avoid Code.eval_string on untrusted input in real code)
      {result, _} = Code.eval_string(args["expression"])
      %{result: result}
    end
  }
]

# Execute function automatically
{:ok, result} = ExLLM.execute_function(function_call, functions_with_handlers)

Provider-Specific Notes

Different providers use different terminology:

  • OpenAI: "functions" and "function_call"
  • Anthropic: "tools" and "tool_use"
  • ExLLM normalizes these automatically
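
As a hedged sketch of this normalization, the same function definitions can be sent to both providers and the calls parsed with the same helper (`functions` and `messages` are the ones defined earlier in this section):

# ExLLM translates the definitions into each provider's native format
{:ok, openai_response} = ExLLM.chat(:openai, messages, functions: functions)
{:ok, anthropic_response} = ExLLM.chat(:anthropic, messages, functions: functions)

# Parsing is normalized the same way
{:ok, openai_calls} = ExLLM.parse_function_calls(openai_response, :openai)
{:ok, anthropic_calls} = ExLLM.parse_function_calls(anthropic_response, :anthropic)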

Vision and Multimodal

Work with images and other media types.

Basic Image Analysis

# Create a vision message
{:ok, message} = ExLLM.vision_message(
  "What's in this image?",
  ["path/to/image.jpg"]
)

# Send to vision-capable model
{:ok, response} = ExLLM.chat(:openai, [message],
  model: "gpt-4o"  # or any vision model
)

Multiple Images

{:ok, message} = ExLLM.vision_message(
  "Compare these images",
  [
    "image1.jpg",
    "image2.jpg",
    "https://example.com/image3.png"  # URLs work too
  ],
  detail: :high  # :low, :high, or :auto
)

Loading Images

# Load image with options
{:ok, image_part} = ExLLM.load_image("photo.jpg",
  detail: :high,
  resize: {1024, 1024}  # Optional resizing
)

# Build custom message
message = %{
  role: "user",
  content: [
    %{type: "text", text: "Describe this image"},
    image_part
  ]
}

Checking Vision Support

# Check if provider/model supports vision
if ExLLM.supports_vision?(:anthropic, "claude-3-opus-20240229") do
  # This model supports vision
end

# Find all vision-capable models
vision_models = ExLLM.find_models_with_features([:vision])

Text Extraction from Images

# OCR-like functionality
{:ok, text} = ExLLM.extract_text_from_image(:openai, "document.png",
  model: "gpt-4o",
  prompt: "Extract all text, preserving formatting and layout"
)

Image Analysis

# Analyze multiple images
{:ok, analysis} = ExLLM.analyze_images(:anthropic,
  ["chart1.png", "chart2.png"],
  "Compare these charts and identify trends",
  model: "claude-3-5-sonnet-20241022"
)

Embeddings

Generate vector embeddings for semantic search and similarity.

Basic Embeddings

# Generate embeddings for text
{:ok, response} = ExLLM.embeddings(:openai, 
  ["Hello world", "Goodbye world"]
)

# Response structure
%ExLLM.Types.EmbeddingResponse{
  embeddings: [
    [0.0123, -0.0456, ...],  # 1536 dimensions for text-embedding-3-small
    [0.0789, -0.0234, ...]
  ],
  model: "text-embedding-3-small",
  usage: %{total_tokens: 8}
}

Embedding Options

{:ok, response} = ExLLM.embeddings(:openai, texts,
  model: "text-embedding-3-large",
  dimensions: 256,  # Reduce dimensions (model-specific)
  encoding_format: "float"  # or "base64"
)

# Calculate similarity between embeddings
similarity = ExLLM.cosine_similarity(embedding1, embedding2)
# => 0.87 (1.0 = identical, 0.0 = orthogonal, -1.0 = opposite)

# Find similar items
query_embedding = get_embedding("search query")
items = [
  %{id: 1, text: "Document 1", embedding: [...]},
  %{id: 2, text: "Document 2", embedding: [...]},
  # ...
]

results = ExLLM.find_similar(query_embedding, items,
  top_k: 10,
  threshold: 0.7  # Minimum similarity
)
# => [
#   %{item: %{id: 2, ...}, similarity: 0.92},
#   %{item: %{id: 5, ...}, similarity: 0.85},
#   ...
# ]
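
Putting these pieces together, a minimal semantic-search sketch using only the functions shown above (the documents and query are illustrative):

documents = ["Elixir is a functional language", "Paris is the capital of France"]

{:ok, doc_response} = ExLLM.embeddings(:openai, documents)
{:ok, query_response} = ExLLM.embeddings(:openai, ["Which language is functional?"])

items =
  documents
  |> Enum.zip(doc_response.embeddings)
  |> Enum.map(fn {text, embedding} -> %{text: text, embedding: embedding} end)

# Most similar document to the query
[best | _] = ExLLM.find_similar(List.first(query_response.embeddings), items, top_k: 1)
IO.puts("Best match: #{best.item.text} (#{best.similarity})")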

Listing Embedding Models

{:ok, models} = ExLLM.list_embedding_models(:openai)
for model <- models do
  IO.puts("#{model.name}: #{model.dimensions} dimensions")
end

Caching Embeddings

# Enable caching for embeddings
{:ok, response} = ExLLM.embeddings(:openai, texts,
  cache: true,
  cache_ttl: :timer.hours(24)
)

Structured Outputs

Generate structured data with schema validation using Instructor integration.

Basic Structured Output

defmodule EmailClassification do
  use Ecto.Schema
  
  embedded_schema do
    field :category, Ecto.Enum, values: [:personal, :work, :spam]
    field :priority, Ecto.Enum, values: [:high, :medium, :low]
    field :summary, :string
  end
end

{:ok, result} = ExLLM.chat(:openai, 
  [%{role: "user", content: "Classify this email: Meeting tomorrow at 3pm"}],
  response_model: EmailClassification,
  max_retries: 3  # Retry on validation failure
)

IO.inspect(result)
# => %EmailClassification{
#      category: :work,
#      priority: :high,
#      summary: "Meeting scheduled for tomorrow"
#    }

Complex Schemas

defmodule ProductExtraction do
  use Ecto.Schema
  
  embedded_schema do
    field :name, :string
    field :price, :decimal
    field :currency, :string
    field :in_stock, :boolean
    
    embeds_many :features, Feature do
      field :name, :string
      field :value, :string
    end
  end
  
  def changeset(struct, params) do
    struct
    |> Ecto.Changeset.cast(params, [:name, :price, :currency, :in_stock])
    |> Ecto.Changeset.cast_embed(:features)
    |> Ecto.Changeset.validate_required([:name, :price])
    |> Ecto.Changeset.validate_number(:price, greater_than: 0)
  end
end

{:ok, product} = ExLLM.chat(:anthropic,
  [%{role: "user", content: "Extract product info from: iPhone 15 Pro, $999, 256GB storage, A17 chip"}],
  response_model: ProductExtraction
)

Lists and Collections

defmodule TodoList do
  use Ecto.Schema
  
  embedded_schema do
    embeds_many :todos, Todo do
      field :task, :string
      field :priority, Ecto.Enum, values: [:high, :medium, :low]
      field :completed, :boolean, default: false
    end
  end
end

{:ok, todo_list} = ExLLM.chat(:openai,
  [%{role: "user", content: "Create a todo list for launching a new feature"}],
  response_model: TodoList
)

Cost Tracking

ExLLM automatically tracks API costs for all operations.

Automatic Cost Tracking

{:ok, response} = ExLLM.chat(:openai, messages)

# Cost is included in response
IO.inspect(response.cost)
# => %{
#   input_cost: 0.00003,
#   output_cost: 0.00006,
#   total_cost: 0.00009,
#   currency: "USD"
# }

# Format for display
IO.puts(ExLLM.format_cost(response.cost.total_cost))
# => "$0.009¢"

Manual Cost Calculation

usage = %{input_tokens: 1000, output_tokens: 500}
cost = ExLLM.calculate_cost(:openai, "gpt-4", usage)
# => %{
#   input_cost: 0.03,
#   output_cost: 0.06,
#   total_cost: 0.09,
#   currency: "USD",
#   per_million_input: 30.0,
#   per_million_output: 120.0
# }

Token Estimation

# Estimate tokens for text
tokens = ExLLM.estimate_tokens("Hello, world!")
# => 4

# Estimate for messages
tokens = ExLLM.estimate_tokens([
  %{role: "user", content: "Hi"},
  %{role: "assistant", content: "Hello!"}
])
# => 12
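
The two estimators combine into a rough pre-flight cost check before a request is sent. A hedged sketch (the model and response budget are assumptions):

estimated_input = ExLLM.estimate_tokens(messages)
max_tokens = 500  # response budget you plan to allow

estimate =
  ExLLM.calculate_cost(:openai, "gpt-4", %{
    input_tokens: estimated_input,
    output_tokens: max_tokens
  })

IO.puts("Estimated worst-case cost: #{ExLLM.format_cost(estimate.total_cost)}")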

Disabling Cost Tracking

{:ok, response} = ExLLM.chat(:openai, messages,
  track_cost: false
)
# response.cost will be nil

Error Handling and Retries

Automatic Retries

Retries are enabled by default with exponential backoff:

{:ok, response} = ExLLM.chat(:openai, messages,
  retry: true,           # Default: true
  retry_count: 3,        # Default: 3 attempts
  retry_delay: 1000,     # Default: 1 second initial delay
  retry_backoff: :exponential,  # or :linear
  retry_jitter: true     # Add randomness to prevent thundering herd
)

Error Types

case ExLLM.chat(:openai, messages) do
  {:ok, response} ->
    IO.puts(response.content)
    
  {:error, %ExLLM.Error{type: :rate_limit} = error} ->
    IO.puts("Rate limited. Retry after: #{error.retry_after}")
    
  {:error, %ExLLM.Error{type: :invalid_api_key}} ->
    IO.puts("Check your API key configuration")
    
  {:error, %ExLLM.Error{type: :context_length_exceeded}} ->
    IO.puts("Message too long for model")
    
  {:error, %ExLLM.Error{type: :timeout}} ->
    IO.puts("Request timed out")
    
  {:error, error} ->
    IO.inspect(error)
end

Custom Retry Logic

defmodule MyApp.RetryHandler do
  def with_custom_retry(provider, messages, opts \\ []) do
    Enum.reduce_while(1..5, {:error, :max_attempts_reached}, fn attempt, _acc ->
      case ExLLM.chat(provider, messages, Keyword.put(opts, :retry, false)) do
        {:ok, response} ->
          {:halt, {:ok, response}}

        {:error, %{type: :rate_limit} = error} when attempt < 5 ->
          # Honor the provider's retry-after hint when available
          wait_time = Map.get(error, :retry_after) || :timer.seconds(attempt * 10)
          Process.sleep(wait_time)
          {:cont, {:error, error}}

        {:error, _} = error when attempt < 5 ->
          Process.sleep(:timer.seconds(attempt))
          {:cont, error}

        {:error, _} = error ->
          {:halt, error}
      end
    end)
  end
end

Caching

Cache responses to reduce API calls and costs.

Basic Caching

# Enable caching globally
Application.put_env(:ex_llm, :cache_enabled, true)

# Or per request
{:ok, response} = ExLLM.chat(:openai, messages,
  cache: true,
  cache_ttl: :timer.minutes(15)  # Default: 15 minutes
)

# Same request will use cache
{:ok, cached_response} = ExLLM.chat(:openai, messages, cache: true)

Cache Management

# Clear specific cache entry
ExLLM.Cache.delete(cache_key)

# Clear all cache
ExLLM.Cache.clear()

# Get cache stats
stats = ExLLM.Cache.stats()
# => %{size: 42, hits: 100, misses: 20}

Custom Cache Keys

# Cache key is automatically generated from:
# - Provider
# - Messages
# - Relevant options (model, temperature, etc.)

# You can also use manual cache management
cache_key = ExLLM.Cache.generate_cache_key(:openai, messages, options)

Response Caching

Cache real provider responses for offline testing and to reduce costs during development.

ExLLM provides two approaches for response caching:

  1. Unified Cache System (Recommended) - Extends the runtime cache with optional disk persistence
  2. Legacy Response Cache - Standalone response collection system

The unified cache system extends ExLLM's runtime performance cache with optional disk persistence. This provides both speed benefits and testing capabilities from a single system.

Enabling Unified Cache Persistence

# Method 1: Environment variables (temporary)
export EX_LLM_CACHE_PERSIST=true
export EX_LLM_CACHE_DIR="/path/to/cache"  # Optional

# Method 2: Runtime configuration (recommended for tests)
ExLLM.Cache.configure_disk_persistence(true, "/path/to/cache")

# Method 3: Application configuration
config :ex_llm,
  cache_persist_disk: true,
  cache_disk_path: "/tmp/ex_llm_cache"

Automatic Response Collection with Unified Cache

When persistence is enabled, all cached responses are automatically stored to disk:

# Normal caching usage - responses automatically persist to disk when enabled
{:ok, response} = ExLLM.chat(:openai, messages, cache: true)
{:ok, response} = ExLLM.chat(:anthropic, messages, cache: true)

Benefits of Unified Cache System

  • Zero performance impact when persistence is disabled (default)
  • Single configuration controls both runtime cache and disk persistence
  • Natural development workflow - enable during development, disable in production (see the config sketch below)
  • Automatic mock integration - cached responses work seamlessly with Mock adapter
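
One way to get that development-only workflow is to set the application configuration shown in "Enabling Unified Cache Persistence" per environment. A sketch assuming the conventional config/dev.exs and config/prod.exs layout:

# config/dev.exs - persist cached responses to disk while developing
import Config

config :ex_llm,
  cache_persist_disk: true,
  cache_disk_path: "/tmp/ex_llm_cache"

# config/prod.exs - runtime cache only, no disk persistence
import Config

config :ex_llm,
  cache_persist_disk: false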

Legacy Response Cache System

For compatibility, the original response cache system is still available:

Enabling Legacy Response Caching

# Enable response caching via environment variables
export EX_LLM_CACHE_RESPONSES=true
export EX_LLM_CACHE_DIR="/path/to/cache"  # Optional: defaults to /tmp/ex_llm_cache

Automatic Response Collection

When caching is enabled, all provider responses are automatically stored:

# Normal usage - responses are automatically cached
{:ok, response} = ExLLM.chat(:openai, messages)
{:ok, stream} = ExLLM.stream_chat(:anthropic, messages)

Cache Structure

Responses are organized by provider and endpoint:

/tmp/ex_llm_cache/
├── openai/
│   ├── chat.json           # Chat completions
│   └── streaming.json      # Streaming responses
├── anthropic/
│   ├── chat.json           # Claude messages
│   └── streaming.json      # Streaming responses
└── openrouter/
    └── chat.json           # OpenRouter responses

Manual Response Storage

# Store a specific response
ExLLM.ResponseCache.store_response(
  "openai",                    # Provider
  "chat",                      # Endpoint
  %{messages: messages},       # Request data
  %{"choices" => [...]}        # Response data
)

Mock Adapter Integration

Configure the Mock adapter to replay cached responses from any provider:

Using Unified Cache System

With the unified cache system, responses are automatically available for mock testing when disk persistence is enabled:

# 1. Enable disk persistence during development/testing
ExLLM.Cache.configure_disk_persistence(true, "/tmp/ex_llm_cache")

# 2. Use normal caching to collect responses
{:ok, response} = ExLLM.chat(:openai, messages, cache: true)
{:ok, response} = ExLLM.chat(:anthropic, messages, cache: true)

# 3. Configure mock adapter to use cached responses
ExLLM.ResponseCache.configure_mock_provider(:openai)

# 4. Mock calls now return authentic cached responses
{:ok, response} = ExLLM.chat(:mock, messages)
# Returns real OpenAI response structure and content

# 5. Switch to different provider responses
ExLLM.ResponseCache.configure_mock_provider(:anthropic)
{:ok, response} = ExLLM.chat(:mock, messages)
# Now returns real Anthropic response structure

Using Legacy Response Cache

For compatibility with the original caching approach:

# Enable legacy response caching
export EX_LLM_CACHE_RESPONSES=true

# Use cached OpenAI responses for realistic testing
ExLLM.ResponseCache.configure_mock_provider(:openai)

# Now mock calls return authentic OpenAI responses
{:ok, response} = ExLLM.chat(:mock, messages)
# Returns real OpenAI response structure and content

Response Collection for Testing

Collect comprehensive test scenarios:

# Collect responses for common test cases
ExLLM.CachingInterceptor.create_test_collection(:openai)

# Collect specific scenarios
test_cases = [
  {[%{role: "user", content: "Hello"}], []},
  {[%{role: "user", content: "What is 2+2?"}], [max_tokens: 10]},
  {[%{role: "user", content: "Tell me a joke"}], [temperature: 0.8]}
]

ExLLM.CachingInterceptor.collect_test_responses(:anthropic, test_cases)

Cache Management

# List available cached providers
providers = ExLLM.ResponseCache.list_cached_providers()
# => [{"openai", 15}, {"anthropic", 8}]  # {provider, response_count}

# Clear cache for specific provider
ExLLM.ResponseCache.clear_provider_cache("openai")

# Clear all cached responses
ExLLM.ResponseCache.clear_all_cache()

# Get specific cached response
cached = ExLLM.ResponseCache.get_response("openai", "chat", request_data)

Configuration Options

# Environment variables
EX_LLM_CACHE_RESPONSES=true      # Enable/disable caching
EX_LLM_CACHE_DIR="/custom/path"  # Custom cache directory

# Check if caching is enabled
ExLLM.ResponseCache.caching_enabled?()
# => true

# Get current cache directory
ExLLM.ResponseCache.cache_dir()
# => "/tmp/ex_llm_cache"

Use Cases

Development Testing with Unified Cache:

# 1. Enable disk persistence during development
ExLLM.Cache.configure_disk_persistence(true)

# 2. Use normal caching - responses get collected automatically
{:ok, response} = ExLLM.chat(:openai, messages, cache: true)
{:ok, response} = ExLLM.chat(:anthropic, messages, cache: true)

# 3. Use cached responses in tests
ExLLM.ResponseCache.configure_mock_provider(:openai)
# Tests now use real OpenAI response structures

Development Testing with Legacy Cache:

# 1. Collect responses during development
export EX_LLM_CACHE_RESPONSES=true
# Run your app normally - responses get cached

# 2. Use cached responses in tests
ExLLM.ResponseCache.configure_mock_provider(:openai)
# Tests now use real OpenAI response structures

Cost Reduction:

# Unified cache approach - enable persistence temporarily
ExLLM.Cache.configure_disk_persistence(true)

# Cache expensive model responses during development
{:ok, response} = ExLLM.chat(:openai, messages,
  cache: true,
  model: "gpt-4o"  # Expensive model
)
# Response is cached automatically both in memory and disk

# Later testing uses cached response - no API cost
ExLLM.ResponseCache.configure_mock_provider(:openai)
{:ok, same_response} = ExLLM.chat(:mock, messages)

# Disable persistence for production
ExLLM.Cache.configure_disk_persistence(false)

Cross-Provider Testing:

# Test how your app handles different provider response formats
ExLLM.ResponseCache.configure_mock_provider(:openai)
test_openai_format()

ExLLM.ResponseCache.configure_mock_provider(:anthropic)
test_anthropic_format()

ExLLM.ResponseCache.configure_mock_provider(:openrouter)
test_openrouter_format()

Advanced Usage

Streaming Response Caching:

# Streaming responses are automatically cached
{:ok, stream} = ExLLM.stream_chat(:openai, messages)
chunks = Enum.to_list(stream)

# Later, mock can replay the exact same stream
ExLLM.ResponseCache.configure_mock_provider(:openai)
{:ok, cached_stream} = ExLLM.stream_chat(:mock, messages)
# Returns identical streaming chunks

Interceptor Wrapping:

# Manually wrap API calls for caching
{:ok, response} = ExLLM.CachingInterceptor.with_caching(:openai, fn ->
  ExLLM.Adapters.OpenAI.chat(messages)
end)

# Wrap streaming calls
{:ok, stream} = ExLLM.CachingInterceptor.with_streaming_cache(
  :anthropic, 
  messages, 
  options, 
  fn -> ExLLM.Adapters.Anthropic.stream_chat(messages, options) end
)

Model Discovery

Finding Models

# Get model information
{:ok, info} = ExLLM.get_model_info(:openai, "gpt-4o")
IO.inspect(info)
# => %ExLLM.ModelCapabilities.ModelInfo{
#   id: "gpt-4o",
#   context_window: 128000,
#   max_output_tokens: 16384,
#   capabilities: %{
#     vision: %{supported: true},
#     function_calling: %{supported: true},
#     streaming: %{supported: true},
#     ...
#   }
# }

# Check specific capability
if ExLLM.model_supports?(:openai, "gpt-4o", :vision) do
  # Model supports vision
end

Model Recommendations

# Get recommendations based on requirements
recommendations = ExLLM.recommend_models(
  features: [:vision, :function_calling],
  min_context_window: 100_000,
  max_cost_per_1k_tokens: 1.0,
  prefer_local: false,
  limit: 5
)

for {provider, model, info} <- recommendations do
  IO.puts("#{provider}/#{model}")
  IO.puts("  Score: #{info.score}")
  IO.puts("  Context: #{info.context_window}")
  IO.puts("  Cost: $#{info.cost_per_1k}/1k tokens")
end

Finding Models by Feature

# Find all models with specific features
models = ExLLM.find_models_with_features([:vision, :streaming])
# => [
#   {:openai, "gpt-4o"},
#   {:anthropic, "claude-3-opus-20240229"},
#   ...
# ]

# Group models by capability
grouped = ExLLM.models_by_capability(:vision)
# => %{
#   supported: [{:openai, "gpt-4o"}, ...],
#   not_supported: [{:openai, "gpt-3.5-turbo"}, ...]
# }

Comparing Models

comparison = ExLLM.compare_models([
  {:openai, "gpt-4o"},
  {:anthropic, "claude-3-5-sonnet-20241022"},
  {:gemini, "gemini-1.5-pro"}
])

# See feature support across models
IO.inspect(comparison.features[:vision])
# => [
#   %{model: "gpt-4o", supported: true, details: %{...}},
#   %{model: "claude-3-5-sonnet", supported: true, details: %{...}},
#   %{model: "gemini-1.5-pro", supported: true, details: %{...}}
# ]

Provider Capabilities

Capability Normalization

ExLLM automatically normalizes different provider terminologies:

# These all work and refer to the same capability
ExLLM.provider_supports?(:openai, :function_calling)     # => true
ExLLM.provider_supports?(:anthropic, :tool_use)          # => true
ExLLM.provider_supports?(:openai, :tools)                # => true

# Find providers using any terminology
ExLLM.find_providers_with_features([:tool_use])          # Works!
ExLLM.find_providers_with_features([:function_calling])  # Also works!

Provider Discovery

# Get provider capabilities
{:ok, caps} = ExLLM.get_provider_capabilities(:openai)
IO.inspect(caps)
# => %ExLLM.ProviderCapabilities.ProviderInfo{
#   id: :openai,
#   name: "OpenAI",
#   endpoints: [:chat, :embeddings, :images, ...],
#   features: [:streaming, :function_calling, ...],
#   limitations: %{max_file_size: 512MB, ...}
# }

# Find providers by feature
providers = ExLLM.find_providers_with_features([:embeddings, :streaming])
# => [:openai, :gemini, :bedrock, ...]

# Check authentication requirements
if ExLLM.provider_requires_auth?(:openai) do
  # Provider needs API key
end

# Check if provider is local
if ExLLM.is_local_provider?(:ollama) do
  # No API costs
end

Provider Recommendations

recommendations = ExLLM.recommend_providers(%{
  required_features: [:vision, :streaming],
  preferred_features: [:embeddings, :function_calling],
  exclude_providers: [:mock],
  prefer_local: false,
  prefer_free: false
})

for %{provider: provider, score: score, matched_features: features} <- recommendations do
  IO.puts("#{provider}: #{Float.round(score, 2)}")
  IO.puts("  Features: #{Enum.join(features, ", ")}")
end

Comparing Providers

comparison = ExLLM.compare_providers([:openai, :anthropic, :gemini])

# See all features across providers
IO.puts("All features: #{Enum.join(comparison.features, ", ")}")

# Check specific provider capabilities
openai_features = comparison.comparison.openai.features
# => [:streaming, :function_calling, :embeddings, ...]

Logging

ExLLM provides a unified logging system with security features.

Basic Logging

alias ExLLM.Logger

# Log at different levels
Logger.debug("Starting chat request")
Logger.info("Chat completed in #{duration}ms")
Logger.warn("Rate limit approaching")
Logger.error("API request failed", error: reason)

Structured Logging

# Log with metadata
Logger.info("Chat completed",
  provider: :openai,
  model: "gpt-4o",
  tokens: 150,
  duration_ms: 523
)

# Context-aware logging
Logger.with_context(request_id: "abc123") do
  Logger.info("Processing request")
  # All logs in this block include request_id
end

Security Features

# API keys are automatically redacted
Logger.info("Using API key", api_key: "sk-1234567890")
# Logs: "Using API key [api_key: REDACTED]"

# Configure content filtering
Application.put_env(:ex_llm, :log_redact_messages, true)

Configuration

# In config/config.exs
config :ex_llm,
  log_level: :info,                    # Minimum level to log
  log_redact_keys: true,               # Redact API keys
  log_redact_messages: false,          # Don't log message content
  log_include_metadata: true,          # Include structured metadata
  log_filter_components: [:cache]      # Don't log from cache component

See the Logger User Guide for complete documentation.

Testing with Mock Adapter

The mock adapter helps you test LLM integrations without making real API calls.

Basic Mocking

# Start the mock adapter
{:ok, _} = ExLLM.Adapters.Mock.start_link()

# Configure mock response
{:ok, response} = ExLLM.chat(:mock, messages,
  mock_response: "This is a mock response"
)

assert response.content == "This is a mock response"

Dynamic Responses

# Use a handler function
{:ok, response} = ExLLM.chat(:mock, messages,
  mock_handler: fn messages, _options ->
    last_message = List.last(messages)
    %ExLLM.Types.LLMResponse{
      content: "You said: #{last_message.content}",
      model: "mock-model",
      usage: %{input_tokens: 10, output_tokens: 20}
    }
  end
)

Simulating Errors

# Simulate specific errors
{:error, error} = ExLLM.chat(:mock, messages,
  mock_error: %ExLLM.Error{
    type: :rate_limit,
    message: "Rate limit exceeded",
    retry_after: 60
  }
)

Streaming Mocks

{:ok, stream} = ExLLM.stream_chat(:mock, messages,
  mock_chunks: [
    %{content: "Hello"},
    %{content: " world"},
    %{content: "!", finish_reason: "stop"}
  ],
  chunk_delay: 100  # Milliseconds between chunks
)

for chunk <- stream do
  IO.write(chunk.content || "")
end

Request Capture

# Capture requests for assertions
ExLLM.Adapters.Mock.clear_requests()

{:ok, _} = ExLLM.chat(:mock, messages,
  capture_requests: true,
  mock_response: "OK"
)

requests = ExLLM.Adapters.Mock.get_requests()
assert length(requests) == 1
assert List.first(requests).messages == messages

Advanced Topics

Custom Adapters

Create your own adapter for unsupported providers:

defmodule MyApp.CustomAdapter do
  @behaviour ExLLM.Adapter
  
  @impl true
  def configured?(options) do
    # Check if adapter is properly configured
    config = get_config(options)
    config[:api_key] != nil
  end
  
  @impl true
  def default_model() do
    "custom-model-v1"
  end
  
  @impl true
  def chat(messages, options) do
    # Implement chat logic
    # Return {:ok, %ExLLM.Types.LLMResponse{}} or {:error, reason}
  end
  
  @impl true
  def stream_chat(messages, options) do
    # Return {:ok, stream} where stream yields StreamChunk structs
  end
  
  # Optional callbacks
  @impl true
  def list_models(options) do
    # Return {:ok, [%ExLLM.Types.Model{}]}
  end
  
  @impl true
  def embeddings(inputs, options) do
    # Return {:ok, %ExLLM.Types.EmbeddingResponse{}}
  end
end

Stream Processing

Advanced stream handling:

defmodule StreamProcessor do
  def process_with_buffer(provider, messages, opts) do
    {:ok, stream} = ExLLM.stream_chat(provider, messages, opts)
    
    stream
    |> Stream.transform("", fn chunk, buffer ->
      buffer = buffer <> (chunk.content || "")

      # Emit a completed sentence and reset the buffer; otherwise keep buffering
      if String.ends_with?(buffer, ".") do
        {[buffer], ""}
      else
        {[], buffer}
      end
    end)
    |> Stream.each(fn sentence -> IO.puts("\nComplete: #{sentence}") end)
    |> Stream.run()
  end
end

Token Budget Management

Manage token usage across multiple requests:

defmodule TokenBudget do
  use GenServer
  
  def init(budget) do
    {:ok, %{budget: budget, used: 0}}
  end
  
  def track_usage(pid, tokens) do
    GenServer.call(pid, {:track, tokens})
  end
  
  def handle_call({:track, tokens}, _from, state) do
    new_used = state.used + tokens
    if new_used <= state.budget do
      {:reply, :ok, %{state | used: new_used}}
    else
      {:reply, {:error, :budget_exceeded}, state}
    end
  end
end

# Use with ExLLM
{:ok, budget} = GenServer.start_link(TokenBudget, 10_000)

{:ok, response} = ExLLM.chat(:openai, messages)
:ok = TokenBudget.track_usage(budget, response.usage.total_tokens)

Multi-Provider Routing

Route requests to different providers based on criteria:

defmodule ProviderRouter do
  def route_request(messages, requirements) do
    cond do
      # Use local for development
      Mix.env() == :dev ->
        ExLLM.chat(:ollama, messages)
        
      # Use Groq for speed-critical requests
      is_integer(requirements[:max_latency_ms]) and requirements[:max_latency_ms] < 1000 ->
        ExLLM.chat(:groq, messages)
        
      # Use OpenAI for complex reasoning
      requirements[:complexity] == :high ->
        ExLLM.chat(:openai, messages, model: "gpt-4o")
        
      # Default to Anthropic
      true ->
        ExLLM.chat(:anthropic, messages)
    end
  end
end

Batch Processing

Process multiple requests efficiently:

defmodule BatchProcessor do
  def process_batch(items, opts \\ []) do
    # Use Task.async_stream for parallel processing
    items
    |> Task.async_stream(
      fn item ->
        ExLLM.chat(opts[:provider] || :openai, [
          %{role: "user", content: item}
        ])
      end,
      max_concurrency: opts[:concurrency] || 5,
      timeout: opts[:timeout] || 30_000
    )
    |> Enum.map(fn
      {:ok, {:ok, response}} -> {:ok, response}
      {:ok, {:error, reason}} -> {:error, reason}
      {:exit, reason} -> {:error, {:timeout, reason}}
    end)
  end
end

Custom Configuration Management

Implement advanced configuration strategies:

defmodule ConfigManager do
  use GenServer
  
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end
  
  def init(_opts) do
    # Load from multiple sources
    config = %{}
    |> load_from_env()
    |> load_from_file()
    |> load_from_vault()
    |> validate_config()
    
    {:ok, config}
  end
  
  def get_config(provider) do
    GenServer.call(__MODULE__, {:get, provider})
  end

  def handle_call({:get, provider}, _from, config) do
    {:reply, Map.get(config, provider), config}
  end
  
  defp load_from_vault(config) do
    # Fetch from HashiCorp Vault, AWS Secrets Manager, etc.
    Map.merge(config, fetch_secrets())
  end
end

Best Practices

  1. Always handle errors - LLM APIs can fail for various reasons
  2. Use streaming for long responses - Better user experience
  3. Enable caching for repeated queries - Save costs
  4. Monitor token usage - Stay within budget
  5. Use appropriate models - Don't use GPT-4 for simple tasks
  6. Implement fallbacks - Have backup providers ready (see the sketch after this list)
  7. Test with mocks - Don't make API calls in tests
  8. Use context management - Handle long conversations properly
  9. Track costs - Monitor spending across providers
  10. Follow rate limits - Respect provider limitations
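
For item 6, a minimal fallback sketch: try each configured provider in order and return the first success (the module name and provider order are assumptions):

defmodule MyApp.Fallback do
  @providers [:openai, :anthropic, :ollama]

  def chat_with_fallback(messages, opts \\ []) do
    Enum.reduce_while(@providers, {:error, :no_provider_available}, fn provider, acc ->
      with true <- ExLLM.configured?(provider),
           {:ok, response} <- ExLLM.chat(provider, messages, opts) do
        {:halt, {:ok, response}}
      else
        _ -> {:cont, acc}
      end
    end)
  end
end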

Troubleshooting

Common Issues

  1. "API key not found"

    • Check environment variables
    • Verify configuration provider is started
    • Use ExLLM.configured?/1 to debug (see the snippet after this list)
  2. "Context length exceeded"

    • Use context management strategies
    • Choose models with larger context windows
    • Truncate conversation history
  3. "Rate limit exceeded"

    • Enable automatic retry
    • Implement backoff strategies
    • Consider multiple API keys
  4. "Stream interrupted"

    • Enable stream recovery
    • Implement reconnection logic
    • Check network stability
  5. "Invalid response format"

    • Check provider documentation
    • Verify model capabilities
    • Use appropriate options
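
For issue 1, a quick snippet shows at a glance which providers ExLLM considers configured (the provider list is an assumption; use the ones you rely on):

for provider <- [:openai, :anthropic, :gemini, :ollama] do
  IO.puts("#{provider} configured? #{ExLLM.configured?(provider)}")
end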

Debug Mode

Enable debug logging:

# In config
config :ex_llm, :log_level, :debug

# Or at runtime
Logger.configure(level: :debug)

Getting Help

Additional Resources