ExLLM.Adapters.Local (ex_llm v0.1.0)

Local LLM adapter using Bumblebee for on-device inference.

This adapter enables running language models locally using Bumblebee and EXLA/EMLX backends. It supports Apple Silicon (via EMLX), NVIDIA GPUs (via CUDA), and CPU inference.

Configuration

The local adapter doesn't require API keys but may need backend configuration:

# For Apple Silicon (automatic detection)
{:ok, response} = ExLLM.chat(:local, messages)

# With specific model
{:ok, response} = ExLLM.chat(:local, messages, model: "microsoft/phi-2")
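
If the backend is not picked up automatically, it can be set through the standard Nx configuration. The module names below (EMLX.Backend, EXLA.Backend and its :cuda client option) are assumptions based on common Nx/Bumblebee setups rather than something this adapter documents, so adjust them to match your installed dependencies:

# config/config.exs — a minimal sketch, not adapter-specific configuration

# Apple Silicon via EMLX (assumed backend module)
config :nx, default_backend: EMLX.Backend

# NVIDIA GPUs via EXLA/CUDA (assumed client option)
# config :nx, default_backend: {EXLA.Backend, client: :cuda}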

Available Models

  • microsoft/phi-2 - Phi-2 (2.7B) - Default
  • meta-llama/Llama-2-7b-hf - Llama 2 (7B)
  • mistralai/Mistral-7B-v0.1 - Mistral (7B)
  • EleutherAI/gpt-neo-1.3B - GPT-Neo (1.3B)
  • google/flan-t5-base - Flan-T5 Base
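
To check at runtime which models the adapter exposes, a sketch along these lines should work; ExLLM.list_models/1 and the :id field on the returned entries are assumptions about the top-level API in this version:

# List models known to the local adapter (list_models/1 and model.id are assumed)
{:ok, models} = ExLLM.list_models(:local)

for model <- models do
  IO.puts(model.id)
end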

Features

  • On-device inference with no API calls
  • Automatic hardware acceleration detection
  • Support for Apple Silicon, NVIDIA GPUs, and CPUs
  • Model caching for faster subsequent loads
  • Streaming support for real-time generation (see the sketch below)
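
A minimal streaming sketch, assuming a top-level ExLLM.stream_chat/3 function that returns an Elixir stream of chunks with a :content field (both assumptions about this version's API):

# Print tokens as they are generated (stream_chat/3 and chunk.content are assumed)
messages = [%{role: "user", content: "Tell me a short story."}]

{:ok, stream} = ExLLM.stream_chat(:local, messages, model: "microsoft/phi-2")

stream
|> Stream.each(fn chunk -> IO.write(chunk.content) end)
|> Stream.run()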