Vision input lets the model see images — a screenshot, a diagram, a photo — alongside the user's text. ALLM exposes a single multi-modal content shape that works across OpenAI, Anthropic, and Gemini: a list of %TextPart{} and %ImagePart{} values as the message content.

This guide covers the part structs, image-source variants (URL, raw bytes, file path), provider parity, and detail-level controls.

Multi-modal content

Instead of a plain string, the content of a %Message{} can be a list of content parts:

import ALLM, only: [user: 1]

msg = user([
  %ALLM.TextPart{text: "What's in this picture?"},
  %ALLM.ImagePart{source: {:url, "https://example.com/photo.png"}}
])

The list form drops into ALLM.request/2 and ALLM.chat/3 exactly like a string.

ImagePart sources

%ImagePart{} accepts three source shapes:

# Public URL
%ALLM.ImagePart{source: {:url, "https://example.com/photo.png"}}

# Raw bytes (with required mime_type)
bytes = File.read!("/path/to/photo.png")
%ALLM.ImagePart{source: {:bytes, bytes}, mime_type: "image/png"}

# File path (read at adapter time)
%ALLM.ImagePart{source: {:file, "/path/to/photo.png"}}

Each adapter chooses the most efficient wire shape automatically:

  • OpenAI — accepts both URLs and base64 inline data via image_url.
  • Anthropic — accepts URLs and base64 source blocks. For models that do not accept URLs, the adapter falls back to fetching the image and inlining it.
  • Gemini — uploads bytes inline via inlineData. URL inputs are fetched and inlined client-side.
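To make the wire shapes above concrete, here is a minimal sketch of how inline bytes might be encoded for each provider. The module and function names (WireSketch, encode/3) are hypothetical and not part of ALLM. The JSON shapes follow each provider's public API as understood here, so check them against the current provider references before relying on them.

```elixir
defmodule WireSketch do
  @moduledoc "Hypothetical illustration only; not ALLM's adapter code."

  # OpenAI chat completions: inline data travels as a data URL inside image_url.
  def encode(:openai, bytes, mime) do
    %{
      "type" => "image_url",
      "image_url" => %{"url" => "data:#{mime};base64,#{Base.encode64(bytes)}"}
    }
  end

  # Anthropic messages: a base64 source block with an explicit media_type.
  def encode(:anthropic, bytes, mime) do
    %{
      "type" => "image",
      "source" => %{
        "type" => "base64",
        "media_type" => mime,
        "data" => Base.encode64(bytes)
      }
    }
  end

  # Gemini generateContent: an inlineData part.
  def encode(:gemini, bytes, mime) do
    %{"inlineData" => %{"mimeType" => mime, "data" => Base.encode64(bytes)}}
  end
end

# The first four bytes of the PNG signature stand in for a real image here.
part = WireSketch.encode(:openai, <<0x89, "PNG">>, "image/png")
IO.inspect(part["image_url"]["url"])
# prints "data:image/png;base64,iVBORw=="
```

All three shapes carry the same base64 payload; only the envelope differs, which is why a single %ImagePart{} can target every provider.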

Round-trip example

iex> engine = ALLM.Engine.new(
...>   adapter: ALLM.Providers.Fake,
...>   adapter_opts: [script: [{:text, "A red square."}, {:finish, :stop}]]
...> )
iex> msg = ALLM.user([
...>   %ALLM.TextPart{text: "Describe this."},
...>   %ALLM.ImagePart{source: {:url, "https://example.com/red.png"}}
...> ])
iex> {:ok, %ALLM.Response{output_text: text}} =
...>   ALLM.generate(engine, ALLM.request([msg]))
iex> text
"A red square."

Fake doesn't actually look at the image — it just returns the scripted text. With a real provider, the image content reaches the model.

Detail levels (OpenAI)

OpenAI's vision models accept a per-image detail hint:

%ALLM.ImagePart{
  source: {:url, "https://example.com/photo.png"},
  detail: :high  # :auto | :low | :high
}

  • :auto (default) — model decides based on image dimensions.
  • :low — fixed 512×512 representation, cheaper.
  • :high — full resolution, expensive but accurate for fine detail.

Anthropic and Gemini ignore the :detail field — their vision tiers don't expose an equivalent knob. ALLM passes the field along to OpenAI unchanged and silently drops it for the others (with a :debug-level log on the first drop per process).
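The pass-through versus drop behavior can be sketched as a small pure function. DetailSketch and apply_detail/3 are hypothetical names, payloads are modelled as plain maps, and the debug logging mentioned above is left out. Placing "detail" next to "url" inside image_url follows OpenAI's chat completions format as understood here.

```elixir
defmodule DetailSketch do
  @moduledoc "Hypothetical illustration only; not ALLM's adapter code."

  # No hint set: nothing to add for any provider.
  def apply_detail(payload, _provider, nil), do: payload

  # OpenAI: the hint sits next to "url" inside the image_url object.
  def apply_detail(%{"image_url" => image_url} = payload, :openai, detail)
      when detail in [:auto, :low, :high] do
    %{payload | "image_url" => Map.put(image_url, "detail", Atom.to_string(detail))}
  end

  # Anthropic and Gemini: no equivalent knob, so the hint is dropped.
  def apply_detail(payload, provider, _detail) when provider in [:anthropic, :gemini] do
    payload
  end
end

openai_part = %{
  "type" => "image_url",
  "image_url" => %{"url" => "https://example.com/photo.png"}
}

IO.inspect(DetailSketch.apply_detail(openai_part, :openai, :high))
```

Because the hint is dropped rather than rejected, the same %ImagePart{} can be reused when switching providers.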

Provider parity

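| Feature                 | OpenAI                 | Anthropic         | Gemini            |
|-------------------------|------------------------|-------------------|-------------------|
| URL source              | yes                    | yes (some models) | client-side fetch |
| Raw bytes / base64      | yes                    | yes               | yes               |
| File path               | yes (read client-side) | yes               | yes               |
| Image in :system role   | rejected (raises)      | rejected          | rejected          |
| :detail field           | honored                | dropped           | dropped           |
| Per-message multi-image | yes                    | yes               | yes               |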

The "image in :system role" check is a pre-flight validation in every adapter — the wire formats reject it (or behave inconsistently), so ALLM raises a clear ALLM.Error.ValidationError before dispatch instead of letting you debug an opaque 400.
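A pre-flight check of this kind might look like the following sketch. It is illustrative only: PreflightSketch is a hypothetical module, messages are modelled as plain maps and image parts as {:image, source} tuples instead of ALLM's structs, and it returns an error tuple where ALLM raises ALLM.Error.ValidationError.

```elixir
defmodule PreflightSketch do
  @moduledoc "Hypothetical illustration only; not ALLM's validation code."

  # Returns :ok, or an error naming the index of the first offending message.
  def validate(messages) do
    offending =
      Enum.find_index(messages, fn
        # A system message whose content is a part list containing an image.
        %{role: :system, content: parts} when is_list(parts) ->
          Enum.any?(parts, &match?({:image, _}, &1))

        # Plain-string content and every other role pass through.
        _other ->
          false
      end)

    case offending do
      nil -> :ok
      index -> {:error, {:image_in_system_role, index}}
    end
  end
end

bad = [%{role: :system, content: [{:image, {:url, "https://example.com/a.png"}}]}]
IO.inspect(PreflightSketch.validate(bad))
# prints {:error, {:image_in_system_role, 0}}
```

Running the check before dispatch means the failure points at a specific message instead of surfacing as a provider-side 400.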

Common patterns

Screenshot OCR

{:ok, response} = ALLM.generate(engine, ALLM.request([
  ALLM.system("Extract every word visible in the image. Reply with a JSON array of strings."),
  ALLM.user([
    %ALLM.TextPart{text: "Extract text from this screenshot:"},
    %ALLM.ImagePart{source: {:file, "/tmp/screenshot.png"}}
  ])
]))

Multi-image comparison

{:ok, response} = ALLM.generate(engine, ALLM.request([
  ALLM.user([
    %ALLM.TextPart{text: "Which of these two images has more red?"},
    %ALLM.ImagePart{source: {:url, "https://example.com/a.png"}},
    %ALLM.ImagePart{source: {:url, "https://example.com/b.png"}}
  ])
]))

Streaming a vision response

stream_generate/3 works identically with vision input — the request shape is the same, and you get text deltas back as the model incrementally describes the image.

File size and MIME limits

Each provider has its own limits: Anthropic caps base64 image bytes at ~5 MB per image, OpenAI at 20 MB, and Gemini at 7 MB. Adapters validate size pre-flight and raise ALLM.Error.ValidationError with a clear reason when an image exceeds the limit. Compress or resize before sending.

Supported MIME types (intersection across providers): image/png, image/jpeg, image/gif, image/webp. Adapters reject other types pre-flight.
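To check an image before building a request, a client-side guard along these lines may help. It is a sketch under stated assumptions: ImageLimitsSketch and check/3 are hypothetical names, the limits and MIME list are copied from the text above, 1 MB is taken to be 1024 × 1024 bytes, and the limit is compared against the raw byte size rather than the base64-encoded size.

```elixir
defmodule ImageLimitsSketch do
  @moduledoc "Hypothetical illustration only; limits copied from this guide."

  # Assumption: 1 MB = 1024 * 1024 bytes, measured on the raw (unencoded) image.
  @mb 1024 * 1024
  @max_bytes %{anthropic: 5 * @mb, openai: 20 * @mb, gemini: 7 * @mb}
  @mime_types ~w(image/png image/jpeg image/gif image/webp)

  def check(provider, bytes, mime) when is_binary(bytes) do
    limit = Map.fetch!(@max_bytes, provider)

    cond do
      mime not in @mime_types -> {:error, {:unsupported_mime_type, mime}}
      byte_size(bytes) > limit -> {:error, {:too_large, byte_size(bytes), limit}}
      true -> :ok
    end
  end
end

# A 6 MB blob of zero bytes stands in for an oversized image.
six_mb = :binary.copy(<<0>>, 6 * 1024 * 1024)
IO.inspect(ImageLimitsSketch.check(:anthropic, six_mb, "image/png"))
# prints {:error, {:too_large, 6291456, 5242880}}
```

The same blob passes for :openai and :gemini under these numbers, so the effective ceiling for provider-portable code is the smallest limit.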

Where to next