Vision input lets the model see images — a screenshot, a diagram, a photo — alongside the user's text. ALLM exposes a single multi-modal content shape that works across OpenAI, Anthropic, and Gemini: a list of %TextPart{} and %ImagePart{} values as the message content.

This guide covers the part structs, image-source variants (URL, raw bytes, file path), provider parity, and detail-level controls.

Multi-modal content

Instead of a plain string, a %Message{} content can be a list of content parts:

import ALLM, only: [user: 1]

msg = user([
  %ALLM.TextPart{text: "What's in this picture?"},
  %ALLM.ImagePart{source: {:url, "https://example.com/photo.png"}}
])

The list form drops into ALLM.request/2 and ALLM.chat/3 exactly like a string.

ImagePart sources

%ImagePart{} accepts three source shapes:

# Public URL
%ALLM.ImagePart{source: {:url, "https://example.com/photo.png"}}

# Raw bytes (with required mime_type)
bytes = File.read!("/path/to/photo.png")
%ALLM.ImagePart{source: {:bytes, bytes}, mime_type: "image/png"}

# File path (read at adapter time)
%ALLM.ImagePart{source: {:file, "/path/to/photo.png"}}

Each adapter chooses the most efficient wire shape automatically:

  • OpenAI: accepts both URLs and base64 inline data via image_url.
  • Anthropic: accepts URLs and base64 source blocks. For models that do not accept URL sources, the adapter falls back to fetching the image and inlining it when it builds the request.
  • Gemini: sends bytes inline via inlineData. URL inputs are fetched and inlined client-side.
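To make "base64 inline data" concrete, here is a minimal sketch of how raw image bytes can be turned into a data: URI, which is the form OpenAI accepts in an image_url field. The module InlineImageSketch and its function are hypothetical names used for illustration only; this is not ALLM's adapter code, and the adapters may build their wire payloads differently.

```elixir
defmodule InlineImageSketch do
  # Hypothetical helper for illustration; not part of ALLM.
  # Builds a data: URI from raw bytes and an explicit MIME type.
  def to_data_uri(bytes, mime_type) when is_binary(bytes) and is_binary(mime_type) do
    "data:" <> mime_type <> ";base64," <> Base.encode64(bytes)
  end
end

# Tiny two-byte payload so the encoding is easy to follow.
IO.puts(InlineImageSketch.to_data_uri("hi", "image/png"))
# prints data:image/png;base64,aGk=
```

Base64 grows the payload by roughly a third, which is one reason a URL source can be cheaper on the wire when the provider supports it.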

Round-trip example

iex> engine = ALLM.Engine.new(
...>   adapter: ALLM.Providers.Fake,
...>   adapter_opts: [script: [{:text, "A red square."}, {:finish, :stop}]]
...> )
iex> msg = ALLM.user([
...>   %ALLM.TextPart{text: "Describe this."},
...>   %ALLM.ImagePart{source: {:url, "https://example.com/red.png"}}
...> ])
iex> {:ok, %ALLM.Response{output_text: text}} =
...>   ALLM.generate(engine, ALLM.request([msg]))
iex> text
"A red square."

The Fake adapter never inspects the image; it only returns the scripted text. With a real provider, the image content reaches the model.

Detail levels (OpenAI)

OpenAI's vision models accept a per-image detail hint:

%ALLM.ImagePart{
  source: {:url, "https://example.com/photo.png"},
  detail: :high  # :auto | :low | :high
}

  • :auto (default): the model decides based on image dimensions.
  • :low: fixed 512×512 representation, cheaper.
  • :high: full resolution, expensive but accurate for fine detail.

Anthropic and Gemini ignore the :detail field — their vision tiers don't expose an equivalent knob. ALLM passes the field along to OpenAI unchanged and silently drops it for the others (with a :debug-level log on the first drop per process).

Provider parity

| Feature | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| URL source | yes | yes (some models) | client-side fetch |
| Raw bytes / base64 | yes | yes | yes |
| File path | yes (read client-side) | yes | yes |
| Image in :system role | rejected (raises) | rejected | rejected |
| :detail field | honored | dropped | dropped |
| Per-message multi-image | yes | yes | yes |

The "image in :system role" check is a pre-flight validation in every adapter — the wire formats reject it (or behave inconsistently), so ALLM raises a clear ALLM.Error.ValidationError before dispatch instead of letting you debug an opaque 400.

Common patterns

Screenshot OCR

{:ok, response} = ALLM.generate(engine, ALLM.request([
  ALLM.system("Extract every word visible in the image. Reply with a JSON array of strings."),
  ALLM.user([
    %ALLM.TextPart{text: "Extract text from this screenshot:"},
    %ALLM.ImagePart{source: {:file, "/tmp/screenshot.png"}}
  ])
]))

Multi-image comparison

{:ok, response} = ALLM.generate(engine, ALLM.request([
  ALLM.user([
    %ALLM.TextPart{text: "Which of these two images has more red?"},
    %ALLM.ImagePart{source: {:url, "https://example.com/a.png"}},
    %ALLM.ImagePart{source: {:url, "https://example.com/b.png"}}
  ])
]))

Streaming a vision response

stream_generate/3 works identically with vision input — the request shape is the same, and you get text deltas back as the model incrementally describes the image.

data: URI input

When you already have a data:image/<mime>;base64,<payload> string (a browser upload, a FileReader.readAsDataURL result, a clipboard paste), use ALLM.Image.from_data_uri/1 — it parses the URI into a {:base64, encoded} source plus the explicit MIME, ready for %ALLM.ImagePart{}:

img = ALLM.Image.from_data_uri("data:image/png;base64,iVBORw0KGgo...")

ALLM.user([
  %ALLM.TextPart{text: "Describe this:"},
  %ALLM.ImagePart{image: img}
])

from_data_uri/1 only accepts the standard ;base64,<payload> form — URL-encoded payloads (data:<mime>,<urlencoded>) raise ArgumentError. from_url/1 is for http(s):// URLs only; it does NOT accept data: URIs.
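Because from_data_uri/1 raises on anything other than the ;base64, form, it can help to screen untrusted input (for example a browser upload) before calling it. The sketch below is a hypothetical stand-alone check written with the Elixir standard library only; DataUriSketch is an illustrative name, and this is not how ALLM parses the URI internally. It deliberately rejects headers carrying extra parameters such as charset.

```elixir
defmodule DataUriSketch do
  # Hypothetical pre-check for illustration; not ALLM's implementation.
  # Returns {:ok, mime, base64_payload} for data:<mime>;base64,<payload>
  # and :error for everything else.
  def parse("data:" <> rest) do
    with [header, payload] <- String.split(rest, ",", parts: 2),
         [mime, "base64"] <- String.split(header, ";", parts: 2),
         {:ok, _decoded} <- Base.decode64(payload) do
      {:ok, mime, payload}
    else
      _ -> :error
    end
  end

  # Anything that does not start with "data:" (such as an https:// URL).
  def parse(_other), do: :error
end

IO.inspect(DataUriSketch.parse("data:image/png;base64,aGk="))
IO.inspect(DataUriSketch.parse("data:text/plain,hello%20world"))
```

The first call returns an :ok tuple and the second returns :error, so a caller can branch on the result instead of rescuing ArgumentError.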

File size and MIME limits

Each provider has its own limits (Anthropic caps base64 image bytes at ~5MB per image; OpenAI caps at 20MB; Gemini at 7MB). Adapters validate size pre-flight and raise ALLM.Error.ValidationError with a clear reason if you exceed it. Compress or resize before sending.

Supported MIME types (intersection across providers): image/png, image/jpeg, image/gif, image/webp. Adapters reject other types pre-flight.
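If you would rather catch oversized or unsupported files before building a request, a local check along these lines is one option. It is a sketch under stated assumptions: ImagePreflightSketch is a hypothetical module, the limits are the approximate figures quoted above interpreted as mebibytes and applied to the raw byte count, and the MIME type is guessed from the file extension alone. The adapters' own validation remains the source of truth.

```elixir
defmodule ImagePreflightSketch do
  # Hypothetical helper for illustration; not part of ALLM.
  # Assumption: the per-provider limits quoted in this guide, read as MiB.
  @limits %{
    anthropic: 5 * 1024 * 1024,
    openai: 20 * 1024 * 1024,
    gemini: 7 * 1024 * 1024
  }

  # The four MIME types listed as supported across providers.
  @mime_by_ext %{
    ".png" => "image/png",
    ".jpg" => "image/jpeg",
    ".jpeg" => "image/jpeg",
    ".gif" => "image/gif",
    ".webp" => "image/webp"
  }

  # Guess the MIME type from the extension, case-insensitively.
  def mime_for(path) do
    ext = path |> Path.extname() |> String.downcase()
    Map.fetch(@mime_by_ext, ext)
  end

  # Returns {:ok, mime} when both the type and the size look acceptable.
  def check(path, byte_count, provider) do
    limit = Map.fetch!(@limits, provider)

    case mime_for(path) do
      :error -> {:error, :unsupported_mime}
      {:ok, _mime} when byte_count > limit -> {:error, :too_large}
      {:ok, mime} -> {:ok, mime}
    end
  end
end

IO.inspect(ImagePreflightSketch.check("big.jpg", 6_000_000, :anthropic))
IO.inspect(ImagePreflightSketch.check("big.jpg", 6_000_000, :openai))
```

For a file on disk, File.stat!(path).size gives the byte count without reading the whole file into memory.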

Where to next