Vision input lets the model see images — a screenshot, a diagram, a
photo — alongside the user's text. ALLM exposes a single multi-modal
content shape that works across OpenAI, Anthropic, and Gemini: a list
of %TextPart{} and %ImagePart{} values as the message content.
This guide covers the part structs, image-source variants (URL, raw bytes, file path), provider parity, and detail-level controls.
Multi-modal content
Instead of a plain string, a %Message{} content can be a list of
content parts:
import ALLM, only: [user: 1]

msg = user([
  %ALLM.TextPart{text: "What's in this picture?"},
  %ALLM.ImagePart{source: {:url, "https://example.com/photo.png"}}
])

The list form drops into ALLM.request/2 and ALLM.chat/3 exactly
like a string.
ImagePart sources
%ImagePart{} accepts three source shapes:
# Public URL
%ALLM.ImagePart{source: {:url, "https://example.com/photo.png"}}
# Raw bytes (with required mime_type)
bytes = File.read!("/path/to/photo.png")
%ALLM.ImagePart{source: {:bytes, bytes}, mime_type: "image/png"}
# File path (read at adapter time)
%ALLM.ImagePart{source: {:file, "/path/to/photo.png"}}

Each adapter chooses the most efficient wire shape automatically:

- OpenAI accepts both URLs and base64 inline data via image_url.
- Anthropic accepts URLs and base64 source blocks. URL-only models fall back to a wire-side fetch + inline.
- Gemini uploads bytes inline via inlineData. URL inputs are fetched and inlined client-side.
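
To make "base64 inline data" concrete, here is a minimal sketch of turning raw bytes into a data: URL (the RFC 2397 form), which is one common way to inline an image. The module InlineImageSketch and its data_url/2 function are hypothetical names for illustration only; this guide does not show how ALLM's adapters build their wire payloads.

```elixir
defmodule InlineImageSketch do
  @moduledoc "Hypothetical helper for illustration; not part of ALLM."

  # Build a data: URL from raw image bytes and a MIME type.
  # Base.encode64/1 is from the Elixir standard library.
  def data_url(bytes, mime_type) when is_binary(bytes) and is_binary(mime_type) do
    "data:" <> mime_type <> ";base64," <> Base.encode64(bytes)
  end
end

# Three placeholder bytes stand in for real image data.
InlineImageSketch.data_url(<<1, 2, 3>>, "image/png")
# => "data:image/png;base64,AQID"
```

Base64 grows the payload by roughly a third, which matters when you are close to a provider's size limit.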
Round-trip example
iex> engine = ALLM.Engine.new(
...> adapter: ALLM.Providers.Fake,
...> adapter_opts: [script: [{:text, "A red square."}, {:finish, :stop}]]
...> )
iex> msg = ALLM.user([
...> %ALLM.TextPart{text: "Describe this."},
...> %ALLM.ImagePart{source: {:url, "https://example.com/red.png"}}
...> ])
iex> {:ok, %ALLM.Response{output_text: text}} =
...> ALLM.generate(engine, ALLM.request([msg]))
iex> text
"A red square."

Fake doesn't actually look at the image; it just returns the scripted text. With a real provider, the image content reaches the model.
Detail levels (OpenAI)
OpenAI's vision models accept a per-image detail hint:
%ALLM.ImagePart{
  source: {:url, "https://example.com/photo.png"},
  detail: :high  # :auto | :low | :high
}

- :auto (default) lets the model decide based on image dimensions.
- :low uses a fixed 512×512 representation and is cheaper.
- :high uses full resolution; it is expensive but accurate for fine detail.
Anthropic and Gemini ignore the :detail field — their vision tiers
don't expose an equivalent knob. ALLM passes the field along to OpenAI
unchanged and silently drops it for the others (with a :debug-level
log on the first drop per process).
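
The pass-through-or-drop behavior described above can be sketched roughly as follows. DetailSketch, wire_detail/2, and the provider atoms are hypothetical names chosen for this example; the real adapters may be structured quite differently.

```elixir
defmodule DetailSketch do
  @moduledoc "Hypothetical illustration; not ALLM's adapter code."
  require Logger

  # Process-dictionary key used to remember that the drop was already logged.
  @flag {__MODULE__, :detail_drop_logged}

  # OpenAI: pass the hint through unchanged.
  def wire_detail(:openai, detail) when detail in [:auto, :low, :high], do: detail

  # Other providers: drop the hint, logging only on the first drop in this process.
  def wire_detail(provider, _detail) when provider in [:anthropic, :gemini] do
    if Process.get(@flag) == nil do
      Logger.debug("dropping :detail for #{inspect(provider)}, no equivalent setting")
      Process.put(@flag, true)
    end

    nil
  end
end

DetailSketch.wire_detail(:openai, :high)
DetailSketch.wire_detail(:gemini, :high)
```

Because the flag lives in the process dictionary, the debug message appears at most once per process in this sketch.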
Provider parity
| Feature | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| URL source | yes | yes (some models) | client-side fetch |
| Raw bytes / base64 | yes | yes | yes |
| File path | yes (read client-side) | yes | yes |
| Image in :system role | rejected (raises) | rejected | rejected |
| :detail field | honored | dropped | dropped |
| Per-message multi-image | yes | yes | yes |
The "image in :system role" check is a pre-flight validation in every
adapter — the wire formats reject it (or behave inconsistently), so
ALLM raises a clear ALLM.Error.ValidationError before dispatch
instead of letting you debug an opaque 400.
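
As a rough illustration of what such a pre-flight check involves, the sketch below scans a message list for image parts in a :system message. It uses plain maps with assumed :role, :content, and :type keys rather than ALLM's structs, and it returns a tagged tuple instead of raising ALLM.Error.ValidationError; SystemImageCheckSketch is a hypothetical name.

```elixir
defmodule SystemImageCheckSketch do
  @moduledoc "Hypothetical illustration using plain maps, not ALLM structs."

  # Returns :ok, or an error tuple if any :system message carries an image part.
  def validate(messages) when is_list(messages) do
    if Enum.any?(messages, &system_image?/1) do
      {:error, :image_in_system_role}
    else
      :ok
    end
  end

  # A message is offending when its role is :system and its content is a
  # list containing at least one part tagged as an image.
  defp system_image?(%{role: :system, content: parts}) when is_list(parts) do
    Enum.any?(parts, &match?(%{type: :image}, &1))
  end

  defp system_image?(_message), do: false
end

SystemImageCheckSketch.validate([
  %{role: :system, content: "Be concise."},
  %{role: :user, content: [%{type: :text}, %{type: :image}]}
])
```

Running the check before dispatch means the failure surfaces locally with a specific reason rather than as a provider-side error.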
Common patterns
Screenshot OCR
{:ok, response} = ALLM.generate(engine, ALLM.request([
  ALLM.system("Extract every word visible in the image. Reply with a JSON array of strings."),
  ALLM.user([
    %ALLM.TextPart{text: "Extract text from this screenshot:"},
    %ALLM.ImagePart{source: {:file, "/tmp/screenshot.png"}}
  ])
]))

Multi-image comparison
{:ok, response} = ALLM.generate(engine, ALLM.request([
  ALLM.user([
    %ALLM.TextPart{text: "Which of these two images has more red?"},
    %ALLM.ImagePart{source: {:url, "https://example.com/a.png"}},
    %ALLM.ImagePart{source: {:url, "https://example.com/b.png"}}
  ])
]))

Streaming a vision response
stream_generate/3 works identically with vision input — the request
shape is the same, and you get text deltas back as the model
incrementally describes the image.
File size and MIME limits
Each provider has its own limits (Anthropic caps base64 image bytes at
~5MB per image; OpenAI caps at 20MB; Gemini at 7MB). Adapters validate
size pre-flight and raise ALLM.Error.ValidationError with a clear
reason if you exceed it. Compress or resize before sending.
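
If you want to catch oversized images before they reach an adapter, a check along these lines is one option. The limits are the figures quoted above; treating 1 MB as 1024 × 1024 bytes and measuring the raw (not base64-encoded) size are assumptions of this sketch, and SizeLimitSketch is a hypothetical module, not an ALLM API.

```elixir
defmodule SizeLimitSketch do
  @moduledoc "Hypothetical pre-check; limits taken from this guide's prose."

  # Assumption: 1 MB = 1024 * 1024 bytes.
  @mb 1024 * 1024
  @limits %{anthropic: 5 * @mb, openai: 20 * @mb, gemini: 7 * @mb}

  # Compare the raw byte size against the provider's limit.
  def check(provider, bytes) when is_binary(bytes) do
    limit = Map.fetch!(@limits, provider)
    size = byte_size(bytes)

    if size <= limit do
      :ok
    else
      {:error, {:too_large, size, limit}}
    end
  end
end

SizeLimitSketch.check(:openai, <<0, 1, 2>>)
```

If a provider measures the encoded payload rather than the raw bytes, remember that base64 adds about a third, so leave headroom.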
Supported MIME types (intersection across providers): image/png,
image/jpeg, image/gif, image/webp. Adapters reject other types
pre-flight.
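
When you hold raw bytes and are unsure of the type, the four supported formats can be told apart by their leading magic bytes. The sketch below is a standalone illustration; MimeSniffSketch is a hypothetical module, and nothing here implies that ALLM sniffs content itself (the :bytes source takes an explicit mime_type).

```elixir
defmodule MimeSniffSketch do
  @moduledoc "Hypothetical helper; detects image type from file signatures."

  # PNG: fixed 8-byte signature.
  def sniff(<<0x89, "PNG", 0x0D, 0x0A, 0x1A, 0x0A, _::binary>>), do: {:ok, "image/png"}
  # JPEG: starts with the SOI marker followed by 0xFF.
  def sniff(<<0xFF, 0xD8, 0xFF, _::binary>>), do: {:ok, "image/jpeg"}
  # GIF: either of the two version headers.
  def sniff(<<"GIF87a", _::binary>>), do: {:ok, "image/gif"}
  def sniff(<<"GIF89a", _::binary>>), do: {:ok, "image/gif"}
  # WebP: RIFF container, 4-byte size field, then the "WEBP" tag.
  def sniff(<<"RIFF", _size::32, "WEBP", _::binary>>), do: {:ok, "image/webp"}
  # Anything else is outside the supported set.
  def sniff(_other), do: {:error, :unsupported_mime_type}
end

MimeSniffSketch.sniff(<<"GIF89a", 0, 0>>)
# => {:ok, "image/gif"}
```

A result such as {:ok, "image/png"} can then be passed as the mime_type alongside a {:bytes, bytes} source.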
Where to next
- image_generation.md covers the parallel :image_adapter slot for generating new images.
- examples/12_vision_input.exs is a runnable smoke test against any of the three providers.
- The ALLM.Providers.OpenAI, ALLM.Providers.Anthropic, and ALLM.Providers.Gemini module docs cover the per-provider quirks.