Vision input lets the model see images — a screenshot, a diagram, a
photo — alongside the user's text. ALLM exposes a single multi-modal
content shape that works across OpenAI, Anthropic, and Gemini: a list
of %TextPart{} and %ImagePart{} values as the message content.
This guide covers the part structs, image-source variants (URL, raw bytes, file path), provider parity, and detail-level controls.
Multi-modal content
Instead of a plain string, a %Message{} content can be a list of
content parts:
import ALLM, only: [user: 1]
msg = user([
  %ALLM.TextPart{text: "What's in this picture?"},
  %ALLM.ImagePart{source: {:url, "https://example.com/photo.png"}}
])
The list form drops into ALLM.request/2 and ALLM.chat/3 exactly
like a string.
ImagePart sources
%ImagePart{} accepts three source shapes:
# Public URL
%ALLM.ImagePart{source: {:url, "https://example.com/photo.png"}}
# Raw bytes (with required mime_type)
bytes = File.read!("/path/to/photo.png")
%ALLM.ImagePart{source: {:bytes, bytes}, mime_type: "image/png"}
# File path (read at adapter time)
%ALLM.ImagePart{source: {:file, "/path/to/photo.png"}}
Each adapter chooses the most efficient wire shape automatically:
- OpenAI: accepts both URLs and base64 inline data via image_url.
- Anthropic: accepts URLs and base64 source blocks. URL-only models fall back to a wire-side fetch + inline.
- Gemini: uploads bytes inline via inlineData. URL inputs are fetched and inlined client-side.
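The inline path in the list above comes down to base64-encoding the bytes and pairing them with a MIME type. The following is a minimal sketch using only the standard library. The module InlineSketch and its function names are hypothetical and not part of ALLM, and the map shapes follow the providers' public wire formats as commonly documented, not ALLM's internal encoding.

```elixir
defmodule InlineSketch do
  # Hypothetical helper, not part of ALLM: shows how raw image bytes could be
  # wrapped in each provider's inline image shape.

  # OpenAI: a data: URL carried inside an image_url part.
  def openai(bytes, mime) when is_binary(bytes) do
    %{type: "image_url", image_url: %{url: "data:#{mime};base64," <> Base.encode64(bytes)}}
  end

  # Anthropic: a base64 source block with an explicit media type.
  def anthropic(bytes, mime) when is_binary(bytes) do
    %{type: "image", source: %{type: "base64", media_type: mime, data: Base.encode64(bytes)}}
  end

  # Gemini: an inlineData part with mimeType and base64 data.
  def gemini(bytes, mime) when is_binary(bytes) do
    %{inlineData: %{mimeType: mime, data: Base.encode64(bytes)}}
  end
end
```

With a {:file, path} source, an adapter would presumably call File.read!/1 first and then take the same route as {:bytes, bytes}.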
Round-trip example
iex> engine = ALLM.Engine.new(
...>   adapter: ALLM.Providers.Fake,
...>   adapter_opts: [script: [{:text, "A red square."}, {:finish, :stop}]]
...> )
iex> msg = ALLM.user([
...>   %ALLM.TextPart{text: "Describe this."},
...>   %ALLM.ImagePart{source: {:url, "https://example.com/red.png"}}
...> ])
iex> {:ok, %ALLM.Response{output_text: text}} =
...>   ALLM.generate(engine, ALLM.request([msg]))
iex> text
"A red square."
Fake doesn't actually look at the image; it just returns the scripted text. With a real provider, the image content reaches the model.
Detail levels (OpenAI)
OpenAI's vision models accept a per-image detail hint:
%ALLM.ImagePart{
  source: {:url, "https://example.com/photo.png"},
  detail: :high  # :auto | :low | :high
}
- :auto (default): model decides based on image dimensions.
- :low: fixed 512×512 representation, cheaper.
- :high: full resolution, expensive but accurate for fine detail.
Anthropic and Gemini ignore the :detail field — their vision tiers
don't expose an equivalent knob. ALLM passes the field along to OpenAI
unchanged and silently drops it for the others (with a :debug-level
log on the first drop per process).
Provider parity
| Feature | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| URL source | yes | yes (some models) | client-side fetch |
| Raw bytes / base64 | yes | yes | yes |
| File path | yes (read client-side) | yes | yes |
| Image in :system role | rejected (raises) | rejected | rejected |
| :detail field | honored | dropped | dropped |
| Per-message multi-image | yes | yes | yes |
The "image in :system role" check is a pre-flight validation in every
adapter — the wire formats reject it (or behave inconsistently), so
ALLM raises a clear ALLM.Error.ValidationError before dispatch
instead of letting you debug an opaque 400.
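The shape of that pre-flight check can be sketched as follows. This is an illustration only: PreflightSketch is a hypothetical module, messages are plain maps, not ALLM structs, and it raises ArgumentError as a stand-in for ALLM.Error.ValidationError so the snippet runs without the library.

```elixir
defmodule PreflightSketch do
  # Hypothetical stand-in for an adapter's pre-flight validation.
  # Messages are plain maps here; ALLM's real structs and error type are not used.

  def validate!(messages) when is_list(messages) do
    Enum.each(messages, fn
      # Only list-form system content can contain image parts.
      %{role: :system, content: parts} when is_list(parts) ->
        if Enum.any?(parts, &match?(%{type: :image}, &1)) do
          raise ArgumentError, "image parts are not allowed in :system messages"
        end

      # String content and non-system roles pass through untouched.
      _other ->
        :ok
    end)

    :ok
  end
end
```

Running the check before any HTTP call means the failure surfaces at the call site with a readable message, which is the point the paragraph above makes.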
Common patterns
Screenshot OCR
{:ok, response} = ALLM.generate(engine, ALLM.request([
  ALLM.system("Extract every word visible in the image. Reply with a JSON array of strings."),
  ALLM.user([
    %ALLM.TextPart{text: "Extract text from this screenshot:"},
    %ALLM.ImagePart{source: {:file, "/tmp/screenshot.png"}}
  ])
]))
Multi-image comparison
{:ok, response} = ALLM.generate(engine, ALLM.request([
  ALLM.user([
    %ALLM.TextPart{text: "Which of these two images has more red?"},
    %ALLM.ImagePart{source: {:url, "https://example.com/a.png"}},
    %ALLM.ImagePart{source: {:url, "https://example.com/b.png"}}
  ])
]))
Streaming a vision response
stream_generate/3 works identically with vision input — the request
shape is the same, and you get text deltas back as the model
incrementally describes the image.
data: URI input
When you already have a data:image/<mime>;base64,<payload> string (a
browser upload, a FileReader.readAsDataURL result, a clipboard paste),
use ALLM.Image.from_data_uri/1 — it parses the URI into a
{:base64, encoded} source plus the explicit MIME, ready for
%ALLM.ImagePart{}:
img = ALLM.Image.from_data_uri("data:image/png;base64,iVBORw0KGgo...")
ALLM.user([
  %ALLM.TextPart{text: "Describe this:"},
  %ALLM.ImagePart{image: img}
])
from_data_uri/1 only accepts the standard ;base64,<payload> form.
URL-encoded payloads (data:<mime>,<urlencoded>) raise
ArgumentError. from_url/1 is for http(s):// URLs only; it does
NOT accept data: URIs.
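To make the accepted form concrete, here is a rough sketch of the kind of parsing involved, written against the standard library only. DataUriSketch is a hypothetical module, not the implementation of ALLM.Image.from_data_uri/1. It returns tagged tuples where the real function is described as raising, and it assumes the metadata is exactly <mime>;base64 with no extra parameters.

```elixir
defmodule DataUriSketch do
  # Hypothetical parser, not ALLM.Image.from_data_uri/1 itself.
  # Accepts only "data:<mime>;base64,<payload>".

  def parse("data:" <> rest) do
    with [meta, payload] <- String.split(rest, ",", parts: 2),
         [mime, "base64"] <- String.split(meta, ";"),
         # Decode once to confirm the payload really is valid base64.
         {:ok, _bytes} <- Base.decode64(payload) do
      {:ok, mime, {:base64, payload}}
    else
      _ -> {:error, :unsupported_data_uri}
    end
  end

  # Anything that does not start with "data:" (for example an https URL).
  def parse(_other), do: {:error, :not_a_data_uri}
end
```

A URL-encoded payload such as data:text/plain,hello%20world fails the [mime, "base64"] match and is rejected, which mirrors the restriction described above.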
File size and MIME limits
Each provider has its own limits (Anthropic caps base64 image bytes at
~5MB per image; OpenAI caps at 20MB; Gemini at 7MB). Adapters validate
size pre-flight and raise ALLM.Error.ValidationError with a clear
reason if you exceed it. Compress or resize before sending.
Supported MIME types (intersection across providers): image/png,
image/jpeg, image/gif, image/webp. Adapters reject other types
pre-flight.
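If you want to fail early in your own code before handing bytes to an adapter, a check along these lines is one option. ImagePreflightSketch is a hypothetical helper, not an ALLM API. Its caps copy the approximate figures quoted above, assume 1 MB means 1_000_000 bytes, and are compared against the raw byte size, so verify them against current provider documentation before relying on them.

```elixir
defmodule ImagePreflightSketch do
  # Hypothetical pre-flight helper, not part of ALLM.
  # Caps are the approximate per-image figures from the text (assumed decimal MB).
  @limits %{anthropic: 5_000_000, openai: 20_000_000, gemini: 7_000_000}

  # Identify the format from well-known magic bytes at the start of the file.
  def sniff(<<0x89, "PNG", 0x0D, 0x0A, 0x1A, 0x0A, _::binary>>), do: {:ok, "image/png"}
  def sniff(<<0xFF, 0xD8, 0xFF, _::binary>>), do: {:ok, "image/jpeg"}
  def sniff(<<"GIF87a", _::binary>>), do: {:ok, "image/gif"}
  def sniff(<<"GIF89a", _::binary>>), do: {:ok, "image/gif"}
  def sniff(<<"RIFF", _size::32, "WEBP", _::binary>>), do: {:ok, "image/webp"}
  def sniff(_other), do: {:error, :unsupported_mime}

  # Size check first, then MIME detection. Raises KeyError on an unknown provider.
  def check(provider, bytes) when is_binary(bytes) do
    limit = Map.fetch!(@limits, provider)

    if byte_size(bytes) > limit do
      {:error, :too_large}
    else
      sniff(bytes)
    end
  end
end
```

A call such as ImagePreflightSketch.check(:anthropic, File.read!(path)) would then return either {:ok, mime}, which can be passed as mime_type alongside a {:bytes, bytes} source, or an error tuple to handle before any request is built.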
Where to next
- image_generation.md: the parallel :image_adapter slot for generating new images.
- examples/12_vision_input.exs: runnable smoke test against any of the three providers.
- The ALLM.Providers.OpenAI, ALLM.Providers.Anthropic, and ALLM.Providers.Gemini module docs cover the per-provider quirks.