Multimodal (Vision)

Copy Markdown View Source

ExAthena supports multimodal messages — text plus images — through the ExAthena.Messages.ContentPart struct. Two entry points are available: the ergonomic images: shorthand for quick one-liners, and full ContentPart construction for complex payloads.

Quick start: images: shorthand

Pass images: [...] to ExAthena.query/2, ExAthena.stream/3, or ExAthena.run/2 alongside a prompt string:

png = File.read!("diagram.png")

{:ok, response} =
  ExAthena.query("Describe what you see",
    provider: :ollama,
    model: "llava",
    images: [%{data: png, media_type: "image/png"}]
  )

IO.puts(response.text)

Each entry in the images: list may be one of:

ShapeDescription
%{data: binary(), media_type: String.t()}Inline image bytes
%{data: binary()}Inline image, media type defaults to "image/png"
%{url: String.t()}Remote image URL

ExAthena builds a multimodal user message with the text part first, followed by the image parts. When no prompt is given, the images are merged into the last user message in :messages, or appended as a new user message.

Full ContentPart approach

For finer control — mixing text, images, and files in arbitrary order — build ContentPart structs directly and pass them as the message content:

alias ExAthena.Messages
alias ExAthena.Messages.ContentPart

png = File.read!("chart.png")
pdf = File.read!("report.pdf")

parts = [
  ContentPart.text("Summarize the chart and cross-reference the report:"),
  ContentPart.image(png, "image/png"),
  ContentPart.file(pdf, "report.pdf", "application/pdf")
]

{:ok, response} =
  ExAthena.query(nil,
    provider: :claude,
    model: "claude-opus-4-7",
    messages: [Messages.user(parts)]
  )

ContentPart factory functions

FunctionTypeFields
ContentPart.text(content):texttext
ContentPart.image(data, media_type \\ "image/png"):imagedata, media_type
ContentPart.image_url(url):image_urlurl
ContentPart.file(data, filename, media_type \\ "application/octet-stream"):filedata, filename, media_type

Provider examples

Ollama (llava, qwen2-vl)

# config/config.exs
config :ex_athena, :ollama,
  base_url: "http://localhost:11434",
  model: "llava"

# usage
png = File.read!("screenshot.png")

{:ok, response} =
  ExAthena.query("What is shown in this screenshot?",
    provider: :ollama,
    model: "llava",
    images: [%{data: png, media_type: "image/png"}]
  )

Pull a vision-capable model first:

ollama pull llava
# or
ollama pull qwen2-vl

Ollama vision support is model-dependent. Non-vision models will return an error or silently ignore image parts.

OpenAI-compatible (gpt-4o)

{:ok, response} =
  ExAthena.query("What's in this image?",
    provider: :openai_compatible,
    model: "gpt-4o",
    images: [%{url: "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"}]
  )

For inline images with the OpenAI API:

png = File.read!("photo.jpg")

{:ok, response} =
  ExAthena.query("Describe the photo",
    provider: :openai_compatible,
    model: "gpt-4o-mini",
    images: [%{data: png, media_type: "image/jpeg"}]
  )

Anthropic Claude

png = File.read!("diagram.png")

{:ok, response} =
  ExAthena.query("Explain this architecture diagram",
    provider: :claude,
    model: "claude-opus-4-7",
    images: [%{data: png, media_type: "image/png"}]
  )

Claude supports PNG, JPEG, GIF, and WebP. Maximum image size is 5 MB per image.

Google Gemini

png = File.read!("chart.png")

{:ok, response} =
  ExAthena.query("What trend does this chart show?",
    provider: :gemini,
    model: "gemini-2.5-flash",
    images: [%{data: png, media_type: "image/png"}]
  )

Using images: in the agent loop

ExAthena.run/2 forwards images: to Request.new/2 so the first turn has the image attached:

png = File.read!("codebase_diagram.png")

{:ok, result} =
  ExAthena.run("Implement the architecture shown in this diagram",
    provider: :claude,
    model: "claude-opus-4-7",
    cwd: "/path/to/project",
    images: [%{data: png, media_type: "image/png"}]
  )

Image format notes

  • Inline images are sent as base64-encoded data to the provider. The req_llm adapter handles encoding transparently.
  • Image URLs (%{url: ...}) are forwarded as-is. The provider fetches the image at inference time. Not all providers support URL references — prefer inline for maximum compatibility.
  • media_type should match the actual image format ("image/png", "image/jpeg", "image/gif", "image/webp"). Some providers are lenient; others require an accurate MIME type.
  • Multiple images in one message are supported by all major providers (Claude, OpenAI, Gemini). Ollama support is model-dependent.

Vision support by provider

ProviderVision supportNotes
:ollamaModel-dependentllava, qwen2-vl, llava-phi3, bakllava
:openai_compatiblegpt-4o, gpt-4o-miniURL + inline; other OAI-compat endpoints vary
:claude✅ Any claude-3+ modelPNG, JPEG, GIF, WebP; max 5 MB per image
:gemini✅ Any gemini-1.5+ modelInline + URL; very generous size limits