LLM.stream/3 opens the stream, collects chunks, executes tool calls, and assembles the final LLM.Response automatically, returning it as {:ok, response}.

Basic usage

{:ok, response} = LLM.stream("Tell me a story",
  provider: :openai,
  model: "gpt-4"
)

response.message.content
#=> "Once upon a time..."

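The first argument is the prompt, the second is the options keyword list, and the third is an optional callbacks keyword list. When callbacks are given, wrap the options in explicit brackets, as in the examples below, so that Elixir does not merge the two keyword lists into one.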

Callbacks

The :on_chunk callback is invoked with each chunk as it arrives:

{:ok, response} = LLM.stream("Write a poem",
  [provider: :openai, model: "gpt-4"],
  on_chunk: fn
    %LLM.Stream.Chunk{text: text} -> IO.write(text)
    %LLM.Stream.Thinking{text: text} -> IO.puts("\n[thinking] #{text}")
    %LLM.Stream.Stop{reason: reason} -> IO.puts("\n[stopped: #{reason}]")
    _ -> :ok
  end
)

on_message

The :on_message callback fires once per completed LLM.Message in conversation order — every assistant turn and every tool result:

{:ok, response} = LLM.stream("What's the weather?",
  [provider: :openai, model: "gpt-4", tools: [WeatherTool]],
  on_message: fn
    %LLM.Message{role: :assistant} = msg -> IO.puts("assistant: #{msg.content}")
    %LLM.Message{role: :tool} = msg -> IO.puts("tool #{msg.name}: #{msg.content}")
  end
)

The full reconstructed conversation is also available afterwards as response.messages.

Avoid sending messages to the calling process's mailbox from inside the callback, as the stream's receive loop will consume and discard unknown messages — use an Agent or ETS table to accumulate instead.
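
As a minimal sketch of the Agent approach mentioned above, the following collects messages outside the caller's mailbox. MessageCollector is a hypothetical helper, not part of the library, and the callback is driven by hand with plain maps standing in for LLM.Message so the example runs on its own:

```elixir
defmodule MessageCollector do
  # Hypothetical helper (not part of the library): keeps messages in an Agent
  # so nothing is sent to the calling process's mailbox.

  def start, do: Agent.start_link(fn -> [] end)

  # Returns a one-argument function suitable for use as an :on_message callback.
  def callback(agent) do
    fn message -> Agent.update(agent, fn acc -> [message | acc] end) end
  end

  # Messages are prepended, so reverse to restore arrival order.
  def messages(agent), do: agent |> Agent.get(& &1) |> Enum.reverse()
end

{:ok, agent} = MessageCollector.start()
on_message = MessageCollector.callback(agent)

# In real use this would be passed as: LLM.stream(prompt, opts, on_message: on_message)
# Here plain maps stand in for LLM.Message structs.
on_message.(%{role: :assistant, content: "Checking the weather"})
on_message.(%{role: :tool, content: "18C, cloudy"})

MessageCollector.messages(agent)
```

An ETS table would work the same way; the Agent is used here only because it needs no table setup.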

Controlling tool execution

By default, stream/3 executes tool calls automatically and loops until the model stops requesting them. Disable this with:

{:ok, response} = LLM.stream("Hello",
  provider: :openai,
  model: "gpt-4",
  tools: [MyTool],
  auto_tools: false
)

Limit the number of tool call rounds:

{:ok, response} = LLM.stream("Hello",
  provider: :openai,
  model: "gpt-4",
  tools: [MyTool],
  max_rounds: 3
)

stream!

The bang variant raises on error:

response = LLM.stream!("Hello", provider: :openai, model: "gpt-4")

Chunk types

Streaming responses produce a mix of chunk types:

| Chunk | Module | Fields | Description |
|---|---|---|---|
| Text | LLM.Stream.Chunk | text, index | A piece of generated text |
| Thinking | LLM.Stream.Thinking | text, signature | Reasoning/thinking content |
| Tool call | LLM.Stream.ToolCall | id, name, arguments, index, complete | A tool invocation |
| Stop | LLM.Stream.Stop | reason, usage | Stream ended |
| Error | LLM.Stream.Error | message, code | An error occurred |

Text chunks

%LLM.Stream.Chunk{text: "Hello", index: 0}

Thinking chunks

Providers that support reasoning (Anthropic, Gemini, OpenAI o-series models) emit thinking chunks:

%LLM.Stream.Thinking{text: "Let me consider...", signature: nil}

Anthropic thinking blocks include a signature field for verification:

%LLM.Stream.Thinking{
  text: "The user is asking about...",
  signature: "EqQBCgIYAhIM..."
}

Tool call chunks

Tool calls arrive as chunks with complete: true when the full call is available:

%LLM.Stream.ToolCall{
  id: "call_abc123",
  name: "get_weather",
  arguments: %{"location" => "San Francisco"},
  index: 0,
  complete: true
}

During streaming, partial tool call arguments may arrive in multiple chunks. The library accumulates them internally and only emits a chunk when complete: true.
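
To make the accumulation idea concrete, here is a rough sketch of buffering argument fragments per tool call index. It is not the library's implementation: ToolCallBuffer and the fragment shape (index, id, name, fragment) are invented for illustration, and decoding the final JSON string into a map is left out.

```elixir
defmodule ToolCallBuffer do
  # Illustrative only: the fragment shape (%{index:, id:, name:, fragment:})
  # is an assumption, not the library's internal representation.

  # Append one argument fragment to the buffer entry for its index.
  def push(buffer, %{index: index, fragment: fragment} = delta) do
    Map.update(
      buffer,
      index,
      %{id: delta[:id], name: delta[:name], json: fragment},
      fn call -> %{call | json: call.json <> fragment} end
    )
  end

  # Return the accumulated calls ordered by index.
  def calls(buffer) do
    buffer
    |> Enum.sort_by(fn {index, _call} -> index end)
    |> Enum.map(fn {_index, call} -> call end)
  end
end

fragments = [
  %{index: 0, id: "call_abc123", name: "get_weather", fragment: "{\"location\":"},
  %{index: 0, fragment: " \"San Francisco\"}"}
]

fragments
|> Enum.reduce(%{}, &ToolCallBuffer.push(&2, &1))
|> ToolCallBuffer.calls()
```

Only once every fragment for an index has arrived can the accumulated string be parsed, which is why a consumer sees a single chunk per call.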

Stop chunks

%LLM.Stream.Stop{reason: :stop, usage: %LLM.Usage{input_tokens: 10, output_tokens: 50}}

Common stop reasons:

  • :stop — natural end of response
  • :length — hit max_tokens limit
  • :tool_calls — model wants to call tools (stream continues)
  • :content_filter — content was filtered

Structured output

When schema: is set, the model returns JSON matching the given schema. Use stream/3 to get streaming with automatic parsing:

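{:ok, response} = LLM.stream("Extract name and age",
  [provider: :openai, model: "gpt-4o", schema: %{name: "person", schema: schema}],
  # Chunks arrive as structs; write only the text of text chunks.
  on_chunk: fn
    %LLM.Stream.Chunk{text: text} -> IO.write(text)
    _ -> :ok
  end
)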

response.parsed  #=> %{"name" => "Alice", "age" => 30}

Error handling

case LLM.stream("Hello", provider: :openai, model: "gpt-4") do
  {:ok, response} -> handle_response(response)
  {:error, reason} -> handle_error(reason)
end

Common errors:

  • :timeout — no data received within the timeout period
  • %{status: 401, body: ...} — authentication failure
  • %{status: 429, body: ...} — rate limit exceeded
  • %{status: 500, body: ...} — server error
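
These error shapes lend themselves to a retry policy. The sketch below is one possible approach, not something the library provides: Retry, retryable?/1, and with_backoff/3 are hypothetical names, and treating timeouts, 429, and 5xx as retryable is an assumption about what suits most applications.

```elixir
defmodule Retry do
  # Hypothetical helper, not part of the library.

  # Timeouts, rate limits, and server errors are usually worth retrying;
  # authentication failures are not.
  def retryable?(:timeout), do: true
  def retryable?(%{status: 429}), do: true
  def retryable?(%{status: status}) when status in 500..599, do: true
  def retryable?(_), do: false

  # Calls `fun` (zero-arity, returning {:ok, _} or {:error, _}) up to
  # `attempts` times, doubling the delay after each retryable failure.
  def with_backoff(fun, attempts \\ 3, delay_ms \\ 500) do
    case fun.() do
      {:error, reason} = error ->
        if attempts > 1 and retryable?(reason) do
          Process.sleep(delay_ms)
          with_backoff(fun, attempts - 1, delay_ms * 2)
        else
          error
        end

      other ->
        other
    end
  end
end
```

A call would then be wrapped as Retry.with_backoff(fn -> LLM.stream("Hello", provider: :openai, model: "gpt-4") end). Note that retrying a streamed call replays any on_chunk output, so callbacks with side effects may need to tolerate duplicates.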

Manual chunk processing (advanced)

For fine-grained control, use LLM.Stream.start/2 and LLM.Stream.next/1:

{:ok, stream} = LLM.Stream.start(context, opts)

case LLM.Stream.next(stream) do
  {:ok, chunks, stream} ->
    Enum.each(chunks, fn
      %LLM.Stream.Chunk{text: text} -> IO.write(text)
      %LLM.Stream.ToolCall{name: name, arguments: args} ->
        IO.puts("\nTool call: #{name}(#{inspect(args)})")
      _ -> :ok
    end)

  {:halt, stream} ->
    IO.puts("\nStream finished")

  {:error, reason} ->
    IO.puts("\nError: #{inspect(reason)}")
end
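
The snippet above handles a single call to LLM.Stream.next/1. To consume a whole stream, the call can be repeated until it halts or fails, as in the sketch below. StreamDrain is a hypothetical helper; it takes the "next" function as an argument purely so the loop can be exercised without a live provider, and it assumes only the three return shapes shown above.

```elixir
defmodule StreamDrain do
  # Hypothetical helper, not part of the library. `next_fun` is expected to
  # behave like LLM.Stream.next/1, returning {:ok, chunks, stream},
  # {:halt, stream}, or {:error, reason}.
  def run(stream, handle_chunk, next_fun) do
    case next_fun.(stream) do
      {:ok, chunks, stream} ->
        # Hand each chunk to the caller, then keep reading.
        Enum.each(chunks, handle_chunk)
        run(stream, handle_chunk, next_fun)

      {:halt, _stream} ->
        :ok

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

With a started stream, this would be called as StreamDrain.run(stream, handler, &LLM.Stream.next/1), where handler is a one-argument function like the one passed to Enum.each above.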

Next steps