All LLM requests in this library go through the streaming path. Even generate/2 starts a stream and collects it internally. This guide covers how to work with streams directly.

Starting a stream

{:ok, stream} = LLM.stream("Tell me a story",
  provider: :openai,
  model: "gpt-4"
)

The returned stream is an LLM.Stream.t() struct containing the HTTP response reference, adapter state, and configuration.

Collecting chunks

The simplest way to consume a stream is LLM.Stream.collect/2:

{:ok, response} = LLM.Stream.collect(stream)
response.message.content
#=> "Once upon a time..."

With a callback

Process each chunk as it arrives:

{:ok, response} = LLM.Stream.collect(stream,
  on_chunk: fn
    %LLM.Stream.Chunk{text: text} -> IO.write(text)
    %LLM.Stream.Thinking{text: text} -> IO.puts("\n[thinking] #{text}")
    %LLM.Stream.Stop{reason: reason} -> IO.puts("\n[stopped: #{reason}]")
    _ -> :ok
  end
)
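A callback does not have to print. As a minimal sketch, the module below builds an on_chunk function that forwards generated text to another process (a LiveView or GenServer, say). The module name ChunkForwarder and the {:llm_text, text} message shape are illustrative assumptions, not part of the library, and the sketch assumes collect/2 ignores the callback's return value.

```elixir
defmodule ChunkForwarder do
  # Hypothetical helper: builds an on_chunk callback that forwards text to `pid`.
  # Matching on the :__struct__ key instead of %LLM.Stream.Chunk{} keeps this
  # sketch compilable without the library loaded; with the library available,
  # the struct pattern is the more natural choice.
  def on_chunk(pid) when is_pid(pid) do
    fn
      %{__struct__: LLM.Stream.Chunk, text: text} ->
        send(pid, {:llm_text, text})
        :ok

      _other ->
        # Thinking, tool call, stop and error chunks are ignored here.
        :ok
    end
  end
end
```

It would be passed as `on_chunk: ChunkForwarder.on_chunk(self())`, with the receiving process handling `{:llm_text, text}` messages as they arrive.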

Controlling tool execution

By default, collect/2 executes tool calls automatically and keeps looping until the model stops requesting tools. Disable this with:

{:ok, response} = LLM.Stream.collect(stream, auto_tools: false)

Limit the number of tool call rounds:

{:ok, response} = LLM.Stream.collect(stream, max_rounds: 3)

Manual chunk processing

For fine-grained control, use LLM.Stream.next/1:

case LLM.Stream.next(stream) do
  {:ok, chunks, stream} ->
    Enum.each(chunks, fn
      %LLM.Stream.Chunk{text: text} -> IO.write(text)
      %LLM.Stream.ToolCall{name: name, arguments: args} ->
        IO.puts("\nTool call: #{name}(#{inspect(args)})")
      _ -> :ok
    end)
    # Call LLM.Stream.next/1 again with the returned stream to keep reading

  {:halt, stream} ->
    IO.puts("\nStream finished")

  {:error, reason} ->
    IO.puts("\nError: #{inspect(reason)}")
end
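A single call to next/1 returns only one batch of chunks, so reading a whole stream means calling it repeatedly. One way to sketch that loop is a small recursive function. The module name StreamDrain is hypothetical, and the next function is passed in as an argument so the sketch does not depend on the library being loaded; it assumes only the three return shapes shown above.

```elixir
defmodule StreamDrain do
  # Hypothetical helper: calls `next_fun` until the stream halts or errors,
  # handing every chunk to `handle_chunk` along the way.
  # Returns {:ok, final_stream} or {:error, reason}.
  def run(stream, next_fun, handle_chunk) do
    case next_fun.(stream) do
      {:ok, chunks, stream} ->
        Enum.each(chunks, handle_chunk)
        # Recurse with the updated stream returned by next_fun.
        run(stream, next_fun, handle_chunk)

      {:halt, stream} ->
        {:ok, stream}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

With the library loaded, this would be called as `StreamDrain.run(stream, &LLM.Stream.next/1, handler)`, where handler is a one-argument function like the one passed to Enum.each above.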

Chunk types

Streaming responses produce a mix of chunk types:

Chunk      Module                Fields                                 Description
Text       LLM.Stream.Chunk      text, index                            A piece of generated text
Thinking   LLM.Stream.Thinking   text, signature                        Reasoning/thinking content
Tool call  LLM.Stream.ToolCall   id, name, arguments, index, complete   A tool invocation
Stop       LLM.Stream.Stop       reason, usage                          Stream ended
Error      LLM.Stream.Error      message, code                          An error occurred

Text chunks

%LLM.Stream.Chunk{text: "Hello", index: 0}

Thinking chunks

Providers that support reasoning (Anthropic, Gemini, and OpenAI's o-series models) emit thinking chunks:

%LLM.Stream.Thinking{text: "Let me consider...", signature: nil}

Anthropic thinking blocks include a signature field for verification:

%LLM.Stream.Thinking{
  text: "The user is asking about...",
  signature: "EqQBCgIYAhIM..."
}
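Because text and thinking chunks both carry a text field, it helps to separate them when building a transcript. The sketch below folds a list of chunks into visible text and reasoning text. The module name Transcript is hypothetical, and the chunks are matched through their :__struct__ key so the example compiles without the library.

```elixir
defmodule Transcript do
  # Hypothetical helper: splits a list of chunks into visible text and
  # reasoning text, ignoring every other chunk type.
  def split(chunks) do
    Enum.reduce(chunks, %{text: "", thinking: ""}, fn
      %{__struct__: LLM.Stream.Chunk, text: t}, acc ->
        %{acc | text: acc.text <> t}

      %{__struct__: LLM.Stream.Thinking, text: t}, acc ->
        %{acc | thinking: acc.thinking <> t}

      _other, acc ->
        acc
    end)
  end
end
```

The result is a map with :text and :thinking keys, so the reasoning can be logged separately or dropped before the answer is shown to a user.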

Tool call chunks

Tool calls arrive as chunks with complete: true when the full call is available:

%LLM.Stream.ToolCall{
  id: "call_abc123",
  name: "get_weather",
  arguments: %{"location" => "San Francisco"},
  index: 0,
  complete: true
}

During streaming, a provider may send a tool call's arguments in several partial fragments. The library accumulates them internally and emits a tool call chunk only once the call is complete (complete: true).
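When tools are run by hand (for example with auto_tools: false), the completed tool call chunks have to be routed to your own functions. A minimal sketch of such a dispatcher follows. The module name ManualTools, the handlers map, and the result tuples are all assumptions for illustration; how results are fed back to the model is not covered here.

```elixir
defmodule ManualTools do
  # Hypothetical helper: runs every completed tool call found in `chunks`.
  # `handlers` maps a tool name to a one-argument function that receives the
  # decoded arguments map. Returns a list of {call_id, result} tuples.
  def dispatch(chunks, handlers) do
    # The generator pattern skips anything that is not a completed tool call.
    for %{__struct__: LLM.Stream.ToolCall, complete: true} = call <- chunks do
      case Map.fetch(handlers, call.name) do
        {:ok, fun} -> {call.id, {:ok, fun.(call.arguments)}}
        :error -> {call.id, {:error, :unknown_tool}}
      end
    end
  end
end
```

Returning an error tuple for an unknown tool name, rather than raising, leaves the caller free to report the problem back to the model or to abort.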

Stop chunks

%LLM.Stream.Stop{reason: :stop, usage: %LLM.Usage{input_tokens: 10, output_tokens: 50}}

Common stop reasons:

  • :stop — natural end of response
  • :length — hit max_tokens limit
  • :tool_calls — model wants to call tools (stream continues)
  • :content_filter — content was filtered

Stream timeout

The default timeout is 30 seconds. To override it, pass a timeout among the options given to LLM.stream/2:

{:ok, stream} = LLM.stream("Long generation", provider: :openai, model: "gpt-4")
# Add the timeout option to the keyword list above; its value is stored on the stream struct
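The stream timeout concerns gaps between data arriving, so a slow but steady generation can still run for a long time. If an overall deadline is needed, one option is to wrap the work in a Task, using only the standard library. The module name Deadline is hypothetical, and this sketch assumes the stream is both started and collected inside the wrapped function, since it is not stated whether a stream can be handed between processes.

```elixir
defmodule Deadline do
  # Hypothetical helper: runs `fun` in a task and gives up after `timeout_ms`.
  # Returns whatever `fun` returns, or {:error, :deadline_exceeded}.
  def run(fun, timeout_ms) when is_function(fun, 0) do
    task = Task.async(fun)

    # Task.yield/2 returns nil on timeout; Task.shutdown/2 then stops the task
    # and returns a reply if one arrived in the meantime.
    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> result
      {:exit, reason} -> {:error, {:exit, reason}}
      nil -> {:error, :deadline_exceeded}
    end
  end
end
```

The wrapped function would call LLM.stream and LLM.Stream.collect itself and return the collect result, so both calls run in the task's process.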

Error handling

# Starting the stream and collecting it can each fail; `with` handles both in one place
with {:ok, stream} <- LLM.stream("Hello", provider: :openai, model: "gpt-4"),
     {:ok, response} <- LLM.Stream.collect(stream) do
  handle_response(response)
else
  {:error, reason} -> handle_error(reason)
end

Common errors:

  • :timeout — no data received within the timeout period
  • %{status: 401, body: ...} — authentication failure
  • %{status: 429, body: ...} — rate limit exceeded
  • %{status: 500, body: ...} — server error
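Some of these errors are worth retrying and others are not. The sketch below shows one possible policy: retry timeouts, rate limits, and server errors with a doubling delay, and give up immediately on anything else, such as a 401. The module name ErrorPolicy, the choice of which errors count as retryable, and the delays are assumptions for illustration, not library behaviour.

```elixir
defmodule ErrorPolicy do
  # Hypothetical policy: which error reasons are worth another attempt.
  def retryable?(:timeout), do: true
  def retryable?(%{status: 429}), do: true
  def retryable?(%{status: status}) when status in 500..599, do: true
  def retryable?(_other), do: false

  # Calls `fun` up to `attempts` times, doubling the delay after each
  # retryable failure. Any non-error result is returned unchanged.
  def with_retry(fun, attempts \\ 3, delay_ms \\ 200) do
    case fun.() do
      {:error, reason} = error ->
        if attempts > 1 and retryable?(reason) do
          Process.sleep(delay_ms)
          with_retry(fun, attempts - 1, delay_ms * 2)
        else
          error
        end

      other ->
        other
    end
  end
end
```

The function given to with_retry should start a fresh stream on every call, because a stream that has already failed is unlikely to be reusable.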

Combining streaming with tools

When using collect/2 with the default auto_tools: true, the library will:

  1. Collect all chunks from the stream
  2. If tool calls are present, execute them
  3. Append the tool results to the conversation
  4. Start a new stream with the updated context
  5. Repeat until no more tool calls or max_rounds is reached

This is transparent — the final LLM.Response contains the complete message after all tool call rounds.

{:ok, response} = LLM.generate("Read mix.exs and summarize it",
  provider: :openai,
  model: "gpt-4",
  tools: [MyApp.ReadFileTool, MyApp.SummarizeTool],
  max_rounds: 5
)

# The response includes text from after all tool executions
response.message.content
#=> "The mix.exs file defines..."
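The five-step loop described in this section can be pictured with a small stand-alone sketch. This is not the library's implementation: the module name ToolLoopSketch, the message tuples, and the two injected functions are hypothetical. What happens when max_rounds is reached is not stated in this guide, so the error tuple returned here is only a placeholder.

```elixir
defmodule ToolLoopSketch do
  # collect_fun: takes the conversation (a list of messages) and returns
  #              {text, tool_calls}, standing in for "stream, then collect".
  # run_tool:    takes one tool call and returns its result.
  def run(messages, collect_fun, run_tool, max_rounds) when max_rounds >= 0 do
    case collect_fun.(messages) do
      # No tool calls: the text is the final answer.
      {text, []} ->
        {:ok, text}

      # Tool calls requested but no rounds left (placeholder behaviour).
      {_text, _calls} when max_rounds == 0 ->
        {:error, :max_rounds_reached}

      # Execute the tools, append their results, and start another round.
      {text, calls} ->
        results = Enum.map(calls, fn call -> {:tool_result, call, run_tool.(call)} end)
        updated = messages ++ [{:assistant, text, calls}] ++ results
        run(updated, collect_fun, run_tool, max_rounds - 1)
    end
  end
end
```

Each recursive call corresponds to one round, which is why a max_rounds limit bounds the number of model requests even if the model keeps asking for tools.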

Next steps