Fireworks AI


High-throughput inference for open-weight models (DeepSeek, GLM, Kimi, MiniMax, GPT-OSS, Qwen, and others) on Fireworks's hardware. OpenAI-compatible chat endpoint with Fireworks-specific extensions.

Configuration

FIREWORKS_API_KEY=fw_...

Or programmatically:

ReqLLM.put_key(:fireworks_ai_api_key, "fw_...")

Model Specs

For the full model-spec workflow, see Model Specs.

Fireworks model IDs are path-like: accounts/fireworks/models/<slug>. Use exact Fireworks IDs from LLMDB.xyz when possible and pass them as a "fireworks_ai:<id>" string. To use a model ID before it lands in the registry, build the model inline with ReqLLM.model!/1:

ReqLLM.generate_text(
  "fireworks_ai:accounts/fireworks/models/kimi-k2p5",
  "Hello!"
)

# Or inline for models not yet in LLMDB:
model = ReqLLM.model!(%{
  provider: :fireworks_ai,
  id: "accounts/fireworks/models/some-new-model"
})
ReqLLM.generate_text(model, "Hello!")

Provider Options

These options can be passed at the top level of a request or nested under :provider_options.
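As a rough illustration of the two placements, the sketch below sends the same sampling options first at the top level and then nested under :provider_options. It assumes FIREWORKS_API_KEY is set and that the model ID is available on your account; the option values are arbitrary.

```elixir
model = "fireworks_ai:accounts/fireworks/models/kimi-k2p5"

# Top level: provider options sit alongside core options such as :temperature
{:ok, _top_level} =
  ReqLLM.generate_text(model, "Hello!",
    temperature: 0.7,
    min_p: 0.05,
    repetition_penalty: 1.1
  )

# Nested: the same provider options under :provider_options
{:ok, _nested} =
  ReqLLM.generate_text(model, "Hello!",
    temperature: 0.7,
    provider_options: [min_p: 0.05, repetition_penalty: 1.1]
  )
```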

min_p

  • Type: float
  • Purpose: Minimum-probability sampling parameter
  • Example: min_p: 0.05

repetition_penalty

  • Type: float
  • Purpose: Repetition penalty applied during decoding
  • Example: repetition_penalty: 1.1

typical_p

  • Type: float
  • Purpose: Typical-p sampling parameter

mirostat_target

  • Type: float
  • Purpose: Target entropy for Mirostat sampling

mirostat_lr

  • Type: float
  • Purpose: Learning rate for Mirostat sampling

parallel_tool_calls

  • Type: boolean
  • Purpose: Allow concurrent tool invocations
  • Example: parallel_tool_calls: true

max_completion_tokens

  • Type: positive integer
  • Purpose: Reasoning-aware token budget (alternative to max_tokens)

prompt_cache_key

  • Type: String
  • Purpose: Session affinity key for KV cache reuse
  • Example: prompt_cache_key: "user-42"

prompt_cache_isolation_key

  • Type: String
  • Purpose: Namespacing key for KV cache separation across users/tenants
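The two cache keys are typically used together: one to steer related requests toward the same cache, the other to keep tenants apart. A minimal sketch, assuming FIREWORKS_API_KEY is set; session_id and tenant_id are hypothetical values standing in for whatever identifiers your application already has.

```elixir
# Hypothetical identifiers; pick values that are stable per session and per tenant.
session_id = "user-42"
tenant_id = "tenant-acme"

{:ok, _response} =
  ReqLLM.generate_text(
    "fireworks_ai:accounts/fireworks/models/kimi-k2p5",
    "Summarize our conversation so far.",
    provider_options: [
      # Requests sharing this key can reuse the same KV cache
      prompt_cache_key: session_id,
      # Requests with different isolation keys do not share cache entries
      prompt_cache_isolation_key: tenant_id
    ]
  )
```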

prompt_truncate_len

  • Type: positive integer
  • Purpose: Truncate prompts to this many tokens before sending

safe_tokenization

  • Type: boolean
  • Purpose: Reject special tokens injected via user content
  • Example: safe_tokenization: true

response_format

  • Type: map
  • Purpose: Raw response_format field (escape hatch for advanced cases — prefer fireworks_structured_output_mode)

fireworks_structured_output_mode

  • Type: :auto | :json_schema | :tool
  • Default: :auto (native response_format: json_schema with strict enforcement)
  • Purpose: Strategy for generate_object/4
  • Example: fireworks_structured_output_mode: :tool

fireworks_json_schema_strict

  • Type: boolean
  • Default: true
  • Purpose: When true, enforces additionalProperties: false and marks all properties required on the json_schema response_format
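To make the effect of strict mode concrete, here is a sketch of the kind of transformation described above. StrictSchemaSketch is a hypothetical module written for illustration, not ReqLLM's implementation, and the recursion into nested objects and array items is an assumption.

```elixir
defmodule StrictSchemaSketch do
  @moduledoc "Illustrative only; not ReqLLM's actual implementation."

  # Close every object schema and mark all of its properties as required.
  def enforce(%{"type" => "object", "properties" => props} = schema) do
    strict_props = Map.new(props, fn {name, sub} -> {name, enforce(sub)} end)

    schema
    |> Map.put("properties", strict_props)
    |> Map.put("required", strict_props |> Map.keys() |> Enum.sort())
    |> Map.put("additionalProperties", false)
  end

  # Assumption: array item schemas are tightened as well.
  def enforce(%{"type" => "array", "items" => items} = schema) do
    Map.put(schema, "items", enforce(items))
  end

  # Scalars and anything unrecognized pass through unchanged.
  def enforce(schema), do: schema
end

schema = %{
  "type" => "object",
  "properties" => %{
    "name" => %{"type" => "string"},
    "age" => %{"type" => "integer"}
  }
}

strict = StrictSchemaSketch.enforce(schema)
```

With fireworks_json_schema_strict: false, no such tightening would be applied and the schema would be sent as given.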

perf_metrics_in_response

  • Type: boolean
  • Purpose: Include latency/throughput metrics in the response body

raw_output

  • Type: boolean
  • Purpose: Return low-level model interaction details in the response

speculation

  • Type: map
  • Purpose: Speculative decoding configuration

prediction

  • Type: map
  • Purpose: Predicted output for speculative decoding

metadata

  • Type: map
  • Purpose: Arbitrary metadata stored on the request for tracing/distillation

reasoning_effort (core option)

  • Type: :none | :minimal | :low | :medium | :high | :xhigh | :default
  • Purpose: Control reasoning depth on reasoning-capable models
  • Note: Fireworks accepts an additional "max" string value not covered by the core schema — pass it via provider_options: [reasoning_effort: "max"]
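The two ways of setting reasoning effort might look like this. It is a sketch that assumes FIREWORKS_API_KEY is set and that the chosen model supports reasoning.

```elixir
model = "fireworks_ai:accounts/fireworks/models/kimi-k2p5"

# Core option: one of the atoms accepted by the core schema
{:ok, _response} =
  ReqLLM.generate_text(model, "Prove that 17 is prime.", reasoning_effort: :high)

# Fireworks-only "max" value: passed as a string under :provider_options
{:ok, _response} =
  ReqLLM.generate_text(model, "Prove that 17 is prime.",
    provider_options: [reasoning_effort: "max"]
  )
```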

Structured Output

Three modes are selectable via fireworks_structured_output_mode; :auto and :json_schema produce the same wire format, and :tool is the fallback:

schema = [
  name: [type: :string, required: true],
  age: [type: :pos_integer, required: true]
]

# Default: native response_format json_schema with strict enforcement
ReqLLM.generate_object(
  "fireworks_ai:accounts/fireworks/models/glm-5",
  "Generate a software engineer profile.",
  schema
)

# Fallback: tool-call workaround (older models or explicit preference)
ReqLLM.generate_object(
  "fireworks_ai:accounts/fireworks/models/glm-5",
  "Generate a profile.",
  schema,
  fireworks_structured_output_mode: :tool
)

Mode              Wire format
:auto (default)   response_format: json_schema (strict)
:json_schema      Same as :auto
:tool             tools[] + forced tool_choice

Implementation Notes

max_tokens > 4096 requires streaming

Fireworks rejects non-streaming chat requests with max_tokens over 4096. ReqLLM caps the value automatically and logs a warning:

[warning] Fireworks rejects non-streaming requests with max_tokens > 4096; capping to 4096

For longer completions, switch to ReqLLM.stream_text/3.
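A streaming request with a larger budget could look like the sketch below. It assumes FIREWORKS_API_KEY is set and that ReqLLM.StreamResponse.tokens/1 is available in your ReqLLM version for reading text chunks; the max_tokens value is arbitrary.

```elixir
{:ok, stream_response} =
  ReqLLM.stream_text(
    "fireworks_ai:accounts/fireworks/models/kimi-k2p5",
    "Write a long essay on speculative decoding.",
    # Above the 4096 non-streaming limit, so streaming is required
    max_tokens: 16_384
  )

# Print text chunks as they arrive
stream_response
|> ReqLLM.StreamResponse.tokens()
|> Enum.each(&IO.write/1)
```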

Tool message metadata stripped

The OpenAI envelope allows arbitrary metadata on messages, but Fireworks rejects it with an HTTP 400 error. ReqLLM therefore strips message-level metadata from the encoded body. Top-level request metadata (passed as the metadata provider option) is still forwarded.

tool_choice shape translation

Canonical ReqLLM tool choice (%{type: "tool", name: "..."}) is auto-translated to OpenAI's %{type: "function", function: %{name: "..."}} form for Fireworks.
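The mapping can be pictured as a small pure function. ToolChoiceSketch is a hypothetical module for illustration only, not ReqLLM's code; the tool name "get_weather" is made up, and passing other values through unchanged is an assumption.

```elixir
defmodule ToolChoiceSketch do
  @moduledoc "Illustrative only; not ReqLLM's actual implementation."

  # A canonical named-tool choice becomes the OpenAI function form.
  def translate(%{type: "tool", name: name}) when is_binary(name) do
    %{type: "function", function: %{name: name}}
  end

  # Assumption: any other value is passed through untouched.
  def translate(other), do: other
end

translated = ToolChoiceSketch.translate(%{type: "tool", name: "get_weather"})
```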

top_k

top_k can be passed either as a core option or nested under :provider_options:

ReqLLM.generate_text(model, "Hi", top_k: 40)
ReqLLM.generate_text(model, "Hi", provider_options: [top_k: 40])

Streaming usage

stream_options.include_usage: true is set automatically, so the last SSE chunk carries the final token counts.

Default receive timeout

Reasoning completions on K2.5-class models can run several minutes. The provider sets receive_timeout: 300_000 (5 min) by default; override per request if needed.
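A per-request override might look like the following sketch. It assumes receive_timeout is accepted as a top-level request option (in milliseconds) and that FIREWORKS_API_KEY is set; the ten-minute value is arbitrary.

```elixir
{:ok, _response} =
  ReqLLM.generate_text(
    "fireworks_ai:accounts/fireworks/models/kimi-k2p5",
    "Work through this proof step by step.",
    reasoning_effort: :high,
    # Assumed option name; raises the default 300_000 ms to 10 minutes
    receive_timeout: 600_000
  )
```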

Unsupported endpoints

ReqLLM targets Fireworks's /inference/v1/chat/completions endpoint. Out of scope: legacy /v1/completions, Anthropic-shaped /v1/messages (use the Anthropic provider instead), and platform automation APIs (accounts, datasets, deployments, fine-tuning, evals, LoRA, secrets).

Resources