Serving Models
ExTorch provides ExTorch.JIT.Server, a GenServer that wraps a loaded model with OTP fault tolerance, serialized inference, and telemetry instrumentation.
Basic serving
# Start a model server
{:ok, pid} = ExTorch.JIT.Server.start_link(path: "model.pt")
# Run inference
input = ExTorch.randn({1, 10})
output = ExTorch.JIT.Server.predict(pid, [input])
The server loads the model on init, sets it to eval mode, and serializes all predict calls through the GenServer, ensuring thread safety for models with mutable state (BatchNorm, Dropout).
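To illustrate what "serialized through the GenServer" means in practice, here is a minimal sketch using a toy GenServer. `SerializedWorker` is a hypothetical stand-in written for this example, not ExTorch's implementation; it only shows that concurrent callers are handled one at a time, so a read-modify-write on the server state cannot interleave.

```elixir
defmodule SerializedWorker do
  # Toy stand-in for a model server (hypothetical, NOT ExTorch code).
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, 0, opts)

  # Callers block here until the server has handled their request.
  def predict(server, x), do: GenServer.call(server, {:predict, x})

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call({:predict, x}, _from, count) do
    # Only one handle_call runs at a time, so this update of the
    # request counter cannot race with another request.
    seq = count + 1
    {:reply, {x * 2, seq}, seq}
  end
end

{:ok, pid} = SerializedWorker.start_link()

# Fire 20 requests from 20 concurrent tasks.
results =
  1..20
  |> Task.async_stream(fn x -> SerializedWorker.predict(pid, x) end, max_concurrency: 20)
  |> Enum.map(fn {:ok, reply} -> reply end)

# Every request received a distinct sequence number from the server.
seqs = results |> Enum.map(&elem(&1, 1)) |> Enum.sort()
IO.inspect(seqs == Enum.to_list(1..20))
```

The trade-off of this design is throughput: a single server process handles one request at a time, so callers queue behind a slow inference.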
Named servers
{:ok, _} = ExTorch.JIT.Server.start_link(
  path: "sentiment.pt",
  device: :cpu,
  name: SentimentModel
)
# Use from anywhere in the application
ExTorch.JIT.Server.predict(SentimentModel, [input])
Supervision
Add model servers to your application's supervision tree:
# application.ex
def start(_type, _args) do
  children = [
    # ExTorch starts its own DynamicSupervisor automatically
    {ExTorch.JIT.Server, path: "model_a.pt", name: ModelA},
    {ExTorch.JIT.Server, path: "model_b.pt", name: ModelB, device: {:cuda, 0}}
  ]
  Supervisor.start_link(children, strategy: :one_for_one)
end
If a model server crashes (e.g., due to a malformed input), the supervisor restarts it automatically, reloading the model from disk.
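The restart behavior can be observed with plain OTP, independent of ExTorch. The sketch below uses a hypothetical `FragileWorker` that stands in for a model server: it crashes on a malformed (non-numeric) input, and a `:one_for_one` supervisor brings it back under the same name. `RestartDemo.await_restart/3` is a helper invented for this example. How ExTorch.JIT.Server itself reports a failed predict to the caller is not covered here.

```elixir
defmodule FragileWorker do
  # Hypothetical stand-in for a model server (NOT ExTorch code).
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: Keyword.fetch!(opts, :name))
  end

  def predict(server, x), do: GenServer.call(server, {:predict, x})

  @impl true
  def init(opts), do: {:ok, %{path: Keyword.fetch!(opts, :path)}}

  # No clause matches non-numeric input, so such a call crashes the process.
  @impl true
  def handle_call({:predict, x}, _from, state) when is_number(x) do
    {:reply, x * 2, state}
  end
end

defmodule RestartDemo do
  # Polls until the name is registered to a pid other than old_pid.
  def await_restart(name, old_pid, tries \\ 100) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ when tries > 0 ->
        Process.sleep(10)
        await_restart(name, old_pid, tries - 1)

      _ ->
        nil
    end
  end
end

children = [{FragileWorker, path: "model.pt", name: DemoModel}]
{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)
old_pid = Process.whereis(DemoModel)

# A crash inside GenServer.call also exits the caller, so catch the exit.
crash =
  try do
    FragileWorker.predict(DemoModel, :not_a_number)
  catch
    :exit, _reason -> :crashed
  end

# The supervisor restarts the worker; init/1 runs again under the same name.
new_pid = RestartDemo.await_restart(DemoModel, old_pid)
IO.inspect({crash, new_pid != old_pid, FragileWorker.predict(DemoModel, 21)})
```

Note that the caller still sees the failure of the request that triggered the crash; supervision only guarantees that later requests find a fresh server.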
Dynamic model loading
Use the built-in ExTorch.ModelSupervisor to start servers at runtime:
DynamicSupervisor.start_child(ExTorch.ModelSupervisor, {
  ExTorch.JIT.Server,
  path: "new_model.pt", name: NewModel
})
Server info
ExTorch.JIT.Server.info(SentimentModel)
# => %{
#      path: "sentiment.pt",
#      device: :cpu,
#      inference_count: 1523,
#      error_count: 0,
#      uptime_ms: 3600000
#    }
Models with complex outputs
Models that return tuples, dicts, or nested structures work naturally:
# Python: def forward(self, x): return {"logits": ..., "features": ...}
output = ExTorch.JIT.Server.predict(MyModel, [input])
# output is an Elixir map: %{"logits" => %Tensor{}, "features" => %Tensor{}}
# Python: def forward(self, x): return self.head1(x), self.head2(x)
{head1, head2} = ExTorch.JIT.Server.predict(MultiHead, [input])
CUDA serving
# Load on GPU
{:ok, _} = ExTorch.JIT.Server.start_link(
  path: "model.pt",
  device: {:cuda, 0},
  name: GPUModel
)
# Input tensors can be on CPU -- libtorch handles the transfer
# For best performance, pre-transfer inputs:
gpu_input = ExTorch.Tensor.to(input, device: {:cuda, 0})
output = ExTorch.JIT.Server.predict(GPUModel, [gpu_input])
Telemetry events
Every predict call emits telemetry events that you can attach to:
:telemetry.attach("my-logger", [:extorch, :jit, :forward, :stop], fn _event, measurements, metadata, _config ->
  duration_ms = System.convert_time_unit(measurements.duration, :native, :microsecond) / 1000
  IO.puts("#{metadata.path}: #{duration_ms}ms (#{metadata.input_count} inputs)")
end, nil)
See the Observability guide for the full metrics and dashboard setup.
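For longer-lived handlers, the :telemetry documentation recommends passing a captured module function rather than an anonymous one. The sketch below moves the logging handler into a module; `ForwardLogger` and its function names are hypothetical, while the event name and the `duration`, `path`, and `input_count` keys are the ones used in the example above. The formatting logic is kept in a pure function so it can be exercised without the :telemetry dependency.

```elixir
defmodule ForwardLogger do
  # Hypothetical module; event name and metadata keys follow this guide.
  @event [:extorch, :jit, :forward, :stop]

  def event, do: @event

  # Converts a duration in :native time units to float milliseconds.
  def duration_ms(native) do
    System.convert_time_unit(native, :native, :microsecond) / 1000
  end

  def format(measurements, metadata) do
    "#{metadata.path}: #{duration_ms(measurements.duration)}ms " <>
      "(#{metadata.input_count} inputs)"
  end

  # Handler callback with the arity-4 shape that :telemetry expects.
  def handle_event(@event, measurements, metadata, _config) do
    IO.puts(format(measurements, metadata))
  end
end

# Simulate one event by calling the handler directly with a 3 ms duration.
native = System.convert_time_unit(3_000, :microsecond, :native)
line = ForwardLogger.format(%{duration: native}, %{path: "model.pt", input_count: 1})
ForwardLogger.handle_event(ForwardLogger.event(), %{duration: native}, %{path: "model.pt", input_count: 1}, nil)

# In an application with :telemetry available, attach it once at startup:
# :telemetry.attach("forward-logger", ForwardLogger.event(), &ForwardLogger.handle_event/4, nil)
```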