Scalability and Parallelism
View SourceThis guide covers the scalability features of erlang_python, including execution modes, rate limiting, and parallel execution.
Execution Modes
erlang_python supports two execution modes:
%% Check current execution mode
py:execution_mode().
%% => worker | owngilMode Comparison
| Mode | Description | Parallelism | GIL Behavior | Best For |
|---|---|---|---|---|
| worker | Dedicated pthread per context | GIL contention | Shared GIL | Default, maximum compatibility |
| owngil | Dedicated pthread + subinterpreter | True N-way | Per-interpreter GIL | CPU-bound parallel (Python 3.14+) |
Worker Mode (Default)
Each context gets a dedicated pthread that handles all Python operations. This provides stable thread affinity, which is critical for libraries like numpy, torch, and tensorflow that maintain thread-local state. The regression contract for that guarantee lives in test/py_ml_libs_SUITE.erl, which drives real numpy and tensorflow operations through exec / eval / call and across multiple Erlang processes targeting the same context.
OWN_GIL Mode (Python 3.12+)
Creates dedicated pthreads with independent GILs for true parallel Python execution. Each OWN_GIL context runs in its own thread, enabling CPU parallelism.
Architecture:
- Each context gets a dedicated pthread with its own subinterpreter and GIL
- Requests dispatched via mutex/condvar IPC (not dirty schedulers)
- True parallel execution across multiple OWN_GIL contexts
- Higher per-call latency (~10μs vs ~2.5μs) but better parallelism
Usage:
%% Create OWN_GIL contexts for parallel execution
{ok, Ctx1} = py_context:start_link(1, owngil),
{ok, Ctx2} = py_context:start_link(2, owngil),
%% These execute in parallel with independent GILs
spawn(fun() -> py_context:call(Ctx1, heavy_compute, run, [Data1]) end),
spawn(fun() -> py_context:call(Ctx2, heavy_compute, run, [Data2]) end).Process-Local Environments:
%% Multiple processes can share an OWN_GIL context with isolated namespaces
{ok, Env} = py_context:create_local_env(Ctx),
CtxRef = py_context:get_nif_ref(Ctx),
ok = py_nif:context_exec(CtxRef, <<"x = 42">>, Env),
{ok, 42} = py_nif:context_eval(CtxRef, <<"x">>, #{}, Env).When to use OWN_GIL:
- CPU-bound Python workloads that benefit from parallelism
- Long-running computations
- When you need true concurrent Python execution
- Scientific computing, ML inference, data processing
See also: OWN_GIL Internals for architecture details.
Explicit Context Selection:
%% Get a specific context by index (1-based)
Ctx = py:context(1),
{ok, Result} = py:call(Ctx, math, sqrt, [16]).
%% Or use automatic scheduler-affinity routing
{ok, Result} = py:call(math, sqrt, [16]).Choosing the Right Mode
When to Use Each Mode
Use Worker Mode (default) when:
- You need maximum module compatibility
- Running libraries like numpy, torch, tensorflow
- High call frequency with low latency
- Shared state between contexts is needed
Use OWN_GIL Mode when:
- You need true CPU parallelism across Python contexts
- Running long computations (ML inference, data processing)
- Workload benefits from multiple independent Python interpreters
- You can tolerate higher per-call latency for better throughput
Pros and Cons
Worker Mode Pros:
- Maximum module compatibility (all C extensions work)
- Stable thread affinity for numpy/torch/tensorflow
- Low memory overhead (single interpreter)
- Shared state between contexts
Worker Mode Cons:
- GIL contention limits parallelism
- No isolation between contexts
OWN_GIL Mode Pros:
- True parallelism without GIL contention
- Complete isolation (crashes don't affect other contexts)
- Each context has clean namespace (no state bleed)
OWN_GIL Mode Cons:
- Higher memory usage (each interpreter loads modules separately)
- Some C extensions don't support subinterpreters
- Requires Python 3.14+
- Higher per-call latency (~4x worker)
- Callback re-entry to the same context is restricted (
erlang.callfrom inside an OWN_GIL context routes to a thread worker, not back to that context)
For a fuller breakdown of OWN_GIL tradeoffs, common pitfalls, and a usage decision guide, see OWN_GIL Internals: Pros and Cons.
Subinterpreter Architecture
Design Overview
┌─────────────────────────────────────────────────────────────────┐
│ Erlang VM (BEAM) │
├─────────────────────────────────────────────────────────────────┤
│ py_context_router │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Scheduler 1 ──► Context 1 (pid) │ │
│ │ Scheduler 2 ──► Context 2 (pid) │ │
│ │ Scheduler N ──► Context N (pid) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Context │ │ Context │ │ Context │ │
│ │ Process │ │ Process │ │ Process │ │
│ │ (gen_srv)│ │ (gen_srv)│ │ (gen_srv)│ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
└───────┼──────────────┼──────────────┼───────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ Subinterpreter Thread Pool │
├─────────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Thread 1 │ │ Thread 2 │ │ Thread N │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │ Interp │ │ │ │ Interp │ │ │ │ Interp │ │ │
│ │ │ (GIL 1) │ │ │ │ (GIL 2) │ │ │ │ (GIL N) │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Each thread owns its GIL (PyInterpreterConfig_OWN_GIL) │
│ No GIL contention between threads │
└─────────────────────────────────────────────────────────────────┘Key Components
py_context_router: Routes requests to context processes based on scheduler affinity or explicit binding.
py_context_process: Gen_server that owns a Python context reference and handles call/eval/exec operations.
Subinterpreter Thread Pool (C): Manages N threads, each with its own Python subinterpreter created with Py_NewInterpreterFromConfig() and PyInterpreterConfig_OWN_GIL.
Request Flow
- Erlang process calls
py:call(Module, Func, Args) py_context_routerselects context based on scheduler ID- Request sent to
py_context_processgen_server - Gen_server calls NIF which executes on subinterpreter's thread
- Result returned through gen_server to caller
Pool Size
py_context_router sizes the context pool from num_contexts (default
erlang:system_info(schedulers)). Each context owns its own pthread; in
owngil mode that thread also owns a dedicated subinterpreter. There is no
shared C-level pool.
Configuration via sys.config:
{erlang_python, [
{num_contexts, 16} %% Override scheduler count
]}Configuration at runtime:
%% Start with custom pool size
py_context_router:start(#{contexts => 16}).Thread Safety
- Each subinterpreter has its own GIL (no cross-interpreter contention)
- NIF calls are serialized per-context via gen_server
- Erlang message passing provides synchronization
- C code uses atomics for cross-thread state (
thread_runningflag)
Rate Limiting
All Python calls pass through an ETS-based counting semaphore that prevents overload:
%% Check semaphore status
py_semaphore:max_concurrent(). %% => 29 (schedulers * 2 + 1)
py_semaphore:current(). %% => 0 (currently running)
%% Dynamically adjust limit
py_semaphore:set_max_concurrent(50).How It Works
┌─────────────────────────────────────────────────────────────┐
│ py_semaphore │
│ │
│ ┌─────────┐ ┌─────────────────────────────────────┐ │
│ │ Counter │◄───│ ets:update_counter (atomic) │ │
│ │ [29] │ │ {write_concurrency, true} │ │
│ └─────────┘ └─────────────────────────────────────┘ │
│ │
│ acquire(Timeout) ──► increment ──► check ≤ max? │
│ │ │ │
│ │ yes │ no │
│ │ │ │ │
│ │ ok │ └──► backoff │
│ │ │ loop │
│ release() ──────────►└──── decrement ──┘ │
└─────────────────────────────────────────────────────────────┘Overload Protection
When the semaphore is exhausted, py:call returns an overload error instead of blocking forever:
{error, {overloaded, Current, Max}} = py:call(module, func, []).This allows your application to implement backpressure or shed load gracefully.
Configuration
%% sys.config
[
{erlang_python, [
%% Maximum concurrent Python operations (semaphore limit)
%% Default: erlang:system_info(schedulers) * 2 + 1
{max_concurrent, 50},
%% Context mode: worker | owngil
%% Default: worker
{context_mode, worker},
%% Number of contexts
%% Default: erlang:system_info(schedulers)
{num_contexts, 8}
]}
].Parallel Execution with Sub-interpreters
For CPU-bound workloads on Python 3.12+, erlang_python provides true parallelism via OWN_GIL subinterpreters.
Check Support
%% Check if subinterpreters are supported (Python 3.12+)
true = py:subinterp_supported().
%% Check current execution mode (mirrors context_mode app env)
worker = py:execution_mode(). %% or owngilUsing the Context Router
The context router automatically distributes calls across subinterpreters:
%% Start contexts (usually done by application startup)
{ok, _} = py:start_contexts().
%% Calls are automatically routed to subinterpreters
{ok, 4.0} = py:call(math, sqrt, [16]).
{ok, 6} = py:eval(<<"2 + 4">>).
ok = py:exec(<<"x = 42">>).Explicit Context Selection
For fine-grained control, use explicit context selection:
%% Get a specific context by index (1-based)
Ctx = py:context(1),
%% All operations on this context share state
ok = py:exec(Ctx, <<"my_var = 'hello'">>),
{ok, <<"hello">>} = py:eval(Ctx, <<"my_var">>),
{ok, 4.0} = py:call(Ctx, math, sqrt, [16]).
%% Different context has isolated state
Ctx2 = py:context(2),
{error, _} = py:eval(Ctx2, <<"my_var">>). %% Not defined in Ctx2Context Router API
%% Start router with default number of contexts (scheduler count)
{ok, Contexts} = py_context_router:start().
%% Start with custom number of contexts
{ok, Contexts} = py_context_router:start(#{contexts => 8}).
%% Get context for current scheduler (automatic affinity)
Ctx = py_context_router:get_context().
%% Get specific context by index
Ctx = py_context_router:get_context(1).
%% Bind current process to a specific context
ok = py_context_router:bind_context(Ctx).
%% Unbind (return to scheduler-based routing)
ok = py_context_router:unbind_context().
%% Get number of active contexts
N = py_context_router:num_contexts().
%% Stop all contexts
ok = py_context_router:stop().Parallel Execution
Execute multiple calls in parallel across subinterpreters:
%% Execute multiple calls in parallel
{ok, Results} = py:parallel([
{math, sqrt, [16]},
{math, sqrt, [25]},
{math, sqrt, [36]}
]).
%% => {ok, [{ok, 4.0}, {ok, 5.0}, {ok, 6.0}]}Each call runs in its own sub-interpreter with its own GIL, enabling true parallelism.
Testing with Free-Threading
To test with a free-threaded Python build:
1. Install Python 3.13+ with Free-Threading
# macOS with Homebrew
brew install python@3.13 --with-freethreading
# Or build from source
./configure --disable-gil
make && make install
# Or use pyenv
PYTHON_CONFIGURE_OPTS="--disable-gil" pyenv install 3.13.0
2. Verify Free-Threading is Enabled
python3 -c "import sys; print('GIL disabled:', hasattr(sys, '_is_gil_enabled') and not sys._is_gil_enabled())"
3. Rebuild erlang_python
# Clean and rebuild with free-threaded Python
rebar3 clean
PYTHON_CONFIG=/path/to/python3.13-config rebar3 compile
4. Verify Mode
1> application:ensure_all_started(erlang_python).
2> py:execution_mode().
worker
%% Free-threaded Python is detected internally; the public mode mirrors
%% the configured context_mode (worker | owngil), and worker mode
%% automatically benefits from the free-threaded build.Performance Tuning
For CPU-Bound Workloads
- Use
py:parallel/1with sub-interpreters (Python 3.12+) - Or use free-threaded Python (3.13+)
- Increase
max_concurrentto match available CPU cores
For I/O-Bound Workloads
- Worker mode works well (GIL released during I/O)
- Use asyncio integration for async I/O
- Increase
num_contextsfor more concurrent I/O capacity
For Mixed Workloads
- Balance
max_concurrentbased on memory constraints - Monitor
py_semaphore:current()for load metrics - Implement application-level backpressure based on overload errors
Monitoring
%% Current load
Load = py_semaphore:current(),
Max = py_semaphore:max_concurrent(),
Utilization = Load / Max * 100,
io:format("Python load: ~.1f%~n", [Utilization]).
%% Execution mode info
Mode = py:execution_mode(),
io:format("Mode: ~p~n", [Mode]).
%% Memory stats
{ok, Stats} = py:memory_stats(),
io:format("GC stats: ~p~n", [maps:get(gc_stats, Stats)]).Shared State
Since workers (and sub-interpreters) have isolated namespaces, erlang_python provides ETS-backed shared state accessible from both Python and Erlang:
from erlang import state_set, state_get, state_incr, state_decr
# Share configuration across workers
config = state_get('app_config')
# Thread-safe metrics
state_incr('requests_total')
state_incr('bytes_processed', len(data))%% Set config that all workers can read
py:state_store(<<"app_config">>, #{model => <<"gpt-4">>, timeout => 30000}).
%% Read metrics
{ok, Total} = py:state_fetch(<<"requests_total">>).The state is backed by ETS with {write_concurrency, true}, making atomic
counter operations fast and lock-free. See Getting Started
for the full API.
Reentrant Callbacks
erlang_python supports reentrant callbacks where Python code calls Erlang functions that themselves call back into Python. This is handled without deadlocking through a suspension/resume mechanism:
%% Register an Erlang function that calls Python
py:register_function(compute_via_python, fun([X]) ->
{ok, Result} = py:call('__main__', complex_compute, [X]),
Result * 2 %% Erlang post-processing
end).
%% Python code that uses the callback
py:exec(<<"
def process(x):
from erlang import call
# Calls Erlang, which calls Python's complex_compute
result = call('compute_via_python', x)
return result + 1
">>).How Reentrant Callbacks Work
┌─────────────────────────────────────────────────────────────────┐
│ Reentrant Callback Flow │
│ │
│ 1. Python calls erlang.call('func', args) │
│ └──► Returns suspension marker, frees dirty scheduler │
│ │
│ 2. Erlang executes the registered callback │
│ └──► May call py:call() to run Python (on different worker) │
│ │
│ 3. Erlang calls resume_callback with result │
│ └──► Schedules dirty NIF to return result to Python │
│ │
│ 4. Python continues with the callback result │
│ │
└─────────────────────────────────────────────────────────────────┘Benefits
- No Deadlocks: Dirty schedulers are freed during callback execution
- Nested Callbacks: Multiple levels of Python→Erlang→Python→... are supported
- Transparent: From Python's perspective,
erlang.call()appears synchronous - No Configuration: Works automatically with all execution modes
Performance Considerations
- Reentrant callbacks have slightly higher overhead due to suspension/resume
- For tight loops, consider batching operations to reduce callback overhead
- Concurrent reentrant calls are fully supported and scale well
Example: Nested Callbacks
%% Each level alternates between Erlang and Python
py:register_function(level, fun([N, Max]) ->
case N >= Max of
true -> N;
false ->
{ok, Result} = py:call('__main__', next_level, [N + 1, Max]),
Result
end
end).
py:exec(<<"
def next_level(n, max):
from erlang import call
return call('level', n, max)
def start(max):
from erlang import call
return call('level', 1, max)
">>).
%% Test 10 levels of nesting
{ok, 10} = py:call('__main__', start, [10]).Example
See examples/reentrant_demo.erl and examples/reentrant_demo.py for a complete
demonstration including:
- Basic reentrant calls with arithmetic expressions
- Fibonacci with Erlang memoization
- Deeply nested callbacks (10+ levels)
- OOP-style class method callbacks
# Run the demo
rebar3 shell
1> reentrant_demo:start().
2> reentrant_demo:demo_all().
Building for Performance
Standard Build
rebar3 compile
Uses -O2 optimization and standard compiler flags.
Performance Build
For production deployments where maximum performance is needed:
# Clean and rebuild with aggressive optimizations
rm -rf _build/cmake
mkdir -p _build/cmake && cd _build/cmake
cmake ../../c_src -DPERF_BUILD=ON
cmake --build . -j$(nproc)
The PERF_BUILD option enables:
| Flag | Effect |
|---|---|
-O3 | Aggressive optimization level |
-flto | Link-Time Optimization |
-march=native | CPU-specific instruction set |
-ffast-math | Relaxed floating-point math |
-funroll-loops | Loop unrolling |
Caveats:
- Binaries are not portable (tied to build machine's CPU)
- Build time increases due to LTO
-ffast-mathmay affect floating-point precision
Verifying the Build
%% Check that the NIF loaded successfully
1> application:ensure_all_started(erlang_python).
{ok, [erlang_python]}
%% Run basic verification
2> py:eval("1 + 1").
{ok, 2}Erlang/OTP Version
erlang_python is built and tested on Erlang/OTP 28 and 29 (minimum OTP 28). The heavy lifting happens in the C NIF, so the BEAM-side glue is light, but newer OTP releases still help for free:
- OTP 29 JIT and scheduler improvements apply to the Erlang side (routing, context bookkeeping, callback dispatch) with no code change. Just run on OTP 29.
- Consistent map iteration order is now guaranteed across map operations in OTP 29. The router and state modules do not depend on iteration order, so this changes nothing here, but it removes a class of subtle bugs if you build on top.
- Building the NIF stays the same across OTP 28/29 (NIF API 2.18); no version
guards are needed in
c_src.
For doc-example testing, OTP 29 ships ct_doctest. This project keeps its own
scripts/lint_doc_snippets.escript because that linter also validates the Python
blocks in the docs, which ct_doctest does not cover.
See Also
- Getting Started - Basic usage
- Memory Management - GC and memory debugging
- Streaming - Working with generators
- Asyncio - Event loop performance details