# Scalability and Parallelism This guide covers the scalability features of erlang_python, including execution modes, rate limiting, and parallel execution. ## Execution Modes erlang_python automatically detects the optimal execution mode based on your Python version: ```erlang %% Check current execution mode py:execution_mode(). %% => free_threaded | subinterp | multi_executor %% Check number of executor threads py:num_executors(). %% => 4 (default) ``` ### Mode Comparison | Mode | Python Version | Parallelism | GIL Behavior | Best For | |------|----------------|-------------|--------------|----------| | **free_threaded** | 3.13+ (nogil build) | True N-way | None | Maximum throughput | | **owngil** | 3.12+ | True N-way | Per-interpreter (dedicated thread) | CPU-bound parallel | | **subinterp** | 3.12+ | None (shared GIL) | Shared GIL (pool) | High call frequency | | **multi_executor** | Any | GIL contention | Shared, round-robin | I/O-bound, compatibility | ### Free-Threaded Mode (Python 3.13+) When running on a free-threaded Python build (compiled with `--disable-gil`), erlang_python executes Python calls directly without any executor routing. This provides maximum parallelism for CPU-bound workloads. ### OWN_GIL Mode (Python 3.12+) Creates dedicated pthreads with independent GILs for true parallel Python execution. Each OWN_GIL context runs in its own thread, enabling CPU parallelism. **Architecture:** - Each context gets a dedicated pthread with its own subinterpreter and GIL - Requests dispatched via mutex/condvar IPC (not dirty schedulers) - True parallel execution across multiple OWN_GIL contexts - Higher per-call latency (~10μs vs ~2.5μs) but better parallelism **Usage:** ```erlang %% Create OWN_GIL contexts for parallel execution {ok, Ctx1} = py_context:start_link(1, owngil), {ok, Ctx2} = py_context:start_link(2, owngil), %% These execute in parallel with independent GILs spawn(fun() -> py_context:call(Ctx1, heavy_compute, run, [Data1]) end), spawn(fun() -> py_context:call(Ctx2, heavy_compute, run, [Data2]) end). ``` **Process-Local Environments:** ```erlang %% Multiple processes can share an OWN_GIL context with isolated namespaces {ok, Env} = py_context:create_local_env(Ctx), CtxRef = py_context:get_nif_ref(Ctx), ok = py_nif:context_exec(CtxRef, <<"x = 42">>, Env), {ok, 42} = py_nif:context_eval(CtxRef, <<"x">>, #{}, Env). ``` **When to use OWN_GIL:** - CPU-bound Python workloads that benefit from parallelism - Long-running computations - When you need true concurrent Python execution - Scientific computing, ML inference, data processing **See also:** [OWN_GIL Internals](owngil_internals.md) for architecture details. ### Sub-interpreter Mode (Python 3.12+) Uses Python's sub-interpreter feature with a shared GIL pool. Multiple contexts share the GIL but have isolated namespaces. Best for high call frequency with low latency. **Architecture:** - Pool of pre-created subinterpreters with shared GIL - Execution on dirty schedulers with `PyThreadState_Swap` - Lower latency (~2.5μs) but no true parallelism - Best throughput for short operations **Note:** Each sub-interpreter has isolated state. Use the [Shared State](#shared-state) API to share data between workers. **Explicit Context Selection:** ```erlang %% Get a specific context by index (1-based) Ctx = py:context(1), {ok, Result} = py:call(Ctx, math, sqrt, [16]). %% Or use automatic scheduler-affinity routing {ok, Result} = py:call(math, sqrt, [16]). ``` ### Multi-Executor Mode (Python < 3.12) Runs N executor threads that share the GIL. Requests are distributed round-robin across executors. Good for I/O-bound workloads where Python releases the GIL during I/O operations. ## Choosing the Right Mode ### Mode Comparison | Aspect | Free-Threaded | Subinterpreter | Multi-Executor | |--------|---------------|----------------|----------------| | **Parallelism** | True N-way | True N-way | GIL contention | | **State Isolation** | Shared | Isolated | Shared | | **Memory Overhead** | Low | Higher (per-interp) | Low | | **Module Compatibility** | Limited | Most modules | All modules | | **Python Version** | 3.13+ (nogil) | 3.12+ | Any | ### When to Use Each Mode **Use Free-Threaded (Python 3.13t) when:** - You need maximum parallelism with shared state - Your libraries are GIL-free compatible - You're running CPU-bound workloads - Memory efficiency is important **Use OWN_GIL (Python 3.12+) when:** - You need true CPU parallelism across Python contexts - Running long computations (ML inference, data processing) - Workload benefits from multiple independent Python interpreters - You can tolerate higher per-call latency for better throughput **Use Subinterpreters/Shared-GIL (Python 3.12+) when:** - You need high call frequency with low latency - Individual operations are short - You want namespace isolation without thread overhead - Memory efficiency is important (shared interpreter pool) **Use Multi-Executor (Python < 3.12) when:** - Running on older Python versions - Your workload is I/O-bound (GIL released during I/O) - You need compatibility with all Python modules - Shared state between workers is required ### Pros and Cons **Subinterpreter Mode Pros:** - True parallelism without GIL contention - Complete isolation (crashes don't affect other contexts) - Each context has clean namespace (no state bleed) - 25-30% faster cast operations vs worker mode **Subinterpreter Mode Cons:** - Higher memory usage (each interpreter loads modules separately) - Some C extensions don't support subinterpreters - No shared state between contexts (use Shared State API) - asyncio event loop integration requires main interpreter **Free-Threaded Mode Pros:** - True parallelism with shared state - Lower memory overhead than subinterpreters - Simplest mental model (like regular threading) **Free-Threaded Mode Cons:** - Requires Python 3.13+ built with `--disable-gil` - Many C extensions not yet compatible - Shared state requires careful synchronization - Still experimental ## Subinterpreter Architecture ### Design Overview ``` ┌─────────────────────────────────────────────────────────────────┐ │ Erlang VM (BEAM) │ ├─────────────────────────────────────────────────────────────────┤ │ py_context_router │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ Scheduler 1 ──► Context 1 (pid) │ │ │ │ Scheduler 2 ──► Context 2 (pid) │ │ │ │ Scheduler N ──► Context N (pid) │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Context │ │ Context │ │ Context │ │ │ │ Process │ │ Process │ │ Process │ │ │ │ (gen_srv)│ │ (gen_srv)│ │ (gen_srv)│ │ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ └───────┼──────────────┼──────────────┼───────────────────────────┘ │ │ │ ▼ ▼ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ Subinterpreter Thread Pool │ ├─────────────────────────────────────────────────────────────────┤ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Thread 1 │ │ Thread 2 │ │ Thread N │ │ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ │ │ │ Interp │ │ │ │ Interp │ │ │ │ Interp │ │ │ │ │ │ (GIL 1) │ │ │ │ (GIL 2) │ │ │ │ (GIL N) │ │ │ │ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │ │ Each thread owns its interpreter's GIL (Py_GIL_OWN) │ │ No GIL contention between threads │ └─────────────────────────────────────────────────────────────────┘ ``` ### Key Components **py_context_router**: Routes requests to context processes based on scheduler affinity or explicit binding. **py_context_process**: Gen_server that owns a Python context reference and handles call/eval/exec operations. **Subinterpreter Thread Pool (C)**: Manages N threads, each with its own Python subinterpreter created with `Py_NewInterpreterFromConfig()` and `Py_GIL_OWN`. ### Request Flow 1. Erlang process calls `py:call(Module, Func, Args)` 2. `py_context_router` selects context based on scheduler ID 3. Request sent to `py_context_process` gen_server 4. Gen_server calls NIF which executes on subinterpreter's thread 5. Result returned through gen_server to caller ### Pool Size The subinterpreter pool size is configured at two levels: | Level | Default | Max | |-------|---------|-----| | **Erlang (py_context_router)** | `erlang:system_info(schedulers)` | configurable | | **C pool (py_subinterp_pool)** | 32 | 64 | On a typical 8-core machine, 8 context processes are started, each with one subinterpreter slot. **Configuration via sys.config:** ```erlang {erlang_python, [ {num_contexts, 16} %% Override scheduler count ]} ``` **Configuration at runtime:** ```erlang %% Start with custom pool size py_context_router:start(#{contexts => 16}). ``` ### Thread Safety - Each subinterpreter has its own GIL (no cross-interpreter contention) - NIF calls are serialized per-context via gen_server - Erlang message passing provides synchronization - C code uses atomics for cross-thread state (`thread_running` flag) ## Rate Limiting All Python calls pass through an ETS-based counting semaphore that prevents overload: ```erlang %% Check semaphore status py_semaphore:max_concurrent(). %% => 29 (schedulers * 2 + 1) py_semaphore:current(). %% => 0 (currently running) %% Dynamically adjust limit py_semaphore:set_max_concurrent(50). ``` ### How It Works ``` ┌─────────────────────────────────────────────────────────────┐ │ py_semaphore │ │ │ │ ┌─────────┐ ┌─────────────────────────────────────┐ │ │ │ Counter │◄───│ ets:update_counter (atomic) │ │ │ │ [29] │ │ {write_concurrency, true} │ │ │ └─────────┘ └─────────────────────────────────────┘ │ │ │ │ acquire(Timeout) ──► increment ──► check ≤ max? │ │ │ │ │ │ │ yes │ no │ │ │ │ │ │ │ │ ok │ └──► backoff │ │ │ │ loop │ │ release() ──────────►└──── decrement ──┘ │ └─────────────────────────────────────────────────────────────┘ ``` ### Overload Protection When the semaphore is exhausted, `py:call` returns an overload error instead of blocking forever: ```erlang {error, {overloaded, Current, Max}} = py:call(module, func, []). ``` This allows your application to implement backpressure or shed load gracefully. ## Configuration ```erlang %% sys.config [ {erlang_python, [ %% Maximum concurrent Python operations (semaphore limit) %% Default: erlang:system_info(schedulers) * 2 + 1 {max_concurrent, 50}, %% Number of executor threads (multi_executor mode only) %% Default: 4 {num_executors, 8}, %% Worker pool sizes {num_workers, 4}, {num_async_workers, 2}, {num_subinterp_workers, 4} ]} ]. ``` ## Parallel Execution with Sub-interpreters For CPU-bound workloads on Python 3.12+, erlang_python provides true parallelism via OWN_GIL subinterpreters. ### Check Support ```erlang %% Check if subinterpreters are supported (Python 3.12+) true = py:subinterp_supported(). %% Check current execution mode subinterp = py:execution_mode(). ``` ### Using the Context Router The context router automatically distributes calls across subinterpreters: ```erlang %% Start contexts (usually done by application startup) {ok, _} = py:start_contexts(). %% Calls are automatically routed to subinterpreters {ok, 4.0} = py:call(math, sqrt, [16]). {ok, 6} = py:eval(<<"2 + 4">>). ok = py:exec(<<"x = 42">>). ``` ### Explicit Context Selection For fine-grained control, use explicit context selection: ```erlang %% Get a specific context by index (1-based) Ctx = py:context(1), %% All operations on this context share state ok = py:exec(Ctx, <<"my_var = 'hello'">>), {ok, <<"hello">>} = py:eval(Ctx, <<"my_var">>), {ok, 4.0} = py:call(Ctx, math, sqrt, [16]). %% Different context has isolated state Ctx2 = py:context(2), {error, _} = py:eval(Ctx2, <<"my_var">>). %% Not defined in Ctx2 ``` ### Context Router API ```erlang %% Start router with default number of contexts (scheduler count) {ok, Contexts} = py_context_router:start(). %% Start with custom number of contexts {ok, Contexts} = py_context_router:start(#{contexts => 8}). %% Get context for current scheduler (automatic affinity) Ctx = py_context_router:get_context(). %% Get specific context by index Ctx = py_context_router:get_context(1). %% Bind current process to a specific context ok = py_context_router:bind_context(Ctx). %% Unbind (return to scheduler-based routing) ok = py_context_router:unbind_context(). %% Get number of active contexts N = py_context_router:num_contexts(). %% Stop all contexts ok = py_context_router:stop(). ``` ### Parallel Execution Execute multiple calls in parallel across subinterpreters: ```erlang %% Execute multiple calls in parallel {ok, Results} = py:parallel([ {math, sqrt, [16]}, {math, sqrt, [25]}, {math, sqrt, [36]} ]). %% => {ok, [{ok, 4.0}, {ok, 5.0}, {ok, 6.0}]} ``` Each call runs in its own sub-interpreter with its own GIL, enabling true parallelism. ## Testing with Free-Threading To test with a free-threaded Python build: ### 1. Install Python 3.13+ with Free-Threading ```bash # macOS with Homebrew brew install python@3.13 --with-freethreading # Or build from source ./configure --disable-gil make && make install # Or use pyenv PYTHON_CONFIGURE_OPTS="--disable-gil" pyenv install 3.13.0 ``` ### 2. Verify Free-Threading is Enabled ```bash python3 -c "import sys; print('GIL disabled:', hasattr(sys, '_is_gil_enabled') and not sys._is_gil_enabled())" ``` ### 3. Rebuild erlang_python ```bash # Clean and rebuild with free-threaded Python rebar3 clean PYTHON_CONFIG=/path/to/python3.13-config rebar3 compile ``` ### 4. Verify Mode ```erlang 1> application:ensure_all_started(erlang_python). 2> py:execution_mode(). free_threaded ``` ## Performance Tuning ### For CPU-Bound Workloads - Use `py:parallel/1` with sub-interpreters (Python 3.12+) - Or use free-threaded Python (3.13+) - Increase `max_concurrent` to match available CPU cores ### For I/O-Bound Workloads - Multi-executor mode works well (GIL released during I/O) - Increase `num_executors` to handle more concurrent I/O - Use asyncio integration for async I/O ### For Mixed Workloads - Balance `max_concurrent` based on memory constraints - Monitor `py_semaphore:current()` for load metrics - Implement application-level backpressure based on overload errors ## Monitoring ```erlang %% Current load Load = py_semaphore:current(), Max = py_semaphore:max_concurrent(), Utilization = Load / Max * 100, io:format("Python load: ~.1f%~n", [Utilization]). %% Execution mode info Mode = py:execution_mode(), Executors = py:num_executors(), io:format("Mode: ~p, Executors: ~p~n", [Mode, Executors]). %% Memory stats {ok, Stats} = py:memory_stats(), io:format("GC stats: ~p~n", [maps:get(gc_stats, Stats)]). ``` ## Shared State Since workers (and sub-interpreters) have isolated namespaces, erlang_python provides ETS-backed shared state accessible from both Python and Erlang: ```python from erlang import state_set, state_get, state_incr, state_decr # Share configuration across workers config = state_get('app_config') # Thread-safe metrics state_incr('requests_total') state_incr('bytes_processed', len(data)) ``` ```erlang %% Set config that all workers can read py:state_store(<<"app_config">>, #{model => <<"gpt-4">>, timeout => 30000}). %% Read metrics {ok, Total} = py:state_fetch(<<"requests_total">>). ``` The state is backed by ETS with `{write_concurrency, true}`, making atomic counter operations fast and lock-free. See [Getting Started](getting-started.md#shared-state) for the full API. ## Reentrant Callbacks erlang_python supports reentrant callbacks where Python code calls Erlang functions that themselves call back into Python. This is handled without deadlocking through a suspension/resume mechanism: ```erlang %% Register an Erlang function that calls Python py:register_function(compute_via_python, fun([X]) -> {ok, Result} = py:call('__main__', complex_compute, [X]), Result * 2 %% Erlang post-processing end). %% Python code that uses the callback py:exec(<<" def process(x): from erlang import call # Calls Erlang, which calls Python's complex_compute result = call('compute_via_python', x) return result + 1 ">>). ``` ### How Reentrant Callbacks Work ``` ┌─────────────────────────────────────────────────────────────────┐ │ Reentrant Callback Flow │ │ │ │ 1. Python calls erlang.call('func', args) │ │ └──► Returns suspension marker, frees dirty scheduler │ │ │ │ 2. Erlang executes the registered callback │ │ └──► May call py:call() to run Python (on different worker) │ │ │ │ 3. Erlang calls resume_callback with result │ │ └──► Schedules dirty NIF to return result to Python │ │ │ │ 4. Python continues with the callback result │ │ │ └─────────────────────────────────────────────────────────────────┘ ``` ### Benefits - **No Deadlocks**: Dirty schedulers are freed during callback execution - **Nested Callbacks**: Multiple levels of Python→Erlang→Python→... are supported - **Transparent**: From Python's perspective, `erlang.call()` appears synchronous - **No Configuration**: Works automatically with all execution modes ### Performance Considerations - Reentrant callbacks have slightly higher overhead due to suspension/resume - For tight loops, consider batching operations to reduce callback overhead - Concurrent reentrant calls are fully supported and scale well ### Example: Nested Callbacks ```erlang %% Each level alternates between Erlang and Python py:register_function(level, fun([N, Max]) -> case N >= Max of true -> N; false -> {ok, Result} = py:call('__main__', next_level, [N + 1, Max]), Result end end). py:exec(<<" def next_level(n, max): from erlang import call return call('level', n, max) def start(max): from erlang import call return call('level', 1, max) ">>). %% Test 10 levels of nesting {ok, 10} = py:call('__main__', start, [10]). ``` ### Example See `examples/reentrant_demo.erl` and `examples/reentrant_demo.py` for a complete demonstration including: - Basic reentrant calls with arithmetic expressions - Fibonacci with Erlang memoization - Deeply nested callbacks (10+ levels) - OOP-style class method callbacks ```bash # Run the demo rebar3 shell 1> reentrant_demo:start(). 2> reentrant_demo:demo_all(). ``` ## Building for Performance ### Standard Build ```bash rebar3 compile ``` Uses `-O2` optimization and standard compiler flags. ### Performance Build For production deployments where maximum performance is needed: ```bash # Clean and rebuild with aggressive optimizations rm -rf _build/cmake mkdir -p _build/cmake && cd _build/cmake cmake ../../c_src -DPERF_BUILD=ON cmake --build . -j$(nproc) ``` The `PERF_BUILD` option enables: | Flag | Effect | |------|--------| | `-O3` | Aggressive optimization level | | `-flto` | Link-Time Optimization | | `-march=native` | CPU-specific instruction set | | `-ffast-math` | Relaxed floating-point math | | `-funroll-loops` | Loop unrolling | **Caveats:** - Binaries are not portable (tied to build machine's CPU) - Build time increases due to LTO - `-ffast-math` may affect floating-point precision ### Verifying the Build ```erlang %% Check that the NIF loaded successfully 1> application:ensure_all_started(erlang_python). {ok, [erlang_python]} %% Run basic verification 2> py:eval("1 + 1"). {ok, 2} ``` ## See Also - [Getting Started](getting-started.md) - Basic usage - [Memory Management](memory.md) - GC and memory debugging - [Streaming](streaming.md) - Working with generators - [Asyncio](asyncio.md) - Event loop performance details