All notable changes to this project will be documented in this file.
This project adheres to Semantic Versioning.
[Unreleased]
Focused code-review pass across the NIF, shepherd, and Elixir layers. Correctness-first: closes two real-world race/leak bugs, hardens the post-fork child window, and adds an AddressSanitizer + UBSan CI job.
Fixed
- FD leak in `nif_create_fd` when `enif_mutex_create` failed: the destructor previously gated `close(fd)` on a non-NULL lock, so a failed mutex allocation leaked the file descriptor and armed a NULL-deref in any later `nif_close`. The mutex result is checked and the dtor now closes the fd unconditionally.
- Use-after-close race in NIF read/write vs. close/down: `nif_read`/`nif_write` copied `res->fd` under the mutex and released the lock before the syscall; a concurrent `nif_close` or owner-death callback could close the fd before the syscall ran, letting the read/write target a recycled fd. The mutex is now held across the syscall and the subsequent `enif_select` registration; the actual `close()` is deferred to the `io_resource_stop` callback so BEAM can drain pending selects before the fd is released.
- Lost initial stderr chunk in `:consume` mode: `kick_stderr_read` in `init/1` sent `{:stderr_data, data}` to `self()` but no `handle_info/2` clause matched, so the first (and often only) chunk of stderr for fast-exiting processes was silently dropped. The missing handler now appends to the stderr buffer and drains any remainder.
- `write_loop` spin on `{:ok, 0}`: if the kernel ever returned 0 bytes on a non-empty write, the GenServer would recurse forever on the dirty scheduler. Bounded with a 1 ms sleep-retry.
- Shepherd UDS command framing: the event loop parsed only `buf[0]`, discarding any coalesced or tail commands (e.g. `CMD_CLOSE_STDIN` followed immediately by `CMD_KILL`). Frames are now length-dispatched per opcode with a carry-over buffer across `poll()` iterations.
- Post-fork child stdio and signal safety: replaced `fprintf`/`strerror` in the post-fork / pre-exec window with a `write(2)`-based `child_fail()` helper (async-signal-safe). Every `dup2`, `setsid`, and `TIOCSCTTY` return is now checked; on failure the child exits 127 with a diagnostic instead of running with broken stdio.
- `waitpid` after SIGKILL: replaced the unbounded `waitpid(child_pid, NULL, 0)` with a bounded WNOHANG loop (~3 s cap) so the shepherd cannot hang on a child stuck in uninterruptible kernel sleep (D-state).
- SIGCHLD reap loop: reap all pending children per SIGCHLD (`while waitpid(-1, ..., WNOHANG) > 0`) so a coalesced signal never leaks zombies.
- Cgroup / UDS path hardening: validate every `snprintf` return, reject too-long UDS paths, set `FD_CLOEXEC` on the PTY master, treat user-requested cgroup setup failure as fatal, and replace the fixed 100 ms `usleep` in `cgroup_cleanup` with a bounded polling `rmdir`.
- `Stream` consumer crash cleanup: `Stream.resource`'s `after` callback is only run on normal termination. A consumer crash orphaned the `NetRunner.Process` GenServer and its OS child. `NetRunner.Process.start/3` now accepts an `:owner` option that monitors the caller; `NetRunner.Stream.stream/3` passes `self()`, so a consumer crash SIGKILLs the OS process and stops the GenServer.
- Watcher blocking on `Process.sleep`: the 5 s sleep in `handle_info/2` wedged the Watcher unresponsive (including to supervisor shutdown). Replaced with `Process.send_after/3` and a new `:escalate_to_sigkill` handler.
- Parked-caller tracking in `Operations`: callers parked on EAGAIN are now `Process.monitor/1`-ed; dead callers are pruned on `:DOWN` instead of lingering in the pending map until process exit.
- `read_uds_message` race: replaced the `:peek` + full-recv pattern (which could time out if the payload arrived a moment after the opcode) with an opcode-first read flow and longer timeouts.
- `cmd`/`args` validation: reject non-binary, empty, or NUL-containing cmd and args at the spawn boundary. Passing NUL bytes through `Port.open`'s `args:` is undefined on the C side.
- `NetRunner.run/2` error surface: previously pattern-matched `{:ok, pid}` from `Proc.start`, raising `MatchError` when validation failed. Now returns `{:error, reason}` cleanly.
- `File.rm` cleanup of UDS socket: tolerate `:enoent` (shepherd may have unlinked), propagate other errors.
- `Signal.resolve` integer range: integer signals outside POSIX `1..31` now return `{:error, :unknown_signal}` instead of being forwarded to `kill(2)`.
- `Signal` single source of truth: `Signal.resolve` delegates to the NIF for known-atom lookup instead of maintaining a duplicate allow-list that drifted from the C side.
- Daemon drain resilience: drain-task crashes used to match a catch-all `:DOWN` handler and silently stop draining; the pipe then filled until the child blocked. Narrowed to recognised refs with a warning log; `drain_loop` wrapped in `try/rescue/catch` so a reader or logger exception cannot take the daemon down through the linked Task.
- `terminate/2` explicitly closes the shepherd `Port` after the UDS socket for deterministic teardown order.
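The command-framing fix above (length dispatch per opcode plus a carry-over buffer) can be illustrated with a small sketch. This is not the shepherd's actual code: the opcode values, the frame sizes (one byte for `CMD_CLOSE_STDIN`, two for `CMD_KILL`), and the names `struct framer`, `framer_push`, and `framer_pop` are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcode values and frame sizes; the real protocol may differ. */
enum { CMD_KILL = 0x01, CMD_CLOSE_STDIN = 0x02 };

#define CARRY_MAX 64

/* Bytes received but not yet consumed, kept across poll() iterations. */
struct framer {
    uint8_t buf[CARRY_MAX];
    size_t  len;
};

/* Total frame length for an opcode, or 0 if the opcode is unknown. */
static size_t frame_len(uint8_t opcode)
{
    switch (opcode) {
    case CMD_KILL:        return 2; /* opcode + signal number (assumed) */
    case CMD_CLOSE_STDIN: return 1; /* opcode only (assumed) */
    default:              return 0;
    }
}

/* Append freshly read bytes to the carry-over buffer.
 * Returns 0 on success, -1 if the buffer would overflow. */
static int framer_push(struct framer *f, const uint8_t *data, size_t n)
{
    if (n > CARRY_MAX - f->len)
        return -1;
    memcpy(f->buf + f->len, data, n);
    f->len += n;
    return 0;
}

/* Pop one complete frame into frame_out (at least 2 bytes in this sketch).
 * Returns the frame length, 0 if more bytes are needed,
 * or -1 on an unknown opcode. */
static int framer_pop(struct framer *f, uint8_t *frame_out)
{
    size_t need;

    if (f->len == 0)
        return 0;
    need = frame_len(f->buf[0]);
    if (need == 0)
        return -1;
    if (f->len < need)
        return 0; /* partial frame: keep it for the next read */
    memcpy(frame_out, f->buf, need);
    memmove(f->buf, f->buf + need, f->len - need);
    f->len -= need;
    return (int)need;
}
```

A caller would push whatever a single `recv` returned and then call `framer_pop` in a loop until it returns 0, so a `CMD_CLOSE_STDIN` immediately followed by a split `CMD_KILL` is handled without dropping either command.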
Added
- AddressSanitizer + UBSan: opt-in build via `SANITIZE=1 make all` or `make asan`. New CI job (`sanitizers`) rebuilds the NIF and shepherd with `-fsanitize=address,undefined`, preloads `libasan`, and runs the full `mix test`. The publish job depends on it.
- Stale UDS socket sweep in `test/test_helper.exs` (before and after the suite): stops accumulation from test crashes before `cleanup_listener/2` runs.
- Regression tests for: NUL-byte validation in `cmd` and `args`, `Signal.resolve` range + type handling, `:owner` monitor SIGKILL path, stderr-only fast-exit stats, binary-with-NUL round-trip, and `NetRunner.run`/`NetRunner.stream` returning validation errors cleanly.
[1.0.0] - 2026-02-26
Initial release.
Core
- `NetRunner.run/2`: run a command and collect output as `{output, exit_status}`
- `NetRunner.stream!/2` / `NetRunner.stream/2`: lazy streaming I/O with backpressure
- `NetRunner.Process`: GenServer with full lifecycle control (`start/3`, `read/2`, `write/2`, `close_stdin/1`, `kill/2`, `await_exit/2`, `os_pid/1`, `alive?/1`)
Shepherd Binary (C)
- Persistent watchdog process that stays alive for the child's lifetime
- Detects BEAM death via UDS `POLLHUP`, which guarantees child cleanup even under `SIGKILL`
- FD passing via `SCM_RIGHTS` over Unix domain sockets
- `poll()` event loop with self-pipe trick for `SIGCHLD` handling
- Process group kills: `setpgid(0,0)` + `kill(-pgid, sig)` catches grandchildren
- Configurable SIGTERM → SIGKILL escalation timeout (`--kill-timeout`)
NIF I/O
- `enif_select` integration with BEAM's epoll/kqueue for async I/O
- All NIF functions on dirty IO schedulers
- Demand-driven backpressure via OS pipe buffers + `EAGAIN` + `enif_select`
- Resource-based FD management with destructor/stop/down callbacks
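The backpressure mechanism relies on ordinary non-blocking pipe semantics: once the pipe buffer is full, `write(2)` fails with `EAGAIN` and the writer has to wait until the reader drains some data. The sketch below shows only that kernel-level behaviour in plain C. The helper names `set_nonblocking` and `write_some` are hypothetical, and the point where a NIF would register with `enif_select` is marked in a comment rather than implemented.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL);
    if (fl < 0)
        return -1;
    return fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

/* Write as much as the kernel accepts right now.
 * Returns the number of bytes written (possibly less than len) and sets
 * *would_block to 1 if the write stopped on EAGAIN. Returns -1 on any
 * other error; a 0-byte result is treated as an error instead of being
 * retried forever. */
static ssize_t write_some(int fd, const char *buf, size_t len, int *would_block)
{
    size_t done = 0;

    *would_block = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n > 0) {
            done += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Pipe buffer is full. A NIF would ask the VM to notify it
             * when the fd becomes writable (e.g. via enif_select) and
             * park the caller here instead of spinning. */
            *would_block = 1;
            break;
        }
        return -1;
    }
    return (ssize_t)done;
}
```

Because the producer only resumes after the consumer has read, the OS pipe buffer itself bounds how much data is in flight.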
Zombie Prevention (3 layers)
- Shepherd: detects BEAM crash via UDS POLLHUP, kills child process group
- Watcher: detects GenServer crash via `Process.monitor`, kills child via NIF
- NIF resource destructor: closes FDs on GC, child sees broken pipe
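The first layer depends on the kernel reporting a hang-up on a Unix domain socket as soon as the peer's end is closed, which also happens when the peer process is killed. A minimal sketch of that check, assuming a connected `SOCK_STREAM` socket; the function name `peer_hung_up` is hypothetical and this is not the shepherd's code:

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the peer has closed its end, 0 if it is still connected
 * (or nothing happened within timeout_ms), and -1 on poll() failure. */
static int peer_hung_up(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char c;
    ssize_t n;
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    if (pfd.revents & (POLLHUP | POLLERR))
        return 1;

    /* Some platforms report peer close as plain readability: a peeked
     * read of 0 bytes then means end-of-file. */
    n = recv(fd, &c, 1, MSG_PEEK | MSG_DONTWAIT);
    return n == 0 ? 1 : 0;
}
```

On a hang-up, a watchdog in this position would go on to signal the child's process group; that step is omitted here.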
PTY Support
- `pty: true` option for pseudo-terminal emulation
- `openpty()` with `setsid()` + `TIOCSCTTY` for controlling terminal
- `set_window_size/3` via `ioctl(TIOCSWINSZ)`
- Single bidirectional master FD, duped for independent stdin/stdout NIF resources
- Platform support: `<util.h>` on macOS, `<pty.h>` on Linux
cgroup Support (Linux)
- `:cgroup_path` option for cgroup v2 resource isolation
- Creates cgroup directory, moves child to `cgroup.procs`
- Cleanup via `cgroup.kill` + `rmdir` on process exit
- No-op on macOS/BSD
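Building paths such as `<cgroup_path>/cgroup.procs` is one place where a checked `snprintf` matters, since a silently truncated path could point at the wrong file. A minimal sketch of that check, where the helper name `join_path` and the example directory are hypothetical:

```c
#include <stddef.h>
#include <stdio.h>

/* Build "<dir>/<leaf>" into out.
 * Returns 0 on success, -1 if formatting failed or the result (including
 * the terminating NUL) would not fit in out_size bytes. */
static int join_path(char *out, size_t out_size, const char *dir,
                     const char *leaf)
{
    int n = snprintf(out, out_size, "%s/%s", dir, leaf);

    /* snprintf returns the length it wanted to write, so n >= out_size
     * means the output was truncated. */
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return 0;
}
```

For example, `join_path(buf, sizeof buf, "/sys/fs/cgroup/job", "cgroup.procs")` needs 32 bytes including the terminator, so it succeeds with a 32-byte buffer and is rejected with a 31-byte one.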
Daemon Mode
- `NetRunner.Daemon`: supervised long-running process for supervision trees
- Auto-drains stdout/stderr to prevent pipe blocking
- Output handling: `:discard` (default), `:log`, or custom `fun/1` callback
- Graceful shutdown: SIGTERM → 5s wait → SIGKILL
Stats
- `NetRunner.Process.stats/1`: per-process I/O statistics
- Tracks: `bytes_in`, `bytes_out`, `bytes_err`, `read_count`, `write_count`, `duration_ms`, `exit_status`
- Zero-cost integer counters in GenServer state
Safety
- Timeout enforcement on `run/2` via `:timeout` option
- Output size limits via `:max_output_size` option
- Platform support: macOS (Darwin) and Linux