Crash-safe publish protocol

This is the contract between the writer pool, the meta server, and the disk tier. It exists because cache files must appear atomically to readers and must never be observed partially written, and because a writer that crashes mid-save must be cleaned up gracefully.

The five-stage protocol

All five stages run in the writer process:

  1. reserve_save(Key, Tier, Bytes)
     The meta server inserts a reservation row and returns a fresh
     ReservationToken.

  2. write_tmp(Path.tmp, Bytes)
     Streamed prim_file:write/2; fdatasync at the end.

  3. check_reservation(Key, ReservationToken)
     The meta server confirms the reservation is still live (no concurrent
     writer has superseded us).

  4. link(Path.tmp, Path)
     Atomic create-if-not-exists; the only durable publish step. EEXIST is
     validated and either adopted or replaced under the current reservation.

  5. mark_published(Key, ReservationToken)
     The meta server flips the reservation row to a published meta row, then
     announces to subscribers.
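
A minimal sketch of the whole writer-side flow. erllama_cache_meta is the assumed meta-server module, and tmp_path/2, final_path/1, write_tmp/2 and link_publish/3 are illustrative helpers, not the real API:

    publish(Key, Tier, Bytes) ->
        {ok, Token} = erllama_cache_meta:reserve_save(Key, Tier, byte_size(Bytes)),
        TmpPath = tmp_path(Key, Token),                         % Path.tmp.<pid>.<token>
        ok = write_tmp(TmpPath, Bytes),                         % stage 2: durable temp
        ok = erllama_cache_meta:check_reservation(Key, Token),  % stage 3
        ok = link_publish(TmpPath, final_path(Key), Key),       % stage 4
        ok = erllama_cache_meta:mark_published(Key, Token),     % stage 5
        file:delete(TmpPath),                                   % drop the extra temp name
        ok.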

Why each stage exists

Stage 1: reserve

Prevents two concurrent writers for the same key from both reaching stage 4 and racing on link(2). The reservation is a row in the meta ETS table keyed on Key, carrying the writer's pid and a monotonic token. Subsequent reserve_save calls for the same key during the window either block or fail fast, depending on policy.
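
A sketch of how stage 1 might look inside the meta server, assuming a named ETS set ?RESERVATIONS and a map-valued row; both are illustrative, not the real state shape:

    handle_call({reserve_save, Key, Tier, Bytes}, {Pid, _Tag}, State) ->
        Token = erlang:unique_integer([monotonic, positive]),
        Row = {Key, #{token => Token, pid => Pid, tier => Tier, bytes => Bytes,
                      stage => 1, t0 => erlang:monotonic_time(millisecond)}},
        case ets:insert_new(?RESERVATIONS, Row) of
            true  -> {reply, {ok, Token}, State};
            false -> {reply, {error, already_reserved}, State}  % fail-fast policy
        end.

ets:insert_new/2 is what makes the reservation race-free: only one caller can create the row. A blocking policy would instead park the caller in State and reply once the live reservation resolves.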

Stage 2: write_tmp

The temp file lives next to the final path (Path.tmp.<pid>.<token>) so link(2) is always intra-directory (hard links cannot cross file systems). We fdatasync the temp file before stage 3 so that stage 4's atomic publish actually publishes durable bytes; otherwise a crash between stage 4 and the file system's writeback could expose a zero-length file as the canonical row.
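
A sketch of stage 2 using the public file module (the real writer streams through prim_file:write/2, but the durability point is the same):

    write_tmp(TmpPath, Bytes) ->
        {ok, Fd} = file:open(TmpPath, [write, raw, binary]),
        ok = file:write(Fd, Bytes),
        ok = file:datasync(Fd),   % fdatasync: bytes durable before stage 4 links them
        ok = file:close(Fd).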

Stage 3: check_reservation

The window between stage 1 and stage 4 can be tens of milliseconds on slow disks. In that window:

  • The writer could have been killed and respawned by the supervisor. A second reservation for the same key may exist with a fresh token.
  • An eviction pass could have decided this key is dead.

check_reservation is the meta server confirming our token is still the live one before we publish. A negative result means "someone else is doing this; abandon and clean up".
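
On the meta-server side, stage 3 reduces to a token comparison; this sketch continues the row shape assumed in the stage 1 sketch:

    handle_call({check_reservation, Key, Token}, _From, State) ->
        case ets:lookup(?RESERVATIONS, Key) of
            [{Key, #{token := Token}}] -> {reply, ok, State};  % still the live writer
            _ -> {reply, {error, superseded}, State}           % abandon and clean up
        end.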

Stage 4: link

link(2) is the durable publish. It is atomic create-if-not-exists on POSIX file systems: either the file did not exist and we created the hard link, or the file existed and we got EEXIST. We never silently swallow EEXIST; we open the existing file and either:

  • Adopt it if the parsed header matches the key and size recorded under our reservation. This handles the case where a previous writer crashed after link but before mark_published; the file is good, just unannounced.
  • Replace it if it parses as a different key (collision, very rare) or fails to parse. We unlink and link our temp again.

Adopt-or-replace is the only path that stays correct under writer crashes and orphan files. Silently swallowing EEXIST is a footgun and is specifically forbidden in AGENTS.md.
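
A sketch of stage 4; file:make_link/2 is the atomic create-if-not-exists, and header_matches/2 is a hypothetical stand-in for the header parse (expected key and size) described above:

    link_publish(TmpPath, Path, Key) ->
        case file:make_link(TmpPath, Path) of
            ok ->
                ok;                                    % we published first
            {error, eexist} ->
                case header_matches(Path, Key) of
                    true  -> ok;                       % adopt a crashed writer's good file
                    false ->
                        ok = file:delete(Path),        % replace: unlink the junk ...
                        file:make_link(TmpPath, Path)  % ... and link our temp again
                end;
            {error, Reason} ->
                {error, Reason}
        end.

The unlink-then-relink in the replace arm is only safe because stage 1 serialises writers per key; without the live reservation, two replacers could race between the delete and the second link.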

Stage 5: mark_published

The reservation flips to a published meta row in a single gen_server call. Subscribers (multi-turn waiters from session_resume_wait_ms) are woken with gen_server:reply/2 to each parked caller. Reads now find the key on the index.
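
A sketch of stage 5, continuing the assumed state shape (?PUBLISHED as the index table, a waiters map in State holding parked From tuples):

    handle_call({mark_published, Key, Token}, _From, State = #{waiters := W}) ->
        [{Key, #{token := Token, tier := Tier, bytes := Bytes}}] =
            ets:lookup(?RESERVATIONS, Key),
        ets:delete(?RESERVATIONS, Key),
        ets:insert(?PUBLISHED, {Key, Tier, Bytes}),  % the published meta row
        [gen_server:reply(Waiter, {ok, Key})         % wake every parked caller
         || Waiter <- maps:get(Key, W, [])],
        {reply, ok, State#{waiters := maps:remove(Key, W)}}.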

Two-stage TTL cleanup

Reservations have a TTL. A writer that crashes between stages 1 and 5 leaves a stale reservation behind; the meta server reaps these in one of two ways, depending on how far the writer got:

  1. TTL elapsed, stage ≤ 3: drop the reservation, no file action needed (writer hadn't published yet).
  2. TTL elapsed, stage 4 reached: check disk. If the linked file exists and parses, adopt it under a fresh reservation. If it fails to parse, unlink and drop.

This is what makes orphan adoption work: a writer can crash arbitrarily late and we still recover the bytes it wrote.
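
A sketch of the reaper, assuming the writer reports each stage transition so the reservation row carries a stage field; parse_header/1, final_path/1 and adopt/2 are illustrative helpers:

    reap_expired(Now, Ttl) ->
        Stale = [{K, R} || {K, R = #{t0 := T0}} <- ets:tab2list(?RESERVATIONS),
                           Now - T0 > Ttl],
        lists:foreach(
          fun ({Key, #{stage := S}}) when S =< 3 ->
                  ets:delete(?RESERVATIONS, Key);     % nothing durable on disk yet
              ({Key, #{stage := 4}}) ->
                  Path = final_path(Key),
                  case parse_header(Path) of
                      {ok, Key} -> adopt(Key, Path);  % crashed after link: keep bytes
                      _ -> file:delete(Path),         % junk or missing: unlink ...
                           ets:delete(?RESERVATIONS, Key)  % ... and drop the row
                  end
          end, Stale).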

Read path: plain read I/O

The disk tier reads cache files via file:read_file/1 into a fresh BEAM heap binary. mmap was an option in earlier revisions but was removed: the process already mmaps the GGUF (multi-GB), and a region binary that survives the NIF call would expose the BEAM to SIGBUS from any external truncation. See internals/nif-safety.md for the fuller rationale.

Test surface

Each invariant has a dedicated case:

Invariant                                         Test
EEXIST adopted when header matches                erllama_cache_writer_tests:eexist_adopt/0
EEXIST replaced when header is junk               erllama_cache_writer_tests:eexist_replace/0
Stale reservation reaped, file adopted            erllama_cache_meta_SUITE:ttl_orphan_adopt/0
Stale reservation reaped, file unlinked           erllama_cache_meta_SUITE:ttl_orphan_drop/0
Concurrent writers: one wins, others abandon      prop_cache_publish:prop_publish_serialises/0
Multi-turn parent_key waits for in-flight finish  erllama_cache_tests:resume_waits_for_finish/0

If you change any stage of the protocol, walk a reviewer through the impact on these tests before landing.