Ferricstore.Store.BitcaskCheckpointer (ferricstore v0.4.1)


Per-shard background fsync for Bitcask data files.

Replaces the per-apply v2_fsync in StateMachine.flush_pending_writes and the old shard-level fsync_needed deferred fsync timer. One shared mechanism, one shared flag (atomics on the Instance), covering all write paths (WARaft state machine + direct test-instance Bitcask writes).

Correctness

The WARaft segment log is the source of truth for client-visible durability. Writes hit Bitcask data files via v2_append_batch_nosync (page cache only). On a crash, the WARaft log replays any post-checkpoint entries and rebuilds the Bitcask state exactly — no acknowledged data is lost.

The checkpointer's job is to move data from page cache to disk on a predictable cadence, which bounds replay time after a kernel panic.
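
The replay guarantee above can be illustrated with a toy model. This is only a sketch under stated assumptions: the store is modelled as a plain map and the log as a list of `{index, key, value}` tuples. `ReplaySketch` and its argument names are hypothetical and are not part of FerricStore or WARaft.

```elixir
defmodule ReplaySketch do
  # Toy model, not the real recovery path.
  # store_at_checkpoint: state that reached disk before the crash
  # checkpoint_index:    highest log index already reflected in that state
  # log:                 list of {index, key, value} entries
  def replay(store_at_checkpoint, checkpoint_index, log) do
    log
    # Entries at or below the checkpoint are already on disk.
    |> Enum.filter(fn {index, _key, _value} -> index > checkpoint_index end)
    # Apply the remaining entries in log order.
    |> Enum.sort_by(fn {index, _key, _value} -> index end)
    |> Enum.reduce(store_at_checkpoint, fn {_index, key, value}, acc ->
      Map.put(acc, key, value)
    end)
  end
end
```

The longer the gap between checkpoints, the more entries pass the filter, which is why the checkpoint cadence bounds replay time rather than durability.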

Algorithm

every checkpoint_interval_ms:
  if :atomics.get(checkpoint_flags, idx+1) == 1:
    {_fid, active_path, _sp} = ActiveFile.get(idx)
    :atomics.put(checkpoint_in_flight, idx+1, 1)
    :atomics.put(checkpoint_flags, idx+1, 0)
    NIF.v2_fsync_async(self(), corr_id, active_path)
  else: skip (idle shard, no syscalls)

The in-flight marker is set before clearing the dirty flag. That avoids a false-clean window where Raft could release log entries while Bitcask bytes are still only in page cache. A writer that arrives during fsync re-sets the dirty flag, so the next tick picks it up. The current fsync may miss bytes from that concurrent write, which is fine because the WARaft segment log is authoritative.

On fsync error (disk full, I/O error), we re-set the flag so the next tick retries, and raise DiskPressure to shed writes.
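
The flag ordering and the retry-on-error path can be sketched in a few lines of Elixir. This is a simplified, synchronous illustration and not the module's actual code. `CheckpointSketch` and `fsync_fun` are hypothetical stand-ins; the real checkpointer issues the fsync asynchronously through the NIF and also raises DiskPressure, which is omitted here.

```elixir
defmodule CheckpointSketch do
  # Hypothetical stand-in for the checkpointer tick; not the FerricStore code.

  # One slot per shard. :atomics indices are 1-based, hence idx + 1 below.
  def new(shard_count) do
    %{
      flags: :atomics.new(shard_count, []),
      in_flight: :atomics.new(shard_count, [])
    }
  end

  # Writers call this after appending to the page cache.
  def mark_dirty(%{flags: flags}, idx), do: :atomics.put(flags, idx + 1, 1)

  def dirty?(%{flags: flags}, idx), do: :atomics.get(flags, idx + 1) == 1

  # One tick for one shard. fsync_fun stands in for the fsync call and
  # must return :ok or {:error, reason}.
  def tick(%{flags: flags, in_flight: in_flight}, idx, fsync_fun) do
    if :atomics.get(flags, idx + 1) == 1 do
      # Set in-flight before clearing dirty, so the shard never looks
      # clean while bytes are still only in page cache.
      :atomics.put(in_flight, idx + 1, 1)
      :atomics.put(flags, idx + 1, 0)

      result = fsync_fun.()

      # On failure, re-set the dirty flag first so the next tick retries.
      if result != :ok, do: :atomics.put(flags, idx + 1, 1)
      :atomics.put(in_flight, idx + 1, 0)
      result
    else
      # Idle shard: no syscalls.
      :skipped
    end
  end
end
```

In this sketch a failed fsync leaves the shard dirty, so a later call to `tick/3` attempts the fsync again, mirroring the retry behaviour described above.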

Configuration

  • :checkpoint_interval_ms (default 10_000 = 10s) — how often to check the flag. The WARaft segment log is fdatasync'd per batch and is the source of truth for acknowledged writes, so a large interval is safe: on kernel panic we replay up to one interval's worth of WARaft log entries and rebuild Bitcask exactly. Short intervals mean more fsync syscalls per shard for no durability gain.
  • :checkpoint_idle_ms (default 250ms) — if writes are still moving when a tick fires, defer the active-file fsync until the shard has been idle for this long.
  • :checkpoint_max_delay_ms (default 180_000ms) — force fsync after this much dirty time even under continuous writes, bounding replay.
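
One way to read the interaction between the idle and max-delay settings is as a decision made on each tick for a dirty shard. The helper below is a hypothetical sketch, not the module's implementation: `CheckpointTiming`, `decide/4`, the timestamp arguments, and the choice of `>=` comparisons are all assumptions. Only the option names and defaults come from the list above.

```elixir
defmodule CheckpointTiming do
  # Hypothetical helper showing how the settings could combine.
  # All times are milliseconds on the same monotonic clock.
  @default_idle_ms 250
  @default_max_delay_ms 180_000

  # now:           time of the current tick
  # dirty_since:   when the shard first became dirty after the last fsync
  # last_write_at: when the most recent write landed
  def decide(now, dirty_since, last_write_at, opts \\ []) do
    idle_ms = Keyword.get(opts, :checkpoint_idle_ms, @default_idle_ms)
    max_delay_ms = Keyword.get(opts, :checkpoint_max_delay_ms, @default_max_delay_ms)

    cond do
      # Dirty for too long: fsync even though writes are still arriving.
      now - dirty_since >= max_delay_ms -> :fsync_forced
      # Writes have paused long enough: fsync now.
      now - last_write_at >= idle_ms -> :fsync
      # Writes are still moving: wait for a later tick.
      true -> :defer
    end
  end
end
```

For example, `CheckpointTiming.decide(10_000, 0, 9_900)` defers because the last write was only 100 ms ago, while `CheckpointTiming.decide(200_000, 0, 199_950)` forces the fsync because the shard has been dirty for longer than the 180_000 ms cap.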

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

process_name(index, arg2)

@spec process_name(non_neg_integer(), map() | nil) :: atom()

Canonical process name for the checkpointer of a given shard.

start_link(opts)

@spec start_link(keyword()) :: GenServer.on_start()

sync_now(server)

@spec sync_now(pid() | atom()) :: :ok | {:error, term()}

Forces a synchronous fsync of the shard's active file right now. Used by graceful shutdown (see design doc §shutdown ordering) and by tests. Bypasses the async path and clears the dirty flag on success.
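
A graceful-shutdown hook might call `sync_now/1` once per shard. The sketch below assumes shards are numbered from 0 and that passing `nil` as the second argument of `process_name/2` is acceptable, as its spec allows. `ShutdownSketch`, `flush_all/2`, and `shard_count` are hypothetical names, and the sync function is injectable so the loop can be exercised without a running store.

```elixir
defmodule ShutdownSketch do
  # Hypothetical helper; only sync_now/1 and process_name/2 are taken
  # from the documentation above.
  alias Ferricstore.Store.BitcaskCheckpointer

  # Syncs every shard and collects failures instead of stopping at the
  # first one, so a single bad disk does not skip the remaining shards.
  def flush_all(shard_count, sync_fun \\ &default_sync/1) do
    errors =
      for idx <- 0..(shard_count - 1)//1,
          {:error, reason} <- [sync_fun.(idx)],
          do: {idx, reason}

    if errors == [], do: :ok, else: {:error, errors}
  end

  # Assumption: nil is a valid second argument for the default instance.
  defp default_sync(idx) do
    idx
    |> BitcaskCheckpointer.process_name(nil)
    |> BitcaskCheckpointer.sync_now()
  end
end
```

Collecting errors rather than halting is a design choice of this sketch; a caller that prefers to abort shutdown on the first failure could pattern-match on the returned list.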