Ferricstore.Cluster.DataSync (ferricstore v0.3.6)


Shard-by-shard data directory copy for new node sync.

Provides WAL gap detection to avoid unnecessary full copies, per-shard sync status tracking, leader-aware copy source resolution, and automatic retry with partial cleanup on failure.

Summary

Functions

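needs_resync?(shard_index, target_node, leader_node)

read_last_applied_from_disk(data_dir, shard_index)

Reads the last applied Raft index from a ra meta.dets file on disk.

retry_sync_shard(shard_index, target_node, ctx, max_retries \\ 3)

Retries sync_shard/3 up to max_retries times, cleaning up partial data on the target node between attempts.

sync_all_shards(target_node, ctx)

Copies all shards sequentially, tracking per-shard sync status.

sync_shard(shard_index, target_node, ctx)

Syncs a single shard's data to a target node.

wal_bridgeable?(target_index, leader_first_index)

Pure WAL gap check: given the target's last applied index and the leader's first available WAL index, determines if WAL replay can bridge the gap.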

Functions

needs_resync?(shard_index, target_node, leader_node)

@spec needs_resync?(non_neg_integer(), node(), node()) ::
  :wal_bridgeable | :needs_resync

read_last_applied_from_disk(data_dir, shard_index)

@spec read_last_applied_from_disk(binary(), non_neg_integer()) :: non_neg_integer()

Reads the last applied Raft index from a ra meta.dets file on disk.

Used when a node boots from a disk clone (for example, an EBS snapshot). The Raft index that the cloned data is consistent up to must be known before ra starts, because the cloned WAL carries the wrong node IDs and has to be deleted.

Returns the last_applied index, or 0 if the file doesn't exist or can't be read.
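The "index or 0" fallback can be illustrated with a minimal sketch. It assumes a hypothetical dets layout holding a single `{:last_applied, index}` object; the real ra meta.dets schema is not described here and may differ. `LastAppliedSketch` is an illustrative name, not part of Ferricstore.

```elixir
defmodule LastAppliedSketch do
  # Hypothetical layout: one {:last_applied, index} object in a dets file.
  # The actual ra meta.dets schema may store this differently.
  def read(path) when is_binary(path) do
    table = {:last_applied_sketch, make_ref()}

    # Open read-only so a missing file is an error instead of being created.
    case :dets.open_file(table, file: String.to_charlist(path), access: :read) do
      {:ok, ^table} ->
        try do
          case :dets.lookup(table, :last_applied) do
            [{:last_applied, index}] when is_integer(index) and index >= 0 -> index
            # Missing key, unexpected shape, or lookup error: fall back to 0.
            _other -> 0
          end
        after
          :dets.close(table)
        end

      {:error, _reason} ->
        # File does not exist or cannot be read.
        0
    end
  end
end
```

A caller can treat a return value of 0 as "nothing usable on disk" and schedule a full sync for that shard.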

retry_sync_shard(shard_index, target_node, ctx, max_retries \\ 3)

@spec retry_sync_shard(
  non_neg_integer(),
  node(),
  FerricStore.Instance.t(),
  non_neg_integer()
) ::
  {:ok, :wal_bridgeable | non_neg_integer()} | {:error, term()}

Retries sync_shard/3 up to max_retries times, cleaning up partial data on the target node between attempts.

Parameters

  • shard_index -- zero-based shard index
  • target_node -- the node to sync data to
  • ctx -- the FerricStore instance context
  • max_retries -- maximum number of attempts (default: 3)
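The retry-with-cleanup pattern described above can be sketched roughly as follows. `RetrySketch`, `sync_fun`, and `cleanup_fun` are hypothetical stand-ins for sync_shard/3 and the target-side cleanup. The sketch also assumes that cleanup runs only between attempts (not after the last one) and that zero attempts yields `{:error, :no_attempts}`; the module may behave differently on both points.

```elixir
defmodule RetrySketch do
  # sync_fun: zero-arity function returning {:ok, detail} | {:error, reason}.
  # cleanup_fun: zero-arity function that removes partial data on the target.
  def retry(sync_fun, cleanup_fun, max_retries \\ 3)
      when is_function(sync_fun, 0) and is_function(cleanup_fun, 0) and
             is_integer(max_retries) and max_retries >= 0 do
    attempt(sync_fun, cleanup_fun, max_retries, {:error, :no_attempts})
  end

  # No attempts left: report the most recent error.
  defp attempt(_sync_fun, _cleanup_fun, 0, last_error), do: last_error

  defp attempt(sync_fun, cleanup_fun, remaining, _last_error) do
    case sync_fun.() do
      {:ok, _detail} = ok ->
        ok

      {:error, _reason} = error ->
        # Clean up only when another attempt will follow (assumption).
        if remaining > 1, do: cleanup_fun.()
        attempt(sync_fun, cleanup_fun, remaining - 1, error)
    end
  end
end
```

Passing the sync and cleanup steps as functions keeps the retry policy testable without a running cluster.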

sync_all_shards(target_node, ctx)

@spec sync_all_shards(node(), FerricStore.Instance.t()) ::
  {:ok, map()} | {:error, term()}

Copies all shards sequentially, tracking per-shard sync status.

Returns {:ok, results} when every shard succeeds, where results is a map of shard_index => {:synced, detail}. On partial failure returns {:error, {:partial_sync, results}} with per-shard success/failure info.

Parameters

  • target_node -- the node to sync data to
  • ctx -- the FerricStore instance context
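As a rough illustration of the return shape, the sketch below aggregates per-shard results into `{:ok, results}` or `{:error, {:partial_sync, results}}`. `SyncAllSketch` and `sync_fun` are hypothetical. The `{:failed, reason}` entry for a failed shard and the choice to keep going after a failure are assumptions, since the documentation only specifies the `{:synced, detail}` shape.

```elixir
defmodule SyncAllSketch do
  # sync_fun: one-arity function, shard_index -> {:ok, detail} | {:error, reason}.
  def sync_all(shard_count, sync_fun)
      when is_integer(shard_count) and shard_count > 0 and is_function(sync_fun, 1) do
    # Map.new/2 walks the range in order, so shards are processed one at a time.
    results =
      Map.new(0..(shard_count - 1), fn shard_index ->
        case sync_fun.(shard_index) do
          {:ok, detail} -> {shard_index, {:synced, detail}}
          # Hypothetical failure entry; the real shape is not documented.
          {:error, reason} -> {shard_index, {:failed, reason}}
        end
      end)

    all_synced? =
      Enum.all?(results, fn {_shard_index, status} -> match?({:synced, _}, status) end)

    if all_synced?, do: {:ok, results}, else: {:error, {:partial_sync, results}}
  end
end
```

A caller can pattern match on `{:error, {:partial_sync, results}}` and retry only the shards whose entry is not `{:synced, _}`.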

sync_shard(shard_index, target_node, ctx)

@spec sync_shard(non_neg_integer(), node(), FerricStore.Instance.t()) ::
  {:ok, :wal_bridgeable | non_neg_integer()} | {:error, term()}

Syncs a single shard's data to a target node.

Resolves the current leader for the shard and copies data FROM the leader (not from the local node). Before copying, checks whether the target can catch up via WAL replay alone -- if so, the expensive data copy is skipped.

  1. Find leader for the shard
  2. Check WAL bridgeability
  3. If resync needed: pause writes, copy data + ra dir, resume writes
  4. Return {:ok, detail} with :wal_bridgeable or the Raft index at copy time

Parameters

  • shard_index -- zero-based shard index
  • target_node -- the node to sync data to
  • ctx -- the FerricStore instance context
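The four steps above can be sketched as a small control-flow skeleton. Everything in `deps` (`find_leader`, `check_gap`, `pause_writes`, `copy`, `resume_writes`) is a hypothetical callback supplied by the caller, not a Ferricstore API. Resuming writes in an `after` block, even when the copy fails, is an assumption about the intended behavior.

```elixir
defmodule SyncShardSketch do
  # deps is a map of hypothetical callbacks standing in for the real cluster calls.
  def sync(shard_index, target_node, deps) do
    # Step 1: find the leader; an {:error, reason} falls through unchanged.
    with {:ok, leader} <- deps.find_leader.(shard_index) do
      # Step 2: check whether WAL replay alone can catch the target up.
      case deps.check_gap.(shard_index, target_node, leader) do
        :wal_bridgeable ->
          {:ok, :wal_bridgeable}

        :needs_resync ->
          # Step 3: pause writes, copy from the leader, always resume writes.
          :ok = deps.pause_writes.(shard_index)

          try do
            # Step 4: copy returns {:ok, raft_index} | {:error, reason}.
            deps.copy.(shard_index, leader, target_node)
          after
            deps.resume_writes.(shard_index)
          end
      end
    end
  end
end
```

Keeping the pause window limited to the copy step means the WAL-bridgeable path never blocks writers.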

wal_bridgeable?(target_index, leader_first_index)

@spec wal_bridgeable?(non_neg_integer(), non_neg_integer()) ::
  :wal_bridgeable | :needs_resync

Pure WAL gap check: given the target's last applied index and the leader's first available WAL index, determines if WAL replay can bridge the gap.
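A minimal sketch of such a pure check follows. The boundary condition is an assumption: the gap is treated as bridgeable when the leader still holds the entry immediately after the target's last applied index, that is, when `leader_first_index <= target_index + 1`. The actual function may use a different comparison. `WalGapSketch` is an illustrative name.

```elixir
defmodule WalGapSketch do
  # Assumed rule: replay works if the leader's WAL still contains
  # entry target_index + 1 (the first entry the target is missing).
  def check(target_index, leader_first_index)
      when is_integer(target_index) and target_index >= 0 and
             is_integer(leader_first_index) and leader_first_index >= 0 do
    if leader_first_index <= target_index + 1 do
      :wal_bridgeable
    else
      :needs_resync
    end
  end
end
```

For example, under this rule a target at index 100 can replay from a leader whose WAL starts at 90 or 101, but not from one whose WAL starts at 102, because entry 101 has already been truncated.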