phi_accrual_udp

Dedicated UDP socket source for phi_accrual. Escapes BEAM distribution head-of-line blocking that affects the bundled PhiAccrual.Source.DistributionPing reference source.

⚠️ Alpha — v0.1.x. Public API and wire format may change before v1.0 based on real-deployment feedback. The packet format is deliberately conservative (magic + version + flags) to enable future evolution without breaking on-the-wire compatibility.

Why a separate package

The core phi_accrual library is intentionally transport-agnostic. Heartbeat transports live in their own packages so consumers can mix and match — UDP for decision-grade detection, BEAM distribution for observability-grade, custom transports for application-specific signals. See the phi_accrual roadmap for the ecosystem rationale.

Quick start

# mix.exs
def deps do
  [
    {:phi_accrual, "~> 1.0"},
    {:phi_accrual_udp, "~> 0.1"}
  ]
end

In your supervision tree:

children = [
  {PhiAccrualUdp.Listener, port: 4370},
  {PhiAccrualUdp.Sender,
    targets: [{{10, 0, 0, 2}, 4370}, {{10, 0, 0, 3}, 4370}],
    interval_ms: 1_000}
]

Wire format (v1, 12 bytes fixed)

<<magic::16, version::8, flags::8, timestamp::64-unsigned>>

magic     = 0xCEA6   (identifies a phi_accrual UDP heartbeat)
version   = 0x01     (this format)
flags     = 0x00     (reserved, must be zero in v1)
timestamp = u64 ms   (sender's choice of clock; diagnostic only)

The receiver does not use the packet timestamp for the EWMA — it uses local monotonic receipt time, preserving phi_accrual's clock discipline. The packet timestamp is diagnostic-only (e.g., one-way delay computation when NTP-synced).
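The layout above maps directly onto Elixir binary pattern matching. The following is a minimal sketch of an encoder and decoder for the v1 format, not the package's actual Packet module: the module name `WireSketch`, the function names, and the order in which the checks run are assumptions made for illustration. Only the field layout and the error reasons come from this document.

```elixir
defmodule WireSketch do
  # Hypothetical module for illustration; not part of phi_accrual_udp.
  @magic 0xCEA6
  @version 0x01

  # Build a 12-byte v1 heartbeat. Flags are reserved and always zero in v1.
  def encode(timestamp_ms) when timestamp_ms in 0..0xFFFFFFFFFFFFFFFF do
    <<@magic::16, @version::8, 0::8, timestamp_ms::64-unsigned>>
  end

  # Clauses are tried top to bottom, so each one only sees packets that
  # failed the stricter match above it.
  def decode(<<@magic::16, @version::8, 0::8, ts::64-unsigned>>), do: {:ok, ts}
  def decode(<<@magic::16, @version::8, _flags::8, _::64>>), do: {:error, :reserved_flags_set}
  def decode(<<@magic::16, _version::8, _flags::8, _::64>>), do: {:error, :unsupported_version}
  def decode(<<_::binary-size(12)>>), do: {:error, :bad_magic}
  def decode(bin) when is_binary(bin), do: {:error, :wrong_size}
end
```

Because the format is fixed-size, a size mismatch can be rejected without inspecting any field, and a future version can change everything after the first three bytes while still being recognisable to a v1 receiver.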

Telemetry

[:phi_accrual_udp, :listener, :started]
  metadata: %{port}

[:phi_accrual_udp, :listener, :passive]
  measurements: %{}
  metadata:     %{port}
  # emitted on each :udp_passive re-arm; observe ingress saturation

[:phi_accrual_udp, :sample, :received]
  measurements: %{packet_timestamp_ms}
  metadata:     %{node, peer}

[:phi_accrual_udp, :decode, :error]
  measurements: %{packet_size}
  metadata:     %{reason, peer}
  # reason ∈ [:wrong_size, :bad_magic, :unsupported_version, :reserved_flags_set]

[:phi_accrual_udp, :sender, :started]
  metadata: %{interval_ms, target_count}

[:phi_accrual_udp, :sender, :tick]
  measurements: %{sent, errors}
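
These events can be consumed with the standard `:telemetry.attach_many/4` call. The sketch below logs decode errors and sender ticks that report failures. The module name `HeartbeatTelemetrySketch`, the handler id, and the message wording are hypothetical; the event names and the measurement and metadata keys are the ones listed above. It assumes the `:telemetry` application is available in your project.

```elixir
defmodule HeartbeatTelemetrySketch do
  # Hypothetical handler module for illustration; not part of phi_accrual_udp.
  require Logger

  @events [
    [:phi_accrual_udp, :decode, :error],
    [:phi_accrual_udp, :sender, :tick]
  ]

  # Call once at application start, before the Listener and Sender boot.
  def attach do
    :telemetry.attach_many("heartbeat-telemetry-sketch", @events, &__MODULE__.handle_event/4, nil)
  end

  # Telemetry handlers receive the event name, measurements, metadata, and config.
  def handle_event(event, measurements, metadata, _config) do
    case describe(event, measurements, metadata) do
      nil -> :ok
      message -> Logger.warning(message)
    end
  end

  # Pure formatting, kept separate so it can be tested without telemetry.
  def describe([:phi_accrual_udp, :decode, :error], %{packet_size: size}, %{reason: reason, peer: peer}) do
    "dropped #{size}-byte packet from #{inspect(peer)}: #{inspect(reason)}"
  end

  def describe([:phi_accrual_udp, :sender, :tick], %{sent: sent, errors: errors}, _metadata)
      when errors > 0 do
    "sender tick: #{sent} sent, #{errors} failed"
  end

  # Healthy ticks and any other event produce no log line.
  def describe(_event, _measurements, _metadata), do: nil
end
```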

Security

UDP is unauthenticated. Anyone who can reach the listener port can send well-formed packets that pass Packet.decode/1 and corrupt detection. In hostile networks: bind to a private interface, firewall the port, or filter peers with a node_resolver that rejects unknown sources. Source addresses can be spoofed, so a resolver allowlist narrows exposure but does not authenticate the sender.

Operational considerations

Node identity and Sender lifecycle

The default node_resolver returns {ip, port} of the packet's source. Combined with the bundled PhiAccrualUdp.Sender — which opens its socket on an ephemeral source port — this means:

  • Every Sender restart produces a new {ip, port} tuple.
  • The Listener treats the restarted Sender as a brand new peer.
  • The previous peer's estimator goes :stale (false positive on a peer that's actually fine).
  • The new peer's estimator restarts cold and spends 8 samples in :insufficient_data before φ is reported.
  • Estimator state proliferates over time as Senders cycle.

The same applies under NAT session timeout (UDP NAT sessions typically expire in 30–180s; 1s heartbeats keep them warm but a brief outage can recycle them) and under container restarts that change IP.

For production deployments, supply a :node_resolver that maps {ip, port} to a stable application-level identifier — node name, hostname, partner ID, whatever your topology provides:

resolver = fn
  {10, 0, 0, 1}, _ -> :node_a
  {10, 0, 0, 2}, _ -> :node_b
  ip, port ->
    # Reject unknown peers — also a useful security boundary.
    {:reject, {ip, port}}
end

{PhiAccrualUdp.Listener, port: 4370, node_resolver: resolver}

The default {ip, port} resolver is appropriate for development, demos, and deployments where you control the full Sender lifecycle and accept that restart = new peer.
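
When the peer list lives in configuration rather than in code, the resolver can be built from a map. This is a minimal sketch, assuming the resolver contract shown above: a two-argument function that returns either an identifier or `{:reject, {ip, port}}`. `ResolverSketch` and `from_map/1` are hypothetical names, not part of the package.

```elixir
defmodule ResolverSketch do
  # Hypothetical helper for illustration; not part of phi_accrual_udp.

  # Takes a map of IP tuple => stable identifier and returns a resolver
  # function. The source port is ignored for known peers, so a Sender
  # restart on a new ephemeral port keeps the same identity.
  def from_map(peers) when is_map(peers) do
    fn ip, port ->
      case Map.fetch(peers, ip) do
        {:ok, id} -> id
        :error -> {:reject, {ip, port}}
      end
    end
  end
end

resolver =
  ResolverSketch.from_map(%{
    {10, 0, 0, 1} => :node_a,
    {10, 0, 0, 2} => :node_b
  })
```

The resulting function can be passed as the `:node_resolver` option in the same way as the hand-written one.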

DNS resolution in Sender

PhiAccrualUdp.Sender resolves hostname targets on every tick via :gen_udp.send/4. This is deliberate: rolling DNS changes (cluster reconfig, container replacement) propagate without a Sender restart.

The cost is one resolver lookup per target per interval: at 50 targets and a 1-second interval, 50 lookups/sec. The OS resolver typically caches, so nearly all of them are answered locally and the overhead is negligible in normal operation.

The risk: if the resolver is slow or unreachable, every tick can stall in :gen_udp.send/4. The Sender is a single GenServer, so a slow lookup blocks all targets for that tick. Symptoms: [:phi_accrual_udp, :sender, :tick] telemetry shows degraded sent counts; receivers see heartbeat gaps and elevated φ.

For deployments where DNS reliability is uncertain, prefer pre-resolved IP tuples in the :targets list:

{PhiAccrualUdp.Sender,
  targets: [{{10, 0, 0, 2}, 4370}, {{10, 0, 0, 3}, 4370}],
  interval_ms: 1_000}

IP tuples skip the resolver entirely. The trade-off: you lose dynamic DNS updates and must restart the Sender to pick up topology changes.
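
One way to get IP tuples while still writing hostnames in configuration is to resolve them once at boot with OTP's `:inet.getaddr/2`. The sketch below is an illustration under stated assumptions: `TargetsSketch` and `pre_resolve/1` are hypothetical names, hostnames are assumed to be given as strings or charlists, and only IPv4 (`:inet`) is looked up.

```elixir
defmodule TargetsSketch do
  # Hypothetical helper for illustration; not part of phi_accrual_udp.

  # Resolves every {host, port} target once. Returns {resolved, failed},
  # where resolved is a list of {ip_tuple, port} in the original order and
  # failed is a list of {host, reason} for lookups that did not succeed.
  def pre_resolve(targets) do
    {ok, failed} =
      Enum.reduce(targets, {[], []}, fn {host, port}, {ok, failed} ->
        case resolve(host) do
          {:ok, ip} -> {[{ip, port} | ok], failed}
          {:error, reason} -> {ok, [{host, reason} | failed]}
        end
      end)

    {Enum.reverse(ok), Enum.reverse(failed)}
  end

  # IP tuples pass through untouched; names go through the resolver once.
  defp resolve(ip) when is_tuple(ip), do: {:ok, ip}
  defp resolve(host) when is_binary(host), do: :inet.getaddr(String.to_charlist(host), :inet)
  defp resolve(host) when is_list(host), do: :inet.getaddr(host, :inet)
end
```

The resolved list can then be passed as `:targets`. Deciding what to do with the failed list (refuse to boot, retry later, or start with a partial set) is left to the caller.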

License

Apache-2.0.