This page keeps only the latest public benchmark summaries. Raw benchmark logs and one-off profiling runs are intentionally not committed.

Use these numbers as reproducible reference points, not universal hardware claims. Throughput and latency depend on VM type, local NVMe availability, shard count, client concurrency, pipeline depth, payload size, and resource guards.

FerricFlow: latest Azure runs

Workload shape:

1,000,000 flows
single FerricStore server VM
single Python SDK client VM
Flow queue/workflow workers
live mode: create and process run together
preloaded mode: create first, then process

On the 16-vCPU server, the best-balanced live results came from 32 Flow shards:

| Mode | API shape | Server shards | Create rate | Process/complete rate | End-to-end rate |
| --- | --- | --- | --- | --- | --- |
| Sync live | Queue worker | 32 | - | - | 53,790 flows/s |
| Sync live | Workflow worker | 32 | - | - | 54,060 workflows/s |
| Async live | Queue worker | 32 | 95,896 flows/s | - | 45,608 flows/s |
| Async live | Workflow worker | 32 | 97,196 workflows/s | - | 47,888 workflows/s |

Preloaded 16-vCPU runs:

| API shape | Server shards | Create rate | Process/complete rate | End-to-end rate |
| --- | --- | --- | --- | --- |
| Queue worker | 32 | 99,645 flows/s | 125,161 flows/s | 55,478 flows/s |
| Workflow worker | 32 | 99,892 workflows/s | 101,080 workflows/s | 50,241 workflows/s |
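
The preloaded end-to-end rates are consistent with the two phases running back to back: creating N flows takes N / create_rate seconds, processing takes N / process_rate seconds, and the combined rate is N divided by the sum. The sketch below checks that against the table. The sequential-phase model and the function name are assumptions made for illustration, not a description of how the benchmark scripts compute the figure.

```python
def sequential_end_to_end_rate(create_rate: float, process_rate: float) -> float:
    """Combined rate when the create and process phases run one after the other.

    Assumption: total time = create time + process time, with no overlap.
    """
    return 1.0 / (1.0 / create_rate + 1.0 / process_rate)


# Rates per second, taken from the preloaded 16-vCPU table.
queue = sequential_end_to_end_rate(99_645, 125_161)
workflow = sequential_end_to_end_rate(99_892, 101_080)

print(f"queue:    {queue:,.1f} flows/s (reported 55,478)")
print(f"workflow: {workflow:,.1f} workflows/s (reported 50,241)")
```

Under that model both computed values fall within one item per second of the reported figures, which explains why the preloaded end-to-end rate sits well below either phase rate.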

Server CPU scale

These runs used default server behavior and live 1M-flow workloads. All rates are end-to-end; the 16-vCPU row matches the 16-shard row of the shard sweep below.

| Server size | Sync queue | Sync workflow | Async queue | Async workflow |
| --- | --- | --- | --- | --- |
| 4 vCPU | 15,854/s | 16,005/s | failed under write timeout | failed under write timeout |
| 8 vCPU | 30,113/s | 27,674/s | 23,882/s | 24,712/s |
| 16 vCPU | 46,964/s | 45,375/s | 41,131/s | 41,121/s |
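
One way to read the CPU-scale table is throughput per vCPU relative to the smallest server. The sketch below does this for the sync queue column only; the helper name and the choice of the 4-vCPU row as the baseline are illustrative assumptions.

```python
# Sync queue end-to-end rates (flows/s) from the CPU-scale table, keyed by vCPU count.
sync_queue = {4: 15_854, 8: 30_113, 16: 46_964}


def scaling_efficiency(rates: dict[int, float]) -> dict[int, float]:
    """Per-vCPU throughput relative to the smallest server size in `rates`."""
    base_cpus = min(rates)
    base_per_cpu = rates[base_cpus] / base_cpus
    return {cpus: (rate / cpus) / base_per_cpu for cpus, rate in rates.items()}


efficiency = scaling_efficiency(sync_queue)
for cpus in sorted(sync_queue):
    per_cpu = sync_queue[cpus] / cpus
    print(f"{cpus:>2} vCPU: {per_cpu:,.0f} flows/s per vCPU, "
          f"efficiency {efficiency[cpus]:.2f}")
```

With these figures, per-vCPU throughput is roughly 95% of the 4-vCPU baseline at 8 vCPU and roughly 74% at 16 vCPU. The 16-vCPU figure reflects default server behavior, so it does not include the gain from raising the shard count.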

16-vCPU shard sweep

Sync live runs:

| Server shards | Queue end-to-end | Workflow end-to-end |
| --- | --- | --- |
| 16 | 46,964/s | 45,375/s |
| 24 | 51,644/s | 51,977/s |
| 32 | 53,790/s | 54,060/s |
| 64 | 54,287/s | 53,736/s |

Async live runs:

| Server shards | Queue create | Queue end-to-end | Workflow create | Workflow end-to-end |
| --- | --- | --- | --- | --- |
| 16 | 86,892/s | 41,131/s | 90,504/s | 41,121/s |
| 32 | 95,896/s | 45,608/s | 97,196/s | 47,888/s |
| 64 | 96,219/s | 43,997/s | 95,195/s | 45,137/s |

Interpretation: 32 shards was the best-balanced setting in these Azure runs. Going to 64 shards raised sync queue throughput by about 0.9% (54,287/s vs 53,790/s) but lowered sync workflow throughput and both async end-to-end rates.
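
"Best balanced" can be made concrete in more than one way. The sketch below uses one possible definition, which is an assumption and not the authors' stated method: normalize each end-to-end column to its best value across shard counts, then take the geometric mean per shard count. Only shard counts present in both sweeps are included.

```python
import math

# End-to-end rates per second from the sync and async sweeps, in the order
# (sync queue, sync workflow, async queue, async workflow).
end_to_end = {
    16: (46_964, 45_375, 41_131, 41_121),
    32: (53_790, 54_060, 45_608, 47_888),
    64: (54_287, 53_736, 43_997, 45_137),
}


def balanced_scores(results: dict[int, tuple[float, ...]]) -> dict[int, float]:
    """Geometric mean of each rate divided by the best rate in its column."""
    best = [max(column) for column in zip(*results.values())]
    return {
        shards: math.prod(r / b for r, b in zip(rates, best)) ** (1 / len(rates))
        for shards, rates in results.items()
    }


scores = balanced_scores(end_to_end)
best_shards = max(scores, key=scores.get)
print(best_shards, {shards: round(score, 3) for shards, score in scores.items()})
```

Under this scoring, 32 shards ranks first, ahead of 64 and then 16, which agrees with the interpretation above. A different weighting of the four workloads could change the ranking.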

KV SET/GET: latest Azure local-NVMe runs

Environment:

server: Azure Standard_L4as_v4, 4 vCPU
client: Azure Standard_D2as_v4
storage: local NVMe, ext4, noatime/nodiratime, scheduler=none
protocol: RESP3 over TCP
value size: 256 bytes
client load: memtier, 200 connections, pipeline 30, 4 threads, 30 seconds

| Operation | Durability/read mode | Throughput | p50 latency | p99 latency | p99.9 latency |
| --- | --- | --- | --- | --- | --- |
| SET | Quorum durable write | 175,650 ops/s | 129 ms | 338 ms | 391 ms |
| SET | Async write | 267,935 ops/s | 59 ms | 414 ms | 532 ms |
| GET | Quorum read | 592,684 ops/s | 37 ms | 125 ms | 194 ms |
| GET | Async read | 387,057 ops/s | 36 ms | 200 ms | 11,403 ms |

The async-read p99.9 of 11,403 ms is a large outlier from this single run. Treat p50 and p99 as the useful signal for that row until the run is repeated.

A higher-pipeline quorum SET run on the same 4-vCPU NVMe setup reached:

| Operation | Client load | Throughput | p50 latency | p99 latency | p99.9 latency |
| --- | --- | --- | --- | --- | --- |
| SET quorum | 200 connections, pipeline 50 | 200,969 ops/s | 197.63 ms | 339.97 ms | 423.94 ms |
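
The loaded latencies in these tables mostly reflect queueing in deep pipelines, and Little's law gives a rough sanity check: mean latency is approximately requests in flight divided by throughput. The sketch below assumes that "200 connections" means 200 per memtier thread (800 in total across 4 threads) and that every connection keeps its pipeline full. Both are assumptions, and the estimate is a mean, so it should only be compared loosely with the reported p50.

```python
def expected_latency_ms(connections_per_thread: int, threads: int,
                        pipeline: int, ops_per_sec: float) -> float:
    """Little's law: mean time in system = requests in flight / throughput."""
    in_flight = connections_per_thread * threads * pipeline
    return in_flight / ops_per_sec * 1000.0


# (label, pipeline depth, reported ops/s, reported p50 in ms)
rows = [
    ("SET quorum, pipeline 30", 30, 175_650, 129.0),
    ("GET quorum, pipeline 30", 30, 592_684, 37.0),
    ("SET quorum, pipeline 50", 50, 200_969, 197.63),
]

for label, pipeline, ops, p50 in rows:
    estimate = expected_latency_ms(200, 4, pipeline, ops)
    print(f"{label}: estimated mean {estimate:.0f} ms, reported p50 {p50} ms")
```

Under those assumptions the estimates come out near 137 ms, 40 ms, and 199 ms, close to the reported p50 values. Latency under this load is therefore a function of how many requests the client keeps outstanding, not a measure of per-request service time.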

Single-command unloaded durable SET latency on the same NVMe server:

n=30
min=7.724 ms
avg=8.129 ms
p50=8.054 ms
max=9.296 ms

Reproducing the shapes

FerricFlow benchmarks are run from the Python SDK repository with the optimized queue/workflow benchmark scripts. KV benchmarks use memtier_benchmark with RESP3, 256-byte values, and the connection/pipeline settings above.
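
The exact memtier_benchmark invocation is not recorded on this page. As a starting point, the sketch below assembles a command line from the settings listed in the KV environment. The host name and port 6379 are placeholders, the helper name is hypothetical, and treating "200 connections" as memtier's per-thread `--clients` value is an assumption.

```python
import shlex


def memtier_command(host: str, port: int = 6379, *, pipeline: int = 30,
                    clients: int = 200, threads: int = 4, seconds: int = 30,
                    value_bytes: int = 256, ratio: str = "1:0") -> list[str]:
    """Build a memtier_benchmark argv list.

    ratio is SET:GET, so "1:0" is a SET-only run and "0:1" a GET-only run.
    The port default is a placeholder, not a documented FerricStore value.
    """
    return [
        "memtier_benchmark",
        "--server", host,
        "--port", str(port),
        "--protocol", "resp3",
        "--clients", str(clients),        # connections per thread
        "--threads", str(threads),
        "--pipeline", str(pipeline),
        "--data-size", str(value_bytes),  # value size in bytes
        "--test-time", str(seconds),
        "--ratio", ratio,
    ]


cmd = memtier_command("ferricstore.example.internal")  # placeholder host
print(shlex.join(cmd))
```

Pass `pipeline=50` for the higher-pipeline SET run and `ratio="0:1"` for GET runs. The quorum and async modes in the table are not memtier options and are presumably selected on the server side, so the sketch does not cover them.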

For public reporting, prefer the 1M-flow live results and the 200-connection, pipeline-30 KV table. They are less bursty than short or preloaded-only runs.