erllama_cache_key (erllama v0.1.0)
Summary
Functions
decode_tokens/1
effective_fingerprint/2
Compute an effective fingerprint from a base model fingerprint and a list of attached LoRA adapters.
encode_tokens/1
make/1
make/4
Variant taking a pre-encoded TokensBin (u32-LE per token, matching encode_tokens/1).
quant_atom/1
quant_byte/1
Types
-type components() :: #{fingerprint := <<_:256>>, quant_type := quant_type(), ctx_params_hash := <<_:256>>, tokens := [non_neg_integer()]}.
-type key() :: <<_:256>>.
-type quant_type() :: f32 | f16 | bf16 | q4_0 | q4_1 | q5_0 | q5_1 | q8_0 | q2_k | q3_k_s | q3_k_m | q3_k_l | q4_k_m | q4_k_s | q5_k_m | q5_k_s | q6_k | q8_k | iq1_s | iq1_m | iq2_xxs | iq2_xs | iq2_s | iq2_m | iq3_xxs | iq3_xs | iq3_s | iq3_m | iq4_nl | iq4_xs | atom().
Functions
-spec decode_tokens(binary()) -> [non_neg_integer()].
-spec effective_fingerprint(<<_:256>>, [{<<_:256>>, float()}]) -> <<_:256>>.
Compute an effective fingerprint from a base model fingerprint and a list of attached LoRA adapters.
LoRA changes the model's logits, not its inputs, so attached adapters must enter the cache key. Two requests on the same model with different adapter sets or scales must never collide or produce a false hit on each other's cache entries.
effective_fp = sha256(model_fp || sorted_pairs)
where || denotes byte concatenation and sorted_pairs is the concatenation of
(adapter_sha256 || u64_le(scale_q32)) for every attached adapter, sorted by
adapter_sha256 for determinism. scale_q32 is the scale multiplied by 2^32 and
rounded to an int64, so the raw floating-point bits never enter the key.
An empty adapter list returns the base fingerprint unchanged.
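The formula above can be sketched in Erlang as follows. This is a minimal illustration of the documented steps, not the module's actual source: the module name effective_fp_sketch is hypothetical, adapters are assumed to be {AdapterSha256, Scale} tuples as in the spec, and the fixed-point scale is written with Erlang's 64-bit little-endian integer segment, which stores negative values in two's complement.

```erlang
%% Hypothetical module: a sketch of the documented formula only,
%% not the erllama_cache_key implementation.
-module(effective_fp_sketch).
-export([effective_fingerprint/2]).

%% Base model fingerprint plus a list of {AdapterSha256, Scale} pairs.
-spec effective_fingerprint(<<_:256>>, [{<<_:256>>, float()}]) -> <<_:256>>.
effective_fingerprint(ModelFp, []) ->
    %% No adapters attached: the base fingerprint is returned unchanged.
    ModelFp;
effective_fingerprint(ModelFp, Adapters) ->
    %% Sort by adapter hash so attach order does not change the key.
    Sorted = lists:keysort(1, Adapters),
    %% Concatenate adapter_sha256 || u64_le(scale_q32) for every adapter.
    Pairs = << <<Sha/binary, (scale_q32(Scale)):64/little>>
               || {Sha, Scale} <- Sorted >>,
    crypto:hash(sha256, <<ModelFp/binary, Pairs/binary>>).

%% Fixed-point scale: Scale * 2^32 rounded to an integer. The 64-bit
%% segment above writes negative values in two's complement.
scale_q32(Scale) ->
    round(Scale * (1 bsl 32)).
```

Sorting with lists:keysort/2 makes the result independent of the order in which adapters were attached, which is the determinism property the description asks for.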
-spec encode_tokens([non_neg_integer()]) -> binary().
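encode_tokens/1 and decode_tokens/1 carry only type specs here. Assuming the u32-LE-per-token layout that the make/4 documentation attributes to encode_tokens/1, a round-trip pair might look like the sketch below; the module name token_codec_sketch and the strict length guard in decode_tokens/1 are assumptions for illustration.

```erlang
%% Hypothetical module: a sketch assuming 4 bytes, little-endian,
%% per token id, as described for encode_tokens/1.
-module(token_codec_sketch).
-export([encode_tokens/1, decode_tokens/1]).

-spec encode_tokens([non_neg_integer()]) -> binary().
encode_tokens(Tokens) ->
    %% Each token id becomes exactly 4 bytes. Ids must fit in 32 bits;
    %% larger values would be silently truncated by the segment.
    << <<T:32/little-unsigned>> || T <- Tokens >>.

-spec decode_tokens(binary()) -> [non_neg_integer()].
decode_tokens(Bin) when byte_size(Bin) rem 4 =:= 0 ->
    %% Read the buffer back one 4-byte little-endian word at a time.
    %% The guard (an assumption) rejects buffers that are not whole tokens.
    [T || <<T:32/little-unsigned>> <= Bin].
```

Because every token occupies a fixed 4 bytes, the first N tokens are always the first N*4 bytes of the buffer, which is what makes the prefix slicing described under make/4 possible.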
-spec make(components()) -> key().
-spec make(<<_:256>>, quant_type(), <<_:256>>, binary()) -> key().
Variant taking a pre-encoded TokensBin (u32-LE per token, matching
encode_tokens/1). The longest-prefix walk uses it so that a caller can
encode once and pass binary:part(AllTokensBin, 0, N*4) sub-binaries
per probe, avoiding a list traversal and a list-comprehension
allocation on every attempt. Because sub-binaries are O(1) views, the
per-probe cost reduces to the SHA-256 work alone.
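The longest-prefix walk described above can be sketched as a loop over binary:part/3 views of one encoded buffer. KeyFun and LookupFun are hypothetical caller-supplied functions (in practice KeyFun might wrap erllama_cache_key:make/4 with a fixed fingerprint, quant type and context hash), the module name prefix_walk_sketch is hypothetical, and the downward one-token-at-a-time probing order is an assumption, since the source does not specify the walk's step.

```erlang
%% Hypothetical module: KeyFun and LookupFun are caller-supplied
%% assumptions, e.g. KeyFun = fun(B) -> erllama_cache_key:make(Fp, Quant, CtxHash, B) end.
-module(prefix_walk_sketch).
-export([longest_prefix/3]).

%% Returns {N, Key, Value} for the longest cached prefix of N tokens, or none.
-spec longest_prefix(fun((binary()) -> term()),
                     fun((term()) -> {ok, term()} | error),
                     binary()) -> {pos_integer(), term(), term()} | none.
longest_prefix(KeyFun, LookupFun, AllTokensBin) ->
    %% 4 bytes per token, so the token count is the byte size div 4.
    probe(KeyFun, LookupFun, AllTokensBin, byte_size(AllTokensBin) div 4).

probe(_KeyFun, _LookupFun, _Bin, 0) ->
    none;
probe(KeyFun, LookupFun, Bin, N) ->
    %% binary:part/3 yields a sub-binary of the first N tokens without
    %% re-encoding, so KeyFun's hashing is the main per-probe cost.
    Prefix = binary:part(Bin, 0, N * 4),
    Key = KeyFun(Prefix),
    case LookupFun(Key) of
        {ok, Value} -> {N, Key, Value};
        error -> probe(KeyFun, LookupFun, Bin, N - 1)
    end.
```

With LookupFun = fun(K) -> maps:find(K, Cache) end, any map from keys to cached values works, which keeps the sketch testable without the real cache.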
-spec quant_atom(0..255) -> {ok, quant_type()} | {error, unknown_quant}.
-spec quant_byte(quant_type()) -> 0..255.