Skip to content

ADR 0004 — Math core: architecture-keyed AttentionKind dispatch

Constrains: FR-PLAN-002, FR-PLAN-003

Date: 2026-04-19 Status: Accepted

Context

hwLedger's differentiator is correctly handling MoE, MLA, hybrid attention, and SSM — every public VRAM calculator gets these wrong. The math core must dispatch on architecture shape, not on a one-size-fits-all dense-model formula with fudge factors.

Research brief 04 (docs/research/04-kv-formulas.md — pending archive) enumerates eight architecture families with distinct per-token state formulas:

KindFormula (bytes / token)
Mha2 · L · H · d · b
Gqa2 · L · H_kv · d · b
Mqa2 · L · 1 · d · b
Mla(kv_lora_rank + qk_rope_head_dim) · blayer-invariant in absorb mode
SlidingWindow { window }2 · L · H_kv · d · min(seq_len, window) · b
Ssm { state_size }state_size · L · bseq-invariant
Hybrid(Vec<LayerKind>)Σ over layers by kind
AttentionSink { sinks, window }2 · L · H_kv · d · (sinks + window) · b

Decision

The math core crate hwledger-core::math exposes an AttentionKind enum with one variant per architecture family, and a KvFormula trait with a single method:

text
trait KvFormula {
    fn bytes_per_token(&self, seq_len: u64, bytes_per_element: f64) -> f64;
}

Each AttentionKind variant carries its architecture-specific parameters (e.g. Mla { kv_lora_rank, qk_rope_head_dim }, Hybrid(Vec<LayerKind>)). The hwledger-arch crate owns the classify(&Config) -> AttentionKind mapping from HF config.json + GGUF headers + MLX configs.

Total memory is composed, not baked into one formula:

text
VRAM  =  W_weights(quant)
      +  O_runtime
      +  KV_seq · live_sequences
      +  A_prefill(batch, seq)

Where W_weights distinguishes resident vs active parameters for MoE (full model loaded, only a fraction active per token), and O_runtime is calibrated per backend (MLX, mistral.rs, vLLM).

Open-enum policy

AttentionKind is non-exhaustive. Adding a new variant for (e.g.) future MLA revisions or new hybrid layouts is an additive change guarded by:

  • A new LayerKind or top-level variant.
  • Property tests proving the new formula matches a vendored reference model within ±200 MB.
  • An ADR addendum describing provenance of the formula.

Consequences

  • Correctness over convenience. Downstream UI code must always render a breakdown (weights / KV / runtime / prefill / free); it cannot collapse to a single "VRAM" number without losing the MoE-vs-total and KV-vs-weights distinctions users need.
  • Classification is fallible. classify returns Result<AttentionKind, ClassifyError>; unknown configs surface as explicit errors, not silent dense-fallbacks (fail-loudly policy).
  • Test matrix grows per new family. Accepted: each new variant must land with its golden-test fixture.

Rejected alternatives

  • Single dense formula + per-model correction factor. This is what HF Accelerate and can-it-run-llm do. Fails for MoE and MLA at >50 % error margins.
  • Closed-enum with string-based fallback. Makes classification drift silent; contradicts fail-loudly.
  • Trait objects all the way down (no enum). Loses pattern-match exhaustiveness that catches missing implementations at compile time.

References

  • PLAN.md §5 Math core.
  • Research brief: KV / state formulas per architecture (to be archived at docs/research/04-kv-formulas.md).
  • DeepSeek-V2 MLA paper, Qwen3.6 config.json (layer_types), Jamba paper.
  • vLLM kv_cache_dtype flag; llama.cpp KV-quant options.

Released under the Apache 2.0 License.