ADR 0004 — Math core: architecture-keyed AttentionKind dispatch
Constrains: FR-PLAN-002, FR-PLAN-003
Date: 2026-04-19 Status: Accepted
Context
hwLedger's differentiator is correctly handling MoE, MLA, hybrid attention, and SSM — every public VRAM calculator gets these wrong. The math core must dispatch on architecture shape, not on a one-size-fits-all dense-model formula with fudge factors.
Research brief 04 (docs/research/04-kv-formulas.md — pending archive) enumerates eight architecture families with distinct per-token state formulas:
| Kind | Formula (bytes / token) |
|---|---|
Mha | 2 · L · H · d · b |
Gqa | 2 · L · H_kv · d · b |
Mqa | 2 · L · 1 · d · b |
Mla | (kv_lora_rank + qk_rope_head_dim) · b — layer-invariant in absorb mode |
SlidingWindow { window } | 2 · L · H_kv · d · min(seq_len, window) · b |
Ssm { state_size } | state_size · L · b — seq-invariant |
Hybrid(Vec<LayerKind>) | Σ over layers by kind |
AttentionSink { sinks, window } | 2 · L · H_kv · d · (sinks + window) · b |
Decision
The math core crate hwledger-core::math exposes an AttentionKind enum with one variant per architecture family, and a KvFormula trait with a single method:
trait KvFormula {
fn bytes_per_token(&self, seq_len: u64, bytes_per_element: f64) -> f64;
}Each AttentionKind variant carries its architecture-specific parameters (e.g. Mla { kv_lora_rank, qk_rope_head_dim }, Hybrid(Vec<LayerKind>)). The hwledger-arch crate owns the classify(&Config) -> AttentionKind mapping from HF config.json + GGUF headers + MLX configs.
Total memory is composed, not baked into one formula:
VRAM = W_weights(quant)
+ O_runtime
+ KV_seq · live_sequences
+ A_prefill(batch, seq)Where W_weights distinguishes resident vs active parameters for MoE (full model loaded, only a fraction active per token), and O_runtime is calibrated per backend (MLX, mistral.rs, vLLM).
Open-enum policy
AttentionKind is non-exhaustive. Adding a new variant for (e.g.) future MLA revisions or new hybrid layouts is an additive change guarded by:
- A new
LayerKindor top-level variant. - Property tests proving the new formula matches a vendored reference model within ±200 MB.
- An ADR addendum describing provenance of the formula.
Consequences
- Correctness over convenience. Downstream UI code must always render a breakdown (weights / KV / runtime / prefill / free); it cannot collapse to a single "VRAM" number without losing the MoE-vs-total and KV-vs-weights distinctions users need.
- Classification is fallible.
classifyreturnsResult<AttentionKind, ClassifyError>; unknown configs surface as explicit errors, not silent dense-fallbacks (fail-loudly policy). - Test matrix grows per new family. Accepted: each new variant must land with its golden-test fixture.
Rejected alternatives
- Single dense formula + per-model correction factor. This is what HF Accelerate and can-it-run-llm do. Fails for MoE and MLA at >50 % error margins.
- Closed-enum with string-based fallback. Makes classification drift silent; contradicts fail-loudly.
- Trait objects all the way down (no enum). Loses pattern-match exhaustiveness that catches missing implementations at compile time.
References
PLAN.md§5 Math core.- Research brief: KV / state formulas per architecture (to be archived at
docs/research/04-kv-formulas.md). - DeepSeek-V2 MLA paper, Qwen3.6
config.json(layer_types), Jamba paper. - vLLM
kv_cache_dtypeflag; llama.cpp KV-quant options.