
KV Cache Formulas

The math core is the soul of hwLedger. This document walks through the derivation of KV cache formulas per attention architecture, with interactive breakdowns and live calculation.

Overview

KV cache size depends on the attention mechanism (AttentionKind):

  • Multi-Head Attention (MHA): Full K and V for all heads
  • Grouped-Query Attention (GQA): Shared K and V across head groups
  • Multi-Query Attention (MQA): Single shared K and V
  • Multi-Head Latent Attention (MLA): Projected latent space
  • Sliding Window Attention: Fixed-size context window
  • State Space Models (SSM/Mamba): Constant state per layer
  • Hybrid Attention: Mix of patterns per layer
  • Sink Tokens: Fixed sink cache + sliding window

Incorrect calculation costs hours of debugging and wasted VRAM. hwLedger derives formulas per architecture directly from the config.json fields.

Live MLA classification and per-layer VRAM breakdown

KV Cache Formula Derivation

1. Multi-Head Attention (MHA)

Structure: Full keys and values for all num_heads heads.

K_cache = [batch_size, seq_length, num_heads, head_dim]
V_cache = [batch_size, seq_length, num_heads, head_dim]

Bytes per token:

\text{KV cache per token} = 2 \times \text{batch_size} \times \text{num_heads} \times \text{head_dim} \times \text{dtype_bytes}

Example: Gemma 3 (27B, global attention layers) with batch_size=1, seq_length=4096, dtype=float16:

  • num_heads = 32
  • head_dim = 128 (estimated; consistent with the 67 MB/layer MHA figure used later in this document)
  • KV per token = 2 × 1 × 32 × 128 × 2 bytes = 16,384 bytes
  • Total for 4K ctx: 16,384 × 4,096 ≈ 67 MB per layer × 27 layers ≈ 1.8 GB

Note: Gemma 3 uses hybrid 5:1 interleaved attention (local windows + global), reducing effective KV cache vs pure MHA.
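The MHA formula above can be sketched as a small helper. The 32-head, head_dim=128, float16 configuration below is an illustrative assumption chosen to match the 67 MB/layer MHA figure used elsewhere in this document, not any model's exact config.

```python
def mha_kv_bytes_per_token(num_heads: int, head_dim: int,
                           dtype_bytes: int, batch_size: int = 1) -> int:
    """Both K and V are cached for every head: 2 * heads * dim * bytes."""
    return 2 * batch_size * num_heads * head_dim * dtype_bytes

# Illustrative MHA config: 32 heads of dim 128, float16 (2 bytes/element)
per_token = mha_kv_bytes_per_token(num_heads=32, head_dim=128, dtype_bytes=2)
per_layer = per_token * 4096  # 4K context
print(per_token)        # 16384 bytes/token
print(per_layer / 1e6)  # ~67 MB/layer
```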

2. Grouped-Query Attention (GQA)

Structure: Keys and values grouped across num_key_value_heads heads (< num_attention_heads).

K_cache = [batch_size, seq_length, num_key_value_heads, head_dim]
V_cache = [batch_size, seq_length, num_key_value_heads, head_dim]

Bytes per token:

\text{KV cache per token} = 2 \times \text{batch_size} \times \text{num_key_value_heads} \times \text{head_dim} \times \text{dtype_bytes}

Compression ratio: num_attention_heads / num_key_value_heads

Example: Llama 4 Maverick (17B active) with GQA, batch_size=1, dtype=float16:

  • num_attention_heads = 64 (inferred from architecture)
  • num_key_value_heads = 8 (8× compression vs MHA)
  • head_dim = 128 (estimated)
  • KV per token = 2 × 1 × 8 × 128 × 2 bytes = 4,096 bytes
  • Total for 4K ctx: 4,096 × 4,096 bytes ≈ 16.8 MB per layer × 48 layers ≈ 805 MB

GQA saves 8× vs MHA. Llama 4 uses MoE-aware routing; true active parameters ≈17B vs 400B total.
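The GQA variant only changes the head-count factor; a minimal sketch using the estimated Llama 4 Maverick values from the example above:

```python
def gqa_kv_bytes_per_token(num_kv_heads: int, head_dim: int,
                           dtype_bytes: int, batch_size: int = 1) -> int:
    # Only num_key_value_heads K/V pairs are cached, not num_attention_heads
    return 2 * batch_size * num_kv_heads * head_dim * dtype_bytes

# Estimated config: 64 query heads sharing 8 KV heads of dim 128, float16
per_token = gqa_kv_bytes_per_token(num_kv_heads=8, head_dim=128, dtype_bytes=2)
compression = 64 // 8  # num_attention_heads / num_key_value_heads
print(per_token, compression)  # 4096 bytes/token, 8x vs MHA
```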

3. Multi-Query Attention (MQA)

Structure: Single shared K and V across all heads.

K_cache = [batch_size, seq_length, 1, head_dim]
V_cache = [batch_size, seq_length, 1, head_dim]

Bytes per token:

\text{KV cache per token} = 2 \times \text{batch_size} \times \text{head_dim} \times \text{dtype_bytes}

Compression ratio: num_attention_heads (maximum compression)

Example: With batch_size=1, dtype=float16:

  • head_dim = 64
  • KV per token = 2 × 1 × 64 × 2 bytes = 256 bytes
  • Total for 4K ctx: 256 × 4,096 ≈ 1.0 MB per layer × 80 layers ≈ 80 MB

MQA gives the maximum compression: num_attention_heads× vs MHA (e.g., 64× for a 64-head model).
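MQA is the degenerate case of the KV-head formula with a single shared head. A sketch reproducing the example numbers above:

```python
def mqa_kv_bytes_per_token(head_dim: int, dtype_bytes: int,
                           batch_size: int = 1) -> int:
    # A single shared K/V head: the KV-head factor collapses to 1
    return 2 * batch_size * 1 * head_dim * dtype_bytes

per_token = mqa_kv_bytes_per_token(head_dim=64, dtype_bytes=2)
total = per_token * 4096 * 80  # 4K context, 80 layers
print(per_token)        # 256 bytes/token
print(total / 2**20)    # 80.0 MiB across all layers
```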

4. Multi-Head Latent Attention (MLA)

Structure (DeepSeek-V2/V3): Keys and values are compressed into a shared low-rank latent vector, cached together with a small decoupled RoPE key component.

cached_dim = kv_lora_rank + qk_rope_head_dim  (e.g., 512 + 64 = 576)

Bytes per token:

\text{KV cache per token} = \text{batch_size} \times (\text{kv_lora_rank} + \text{qk_rope_head_dim}) \times \text{dtype_bytes}

Compression ratio: (2 × num_heads × head_dim) / (kv_lora_rank + qk_rope_head_dim)

Example: DeepSeek-V3 with MLA, batch_size=1, dtype=float16:

  • kv_lora_rank = 512 (projection rank)
  • qk_rope_head_dim = 64 (per-head rope dimension)
  • KV per token = (512 + 64) × 2 bytes = 1,152 bytes
  • Total for 4K ctx: 1,152 × 4,096 ≈ 4.7 MB per layer × 61 layers ≈ 288 MB

MLA compresses the cache far beyond GQA (1,152 bytes/token here vs 4,096 for an 8-KV-head GQA layout) while retaining per-head expressiveness. DeepSeek-V3 combines MLA + DeepSeekMoE (sparse routing) for efficient inference.
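The DeepSeek-V3 example can be sketched directly from its config fields; note there is no factor of 2, since the latent is shared between K and V:

```python
def mla_kv_bytes_per_token(kv_lora_rank: int, qk_rope_head_dim: int,
                           dtype_bytes: int, batch_size: int = 1) -> int:
    # The compressed latent serves both K and V, so no factor of 2;
    # only the decoupled RoPE key dimension is cached on top of it.
    return batch_size * (kv_lora_rank + qk_rope_head_dim) * dtype_bytes

# kv_lora_rank and qk_rope_head_dim as in DeepSeek-V3's config.json
per_token = mla_kv_bytes_per_token(kv_lora_rank=512, qk_rope_head_dim=64,
                                   dtype_bytes=2)
print(per_token)  # 1152 bytes/token
```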

5. Sliding Window Attention (Mistral)

Structure: Only the last window_size tokens are cached.

K_cache = [batch_size, min(seq_length, window_size), num_heads, head_dim]
V_cache = [batch_size, min(seq_length, window_size), num_heads, head_dim]

Total cache (bounded by the window):

\text{KV cache} = 2 \times \text{batch_size} \times \min(\text{seq_length}, \text{window_size}) \times \text{num_heads} \times \text{head_dim} \times \text{dtype_bytes}

Example: Gemma 3 with 5:1 interleaved attention (local window=1024, global every 5th layer), batch_size=1, dtype=float16:

  • Local window is bounded at 1,024 tokens per layer (80% of layers)
  • Global layers attend full context (20% of layers)
  • num_heads = 32
  • head_dim = 128 (estimated)
  • Local KV = 2 × 1 × 1,024 × 32 × 128 × 2 bytes ≈ 16.8 MB per layer
  • Effective KV (weighted): 0.8 × 16.8 + 0.2 × 67 ≈ 27 MB/layer × 27 layers ≈ 725 MB
  • Savings vs pure MHA grow with context: the 80% of layers that are local stay bounded at 1,024 tokens no matter how long the sequence gets
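A sketch of the bounded-window formula, using the window=1024 and estimated 32-head, 128-dim config from the example above; the degenerate full-window case shows why savings grow with context length:

```python
def sliding_kv_bytes(seq_length: int, window_size: int, num_heads: int,
                     head_dim: int, dtype_bytes: int, batch_size: int = 1) -> int:
    # Only the last window_size tokens are kept in cache
    cached = min(seq_length, window_size)
    return 2 * batch_size * cached * num_heads * head_dim * dtype_bytes

bounded = sliding_kv_bytes(seq_length=131072, window_size=1024,
                           num_heads=32, head_dim=128, dtype_bytes=2)
full = sliding_kv_bytes(131072, 131072, 32, 128, 2)  # window = full context
print(bounded / 2**20)   # 16.0 MiB per local layer, even at 128K ctx
print(full // bounded)   # 128x savings vs full attention at 128K
```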

6. State Space Models (SSM / Mamba)

Structure: A fixed-size recurrent state per layer, independent of sequence length.

state = [batch_size, state_size]

Bytes per layer (constant):

\text{State size} = \text{batch_size} \times \text{state_size} \times \text{dtype_bytes}

Example: Mamba-3 (state_size=64), batch_size=1, dtype=float32:

  • state_size = 64 (MIMO formulation, 2× smaller than Mamba-2)
  • State = 1 × 64 × 4 bytes = 256 bytes per layer
  • Total for 48 layers: 256 × 48 ≈ 12 KB

Savings vs MHA are effectively unbounded, since the state never grows with sequence length (12 KB total here vs 67 MB for a single MHA layer at 4K ctx, ~5,600×). Mamba-3 matches Mamba-2 perplexity at half the state size via its MIMO formulation.
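The constant-state property is easy to demonstrate: sequence length never appears in the formula. A sketch using the simplified per-layer state shape above:

```python
def ssm_state_bytes(state_size: int, num_layers: int,
                    dtype_bytes: int = 4, batch_size: int = 1) -> int:
    # State is constant per layer; sequence length is not a factor
    return batch_size * state_size * dtype_bytes * num_layers

# Example from above: state_size=64, 48 layers, float32 (4 bytes)
total = ssm_state_bytes(state_size=64, num_layers=48)
print(total)  # 12288 bytes (~12 KB), for any context length
```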

7. Hybrid Attention

Structure: Mix of MHA and sliding window across layers.

Some layers use full MHA; others use sliding window. Per-layer check:

if layer.attention_type == "sliding_window":
    cache_size = sliding_window_formula(...)
else:
    cache_size = mha_formula(...)

Example: a hypothetical 32-layer hybrid model:

  • Layers 0–20: MHA (full cache), 21 layers
  • Layers 21–31: Sliding window (bounded cache), 11 layers
  • Total KV = (21 × mha_formula) + (11 × sliding_formula)
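The per-layer dispatch above can be sketched as a single sum; the per-layer byte counts below are illustrative placeholders (the 67 MB / 16.8 MB figures from earlier sections), not a real model's values:

```python
def hybrid_kv_total(layer_types: list[str], mha_bytes: int,
                    sliding_bytes: int) -> int:
    """Sum per-layer cache, dispatching on each layer's attention type."""
    return sum(sliding_bytes if t == "sliding_window" else mha_bytes
               for t in layer_types)

# Hypothetical 32-layer schedule: 21 full-attention, 11 sliding-window
layers = ["mha"] * 21 + ["sliding_window"] * 11
total = hybrid_kv_total(layers, mha_bytes=67_000_000, sliding_bytes=16_800_000)
print(total / 1e9)  # ~1.6 GB
```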

8. Sink Tokens (StreamingLLM-style)

Structure: Fixed sink cache + sliding window for recent context.

sink_cache = [batch_size, num_sink_tokens, num_heads, head_dim]
sliding_cache = [batch_size, window_size, num_heads, head_dim]

Total cache (bounded):

\text{KV cache} = 2 \times \text{batch_size} \times (\text{num_sink} + \text{window}) \times \text{num_heads} \times \text{head_dim} \times \text{dtype_bytes}

Example: With num_sink=4, window=4092, seq_len=32K:

  • Cache is bounded at 4,096 tokens (4 + 4092)
  • Saves 8× vs full 32K MHA
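The bound and the savings factor in the example above fall straight out of the token counts, independent of head configuration:

```python
def sink_cache_tokens(num_sink: int, window: int) -> int:
    # Cache holds the sink tokens plus the sliding window, never more
    return num_sink + window

bounded_tokens = sink_cache_tokens(num_sink=4, window=4092)
savings = 32768 // bounded_tokens  # vs caching the full 32K context
print(bounded_tokens)  # 4096 tokens cached
print(savings)         # 8x savings
```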

MoE-Aware Calculation

For Mixture-of-Experts models (Mixtral, Qwen-MoE):

  • Resident parameters: Always loaded (shared layers, router)
  • Active parameters: Only active experts loaded per forward pass

KV cache is computed over all layers at full size, not scaled by expert sparsity: it depends on sequence length and the attention configuration, which expert routing does not change.

total_kv = (num_layers × per_layer_kv)

Interactive Breakdowns

Use the hwLedger planner to see live per-layer breakdowns:

bash
cargo run --bin hwledger-cli -- plan \
  --model meta-llama/Llama-2-70b \
  --batch-size 1 \
  --seq-length 4096

Output (example):

Layer 0: GQA | KV: 16.8 MB | Params: 1.7 GB | Total: 1.7 GB
Layer 1: GQA | KV: 16.8 MB | Params: 1.7 GB | Total: 1.7 GB
...
Layer 79: GQA | KV: 16.8 MB | Params: 1.7 GB | Total: 1.7 GB
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total KV Cache: 1.3 GB | Weights: 140 GB | Unified Memory: ≈ 141 GB

Comparison Table (April 2026)

| Architecture | Compression vs MHA | Example (Apr 2026) | Release | KV at 4K ctx (1 batch) | Source |
|---|---|---|---|---|---|
| MHA | 1× (baseline) | Gemma 3 (27B, global attn layers) | Mar 2026 | 67 MB/layer | arXiv:2503.19786 |
| GQA | 8× | Llama 4 Maverick (17B active, 400B total) | Apr 2026 | 8.4 MB/layer | Meta Llama Blog |
| MQA | 64× | Jamba-1.5-Mini (12B active, hybrid) | Nov 2024 | 1.0 MB/layer | arXiv:2408.12570 |
| MLA | — | DeepSeek-V3 (kv_lora_rank=512) | Dec 2025 | ~3.3 KB/token | DeepSeek Config |
| Hybrid Attn | mixed | Qwen 3.6 Plus (GDN+softmax, 256 experts) | Mar 2026 | varies | GitHub |
| Hybrid Attn+Mamba | mixed | Jamba-1.5-Large (94B active, 72 layers) | Nov 2024 | varies | AI21 |
| SSM/Mamba-3 | 1000× | Mamba-3 (state_size=64) | Mar 2026 | ~256 KB (total) | arXiv:2603.15569 |
| Interleaved Attn | varies | Gemma 3 (5:1 local/global, 128K ctx) | Mar 2026 | ~34 MB (window=1024) | arXiv:2503.19786 |

Key Takeaways

  1. One size does not fit all: Each architecture has different KV scaling characteristics.
  2. GQA is common: Llama-2 (70B), Mistral, and Qwen use GQA, saving 4–8× vs MHA.
  3. MLA is efficient: Introduced by DeepSeek-V2 and used in DeepSeek-V3; compresses the cache well beyond GQA.
  4. Sliding window limits cache: Mistral and Mixtral cap cache at 4K tokens.
  5. SSMs are cache-free: Mamba and similar don't grow cache with sequence length.
  6. Per-layer check: Always verify config.json for the exact mechanism per layer.

References (Updated April 19, 2026)


Released under the Apache 2.0 License.