Multi-Query Attention (MQA)
Extreme KV cache compression: all query heads share a single K, V head.
Formula
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K^\top}{\sqrt{d_k}}\right)V, \qquad Q_i = X W_i^Q$$

where all heads share one (K, V) projection:

$$K = X W^K, \qquad V = X W^V$$
Compression ratio: h to 1 (h query heads, 1 KV head).
Why this variant
MQA addresses MHA's decoder-side bottleneck: during autoregressive generation each new token must read the entire KV cache, and MHA's per-head K and V make that read memory-bandwidth-bound on most GPUs. MQA cuts K/V reads to 1/h of MHA's. It was introduced by Shazeer (2019) and deployed in PaLM (Chowdhery et al., 2022). It has since been superseded by GQA in most production 2024–2026 models because a single shared K, V is demonstrably lower quality on long-context reasoning; MQA is retained for extreme memory-constrained deployments and some Falcon variants.
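To make the bandwidth argument concrete, here is a minimal sketch of the per-token read volume. The function name and the 7B-style shape (32 layers, 32 heads, head_dim 128, 32K context, FP16) are illustrative assumptions, not hwLedger code:

```rust
/// Bytes of KV cache read from memory to generate one token, assuming the
/// whole cache is touched once per decode step (the bandwidth-bound case).
/// dtype_bytes = 2 corresponds to FP16; all parameters are illustrative.
fn kv_bytes_read_per_token(
    context_len: u64,
    n_layers: u64,
    n_kv_heads: u64,
    head_dim: u64,
    dtype_bytes: u64,
) -> u64 {
    // The factor of 2 accounts for both K and V tensors.
    2 * context_len * n_layers * n_kv_heads * head_dim * dtype_bytes
}

fn main() {
    // Assumed 7B-style shape: 32 layers, 32 heads, head_dim 128, 32K context, FP16.
    let mha = kv_bytes_read_per_token(32_768, 32, 32, 128, 2);
    let mqa = kv_bytes_read_per_token(32_768, 32, 1, 128, 2);
    println!("MHA reads {:.1} GiB per token", mha as f64 / (1u64 << 30) as f64); // ~16.0 GiB
    println!("MQA reads {:.2} GiB per token", mqa as f64 / (1u64 << 30) as f64); // ~0.50 GiB
    println!("reduction: {}x", mha / mqa); // 32x
}
```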
hwLedger accounting gotcha. AttentionKind::MQA is a stable variant but hwLedger will not auto-classify a model as MQA purely from num_key_value_heads == 1; the classifier requires an explicit attention_type hint or a known model family, because some MLA-projected configs also show num_key_value_heads == 1 post-projection. See hwledger-arch fixtures for the disambiguation tests.
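The real classifier and its fixtures live in hwledger-arch; the following is only a sketch of the disambiguation rule described above, with hypothetical field and function names (only `AttentionKind::MQA` comes from the docs):

```rust
/// Attention variants the ledger tracks; only the ones needed here are shown.
#[derive(Debug, PartialEq)]
enum AttentionKind {
    MHA,
    GQA,
    MQA,
    MLA,
    Unknown,
}

/// Hypothetical config view; field names echo common HF config keys but are
/// not hwLedger's real structs.
struct ModelConfig<'a> {
    num_attention_heads: u32,
    num_key_value_heads: u32,
    attention_type: Option<&'a str>, // explicit hint, if the config carries one
    model_family: Option<&'a str>,   // e.g. Some("falcon")
}

/// Sketch of the disambiguation rule: num_key_value_heads == 1 alone is not
/// enough to call a model MQA, because some MLA-projected configs also report
/// 1 post-projection.
fn classify(cfg: &ModelConfig) -> AttentionKind {
    match (cfg.attention_type, cfg.num_key_value_heads) {
        (Some("mqa"), _) => AttentionKind::MQA,
        (Some("mla"), _) => AttentionKind::MLA,
        // Known MQA model families can be whitelisted.
        (None, 1) if cfg.model_family == Some("falcon") => AttentionKind::MQA,
        // One KV head but no hint and no known family: refuse to auto-classify.
        (None, 1) => AttentionKind::Unknown,
        (_, kv) if kv == cfg.num_attention_heads => AttentionKind::MHA,
        _ => AttentionKind::GQA,
    }
}

fn main() {
    let ambiguous = ModelConfig {
        num_attention_heads: 32,
        num_key_value_heads: 1,
        attention_type: None,
        model_family: None,
    };
    assert_eq!(classify(&ambiguous), AttentionKind::Unknown);
}
```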
Memory footprint (32K context, 7B model)
Llama-2-7B-Chat → hypothetical MQA variant (head_dim 128, FP16):
- KV cache per layer: 32K × (128 + 128) × 2 bytes × 1 KV head ≈ 16.8 MB/layer (16 MiB)
- Full cache: ≈0.54 GB (512 MiB) across all 32 layers
- Batch size expansion: 5-10x larger batch before OOM
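The per-layer and whole-cache figures fall out of the usual sizing arithmetic. A minimal sketch under the same assumptions (Llama-2-7B-like shape: 32 layers, head_dim 128, FP16); not hwLedger's planner:

```rust
/// KV cache size in bytes for one layer: K and V for seq_len tokens across
/// n_kv_heads heads of head_dim each, dtype_bytes per element (2 for FP16).
fn kv_bytes_per_layer(seq_len: u64, n_kv_heads: u64, head_dim: u64, dtype_bytes: u64) -> u64 {
    2 * seq_len * n_kv_heads * head_dim * dtype_bytes
}

fn main() {
    let (seq_len, head_dim, n_layers, fp16) = (32_768u64, 128u64, 32u64, 2u64);

    // Hypothetical 7B MQA variant: a single KV head.
    let mqa_layer = kv_bytes_per_layer(seq_len, 1, head_dim, fp16);
    let mqa_total = mqa_layer * n_layers;
    println!(
        "MQA: {:.1} MB/layer, {:.2} GB total",
        mqa_layer as f64 / 1e6,
        mqa_total as f64 / 1e9
    ); // ~16.8 MB/layer, ~0.54 GB total

    // Llama-2-7B MHA baseline: 32 KV heads.
    let mha_total = kv_bytes_per_layer(seq_len, 32, head_dim, fp16) * n_layers;
    println!(
        "MHA: {:.2} GB total ({}x larger)",
        mha_total as f64 / 1e9,
        mha_total / mqa_total
    ); // ~17.18 GB total (32x larger)
}
```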
Which models use it
- PaLM (Google, 2022; early large-scale MQA deployment)
- MQA-uptrained T5 checkpoints (limited adoption; the GQA paper reported better results with GQA)
- Falcon-7B (single shared KV head); Falcon-40B groups its KV heads and is closer to GQA
MQA has largely been superseded by GQA (better accuracy-efficiency tradeoff).
hwLedger variant
AttentionKind::MQA — single KV head, maximal compression. Used when planning for memory-constrained rentals (Lambda, tiny VMs).
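For memory-constrained planning, the useful number is how many full-length sequences fit once the weights are resident. A rough sketch; the 24 GiB card, the ~14 GB FP16 weight figure, and the function name are illustrative assumptions rather than hwLedger's planner:

```rust
/// Rough count of full-length sequences whose KV cache fits in the memory
/// left after the weights; ignores activations, fragmentation, and overhead.
fn max_concurrent_seqs(vram_bytes: u64, weight_bytes: u64, kv_bytes_per_seq: u64) -> u64 {
    vram_bytes.saturating_sub(weight_bytes) / kv_bytes_per_seq
}

fn main() {
    let vram = 24u64 * 1024 * 1024 * 1024; // e.g. a single 24 GiB card
    let weights_fp16 = 14_000_000_000u64;  // ~14 GB for a 7B model in FP16
    let mqa_kv = 536_870_912u64;           // ~0.5 GiB per 32K-token sequence (from above)
    let mha_kv = mqa_kv * 32;              // ~16 GiB per 32K-token sequence

    println!("MQA: {} sequences fit", max_concurrent_seqs(vram, weights_fp16, mqa_kv)); // 21
    println!("MHA: {} sequences fit", max_concurrent_seqs(vram, weights_fp16, mha_kv)); // 0
}
```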
Worked example: 32K context
Hypothetical 7B model with MQA:
- All 32 query heads share 1 KV head
- KV cache: 32K × (128 + 128) × 2 bytes × 1 KV head ≈ 16.8 MB/layer
- Full 32 layers: ≈0.54 GB (512 MiB) total
- Trade-off: somewhat lower attention quality (every head attends over the same K, V)
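Both figures match the kv_bytes_per_layer sketch above with seq_len = 32_768, n_kv_heads = 1, head_dim = 128, dtype_bytes = 2.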
MQA vs MHA baseline (32K context, FP16, head_dim 128)
| Model | kv_heads | q_heads | KV cache/layer | Full cache (32 layers) | vs MHA |
|---|---|---|---|---|---|
| Falcon-40B-like MQA | 1 | 64 | 16 MiB | 0.5 GiB | 64× smaller |
| Hypothetical 7B MQA | 1 | 32 | 16 MiB | 0.5 GiB | 32× smaller |
| Llama-2-7B MHA baseline | 32 | 32 | 512 MiB | 16.0 GiB | 1× |