
Multi-Query Attention (MQA)

Extreme KV cache compression: all query heads share a single K, V head.

Formula

$$\mathrm{MQA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

where all heads share one (K, V) projection:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W^K,\; V W^V)$$

Compression ratio: h to 1 (h query heads, 1 KV head).
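
To make the sharing concrete, here is a minimal single-token decode step in plain Rust (no crates). All names and shapes are illustrative, not hwLedger API: h query heads score against one shared K cache and mix one shared V cache, which is exactly the single-KV-head structure in the formula above.

```rust
fn softmax(xs: &mut [f32]) {
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let sum: f32 = xs.iter_mut().map(|x| { *x = (*x - max).exp(); *x }).sum();
    for x in xs.iter_mut() {
        *x /= sum;
    }
}

/// One MQA decode step: h query heads, ONE shared (K, V) head.
/// q: [h][d], k_cache/v_cache: [t][d]  ->  output: [h][d]
fn mqa_decode_step(
    q: &[Vec<f32>],
    k_cache: &[Vec<f32>],
    v_cache: &[Vec<f32>],
) -> Vec<Vec<f32>> {
    let d = q[0].len();
    let scale = 1.0 / (d as f32).sqrt();
    q.iter()
        .map(|qi| {
            // Every head scores against the SAME K cache: K is read once
            // and amortized across all h query heads (the 1/h bandwidth win).
            let mut scores: Vec<f32> = k_cache
                .iter()
                .map(|kt| qi.iter().zip(kt).map(|(a, b)| a * b).sum::<f32>() * scale)
                .collect();
            softmax(&mut scores);
            // ...and mixes the SAME V cache.
            let mut out = vec![0.0f32; d];
            for (w, vt) in scores.iter().zip(v_cache) {
                for (o, v) in out.iter_mut().zip(vt) {
                    *o += w * v;
                }
            }
            out
        })
        .collect()
}

fn main() {
    let q = vec![vec![1.0, 0.0], vec![0.0, 1.0]]; // h = 2 heads, d = 2
    let kv = vec![vec![1.0, 0.0], vec![0.0, 1.0]]; // t = 2 cached tokens
    println!("{:?}", mqa_decode_step(&q, &kv, &kv)); // both heads read one cache
}
```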

Why this variant

MQA addresses MHA's decoder-side bottleneck: during autoregressive generation each new token reads the entire KV cache, and MHA's per-head K and V make those reads memory-bandwidth-bound on most GPUs. MQA cuts K/V reads to 1/h of MHA's. It was introduced by Shazeer (2019) and deployed in PaLM (Chowdhery et al., 2022). It has since been superseded by GQA in most production 2024–2026 models because a single shared K, V head measurably degrades quality on long-context reasoning; MQA is retained for extreme memory-constrained deployments and some Falcon variants.

hwLedger accounting gotcha. AttentionKind::MQA is a stable variant but hwLedger will not auto-classify a model as MQA purely from num_key_value_heads == 1; the classifier requires an explicit attention_type hint or a known model family, because some MLA-projected configs also show num_key_value_heads == 1 post-projection. See hwledger-arch fixtures for the disambiguation tests.
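
A sketch of that disambiguation rule follows; aside from `AttentionKind::MQA`, every name in it is an assumption, and the real logic lives in the hwledger-arch fixtures.

```rust
/// Families known to ship pure-MQA checkpoints (illustrative list, not
/// hwLedger's actual table).
const KNOWN_MQA_FAMILIES: &[&str] = &["falcon", "palm"];

enum AttentionKind { MQA, MLA, Unknown }

struct ModelConfig<'a> {
    num_key_value_heads: u32,
    attention_type: Option<&'a str>, // explicit hint, if the config carries one
    model_family: Option<&'a str>,
}

fn classify(cfg: &ModelConfig) -> AttentionKind {
    // num_key_value_heads == 1 alone is ambiguous: MLA-projected configs can
    // also report 1 KV head post-projection, so never classify on it alone.
    match (cfg.attention_type, cfg.model_family) {
        (Some("mqa"), _) => AttentionKind::MQA,
        (Some("mla"), _) => AttentionKind::MLA,
        (None, Some(fam))
            if cfg.num_key_value_heads == 1 && KNOWN_MQA_FAMILIES.contains(&fam) =>
        {
            AttentionKind::MQA
        }
        _ => AttentionKind::Unknown, // anything else needs an explicit hint
    }
}

fn main() {
    let cfg = ModelConfig { num_key_value_heads: 1, attention_type: None, model_family: None };
    // One KV head with no hint and no known family stays unclassified.
    assert!(matches!(classify(&cfg), AttentionKind::Unknown));
}
```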

Memory footprint (32K context, 7B model)

Llama-2-7B → hypothetical MQA variant (head_dim 128, FP16):

  • KV cache per layer: 32K × 128 (head_dim) × 2 (K+V) × 2 bytes × 1 KV head = 16 MiB/layer
  • Full cache: 0.5 GiB for all 32 layers
  • Batch size expansion: 5-10× larger batches before OOM in practice (see the sizing sketch below)
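
The per-layer bullet is pure arithmetic; this small Rust check makes it explicit (the helper name is illustrative, not hwLedger API).

```rust
// KV-cache sizing behind the bullets above. Assumed geometry: Llama-2-7B-like,
// head_dim 128, 32 layers, FP16 (2 bytes per element).

fn kv_bytes_per_layer(tokens: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: u64) -> u64 {
    2 * tokens * kv_heads * head_dim * bytes_per_elem // 2x for K and V
}

fn main() {
    let per_layer = kv_bytes_per_layer(32 * 1024, 1, 128, 2);
    assert_eq!(per_layer, 16 << 20);       // 16 MiB/layer with 1 KV head
    assert_eq!(per_layer * 32, 512 << 20); // 0.5 GiB across 32 layers
}
```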

Which models use it

  • PaLM (Google, original MQA baseline)
  • T5 (MQA uptraining experiments; limited adoption, better results with GQA)
  • Falcon 40B (some variants)

MQA has largely been superseded by GQA (better accuracy-efficiency tradeoff).

hwLedger variant

AttentionKind::MQA — single KV head, maximal compression. Used when planning for memory-constrained rentals (Lambda, tiny VMs).
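
As a hedged sketch of what "planning" means here: a planner might fall back to `AttentionKind::MQA` only when nothing less compressed fits the KV budget. Only `AttentionKind::MQA` is a real hwLedger name; everything else below is assumed for illustration.

```rust
enum AttentionKind { MHA, GQA, MQA }

/// Pick the least-compressed variant whose full KV cache fits the budget.
fn pick_variant(mha_cache_bytes: u64, gqa_groups: u64, q_heads: u64, budget_bytes: u64) -> AttentionKind {
    if mha_cache_bytes <= budget_bytes {
        AttentionKind::MHA
    } else if mha_cache_bytes * gqa_groups / q_heads <= budget_bytes {
        AttentionKind::GQA
    } else {
        AttentionKind::MQA // single shared KV head: 1/h of the MHA cache
    }
}

fn main() {
    // 16 GiB MHA cache (32K ctx, 7B, FP16) against a 2 GiB KV budget on a
    // small rental GPU: GQA-8 would still need 4 GiB, so MQA (0.5 GiB) wins.
    let kind = pick_variant(16 << 30, 8, 32, 2 << 30);
    assert!(matches!(kind, AttentionKind::MQA));
}
```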

Worked example: 32K context

Hypothetical 7B model with MQA:

  • All 32 query heads share 1 KV head
  • KV cache: 32K × 128 (head_dim) × 2 (K+V) × 2 bytes × 1 KV head = 16 MiB/layer
  • Full 32 layers: 0.5 GiB total
  • Trade-off: slightly lower attention quality (all heads attend to the same K, V)

MQA vs MHA baseline (32K context, FP16)

| Model | kv_heads | q_heads | KV/layer | Full cache (32L) | vs MHA |
|---|---|---|---|---|---|
| Falcon-40B-like MQA | 1 | 64 | 16 MiB | 0.5 GiB | 64× smaller |
| Hypothetical 7B MQA | 1 | 32 | 16 MiB | 0.5 GiB | 32× smaller |
| Llama-2-7B MHA baseline | 32 | 32 | 512 MiB | 16 GiB | 1× (baseline) |
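
All three rows reduce to one formula. A quick Rust check, assuming head_dim 128 for every row (FP16 and 32 layers per the table header):

```rust
// Reproduces the table rows. Assumptions: FP16 (2 bytes), 32K tokens,
// head_dim 128, 32 layers for every row.

fn full_cache_gib(kv_heads: u64) -> f64 {
    let per_layer = 2 * 32 * 1024 * kv_heads * 128 * 2; // 2x for K and V
    (per_layer * 32) as f64 / (1u64 << 30) as f64       // 32 layers, in GiB
}

fn main() {
    for (name, kv_heads, q_heads) in [
        ("Falcon-40B-like MQA", 1u64, 64u64),
        ("Hypothetical 7B MQA", 1, 32),
        ("Llama-2-7B MHA baseline", 32, 32),
    ] {
        println!(
            "{name}: {:.1} GiB, {}x smaller than its MHA baseline",
            full_cache_gib(kv_heads),
            q_heads / kv_heads
        );
    }
}
```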

