
Grouped Query Attention (GQA)

Reduces KV cache size by sharing keys and values across multiple query heads.

Formula

For h query heads, g key-value groups (g << h):

GQA(Q, K, V) = Concat(head_1, ..., head_h) W^O

where heads are grouped into g blocks, each sharing the same K, V projection:

head_i = Attention(Q W_i^Q, K W_{g(i)}^K, V W_{g(i)}^V)

where g(i) is the KV group that query head i belongs to.

KV-head reduction: standard MHA has h KV heads; GQA has g, for an h/g× smaller KV cache.
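The grouping above can be made concrete: consecutive blocks of h/g query heads share one KV head. A minimal sketch of that index mapping (illustrative only, not taken from any particular codebase):

```rust
/// Map a query head to its shared KV group: with `num_heads` query heads and
/// `num_kv_groups` KV groups, consecutive blocks of num_heads / num_kv_groups
/// query heads share one K/V projection.
fn kv_group_for_head(query_head: usize, num_heads: usize, num_kv_groups: usize) -> usize {
    assert!(num_heads % num_kv_groups == 0, "h must be divisible by g");
    let heads_per_group = num_heads / num_kv_groups;
    query_head / heads_per_group
}

fn main() {
    // Mistral-7B layout: 32 query heads, 8 KV groups -> 4 query heads per group.
    assert_eq!(kv_group_for_head(0, 32, 8), 0);
    assert_eq!(kv_group_for_head(3, 32, 8), 0);
    assert_eq!(kv_group_for_head(4, 32, 8), 1);
    assert_eq!(kv_group_for_head(31, 32, 8), 7);
}
```

With g = h this degenerates to standard MHA (one KV head per query head); with g = 1 it is MQA.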

Why this variant

GQA exists because MHA's KV cache grows linearly with the number of query heads — prohibitive at long context (e.g. 32K) even at 7B scale — while MQA's single shared K/V head noticeably degraded quality on reasoning benchmarks. GQA interpolates between the two: in practice 4–8 KV heads per 32 query heads recovers nearly all of MHA's quality at a fraction of the cache. It was formalized in Ainslie et al., 2023, shipped in production in Llama 2 70B and Mistral 7B, and carried forward in Llama 3 (2024) and Llama 4 (2025).

hwLedger accounting gotcha. num_key_value_heads lives at the top level of config.json for HuggingFace-style configs but inside llama.attention.head_count_kv for GGUF metadata. hwledger-arch reads both; if you are writing a new classifier path, do not assume the HF name.
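A classifier path that honors both names might look like the following sketch. The two field names (`num_key_value_heads`, `llama.attention.head_count_kv`) come from the text above; the helper itself is hypothetical, not the actual hwledger-arch API:

```rust
/// Resolve the KV head count from whichever metadata source is present.
/// HF-style configs put it in top-level `num_key_value_heads`; GGUF puts it
/// in `llama.attention.head_count_kv`. If neither key exists, fall back to
/// the query head count (i.e. assume plain MHA).
fn resolve_kv_heads(
    hf_num_key_value_heads: Option<usize>, // config.json: "num_key_value_heads"
    gguf_head_count_kv: Option<usize>,     // GGUF: "llama.attention.head_count_kv"
    num_attention_heads: usize,            // fallback: MHA implies kv == q heads
) -> usize {
    hf_num_key_value_heads
        .or(gguf_head_count_kv)
        .unwrap_or(num_attention_heads)
}

fn main() {
    // HF-style config present.
    assert_eq!(resolve_kv_heads(Some(8), None, 32), 8);
    // GGUF-only metadata.
    assert_eq!(resolve_kv_heads(None, Some(8), 32), 8);
    // Neither key present: assume MHA.
    assert_eq!(resolve_kv_heads(None, None, 32), 32);
}
```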

Memory footprint (32K context, 7B model)

Mistral-7B-Instruct-v0.2 with GQA (8 KV heads) vs. a hypothetical MHA configuration (32 KV heads):

  • KV cache reduction: 32/8 = 4× smaller
  • MHA: 536 MB/layer × 32 layers = 17.2 GB
  • GQA: 134 MB/layer × 32 layers = 4.3 GB
  • Decode latency impact: minimal (same attention arithmetic; far fewer K, V bytes read per step)
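The per-layer figures follow from tokens × 2 tensors (K and V) × kv_heads × head_dim × 2 bytes (FP16), assuming Mistral-7B geometry (head dim 128). A small calculator sketch:

```rust
/// FP16 KV cache bytes for one layer:
/// tokens x 2 tensors (K and V) x kv_heads x head_dim x 2 bytes.
fn kv_cache_bytes_per_layer(tokens: usize, kv_heads: usize, head_dim: usize) -> usize {
    tokens * 2 * kv_heads * head_dim * 2
}

fn main() {
    let tokens = 32 * 1024; // 32K context
    let mha = kv_cache_bytes_per_layer(tokens, 32, 128);
    let gqa = kv_cache_bytes_per_layer(tokens, 8, 128);
    assert_eq!(mha, 536_870_912); // ~536 MB/layer
    assert_eq!(gqa, 134_217_728); // ~134 MB/layer
    assert_eq!(mha / gqa, 4);     // 4x reduction
    println!("MHA total: {:.1} GB", (mha * 32) as f64 / 1e9); // ~17.2 GB over 32 layers
    println!("GQA total: {:.1} GB", (gqa * 32) as f64 / 1e9); // ~4.3 GB over 32 layers
}
```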

Which models use it

  • Mistral-7B-Instruct-v0.2 (8 KV heads, 32K context)
  • Llama 2 70B (8 KV heads; the 7B and 13B variants use standard MHA)
  • Llama 3 8B and 70B (8 KV heads in both)
  • Phi-3-small (Microsoft; 32 query heads, 8 KV heads)

hwLedger variant

AttentionKind::GQA { num_kv_heads } — stores the explicit KV head count. Enables dynamic planning: fewer KV heads = smaller cache = longer context or larger batches within the same memory budget.
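A sketch of how a planner might use such a variant — the enum shape follows the text above, but the surrounding helpers are illustrative assumptions, not the real hwLedger code:

```rust
/// Attention-kind tag shaped like the variant described in the text
/// (illustrative; the real hwLedger type may differ).
enum AttentionKind {
    Mha,
    Gqa { num_kv_heads: usize },
}

impl AttentionKind {
    /// Effective KV head count used for cache planning.
    fn kv_heads(&self, num_attention_heads: usize) -> usize {
        match self {
            AttentionKind::Mha => num_attention_heads,
            AttentionKind::Gqa { num_kv_heads } => *num_kv_heads,
        }
    }
}

/// Longest context that fits a per-layer FP16 KV cache budget:
/// budget / (2 tensors (K,V) x kv_heads x head_dim x 2 bytes).
fn max_context_tokens(budget_bytes: usize, kv_heads: usize, head_dim: usize) -> usize {
    budget_bytes / (2 * kv_heads * head_dim * 2)
}

fn main() {
    let budget = 512 * 1024 * 1024; // 512 MiB per layer
    let gqa = AttentionKind::Gqa { num_kv_heads: 8 };
    let mha = AttentionKind::Mha;
    let gqa_ctx = max_context_tokens(budget, gqa.kv_heads(32), 128);
    let mha_ctx = max_context_tokens(budget, mha.kv_heads(32), 128);
    // Under the same budget, 8 KV heads fit 4x the context of 32 KV heads.
    assert_eq!(mha_ctx, 32 * 1024);
    assert_eq!(gqa_ctx, 4 * mha_ctx);
}
```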

Worked example: 32K context

Model: Mistral-7B-Instruct-v0.2

  • Query heads: 32, KV heads: 8, head dim: 128
  • KV cache per layer: 32K tokens × (128 + 128) per-head K+V dims × 2 bytes × 8 KV heads = 134 MB/layer
  • Full cache (32 layers): 4.3 GB
  • Decode speedup vs. MHA: ~15–20% (decode is memory-bandwidth-bound; far fewer KV bytes read per step)

GQA vs MHA baseline (32K context, FP16)

| Model | kv_heads | q_heads | KV/layer | Full cache | vs MHA |
|---|---|---|---|---|---|
| Llama-3-8B (GQA) | 8 | 32 | 128 MiB | 4 GiB (32L) | 4× smaller |
| Llama-3-70B (GQA) | 8 | 64 | 128 MiB | 10 GiB (80L) | 8× smaller |
| Llama-2-7B (MHA baseline) | 32 | 32 | 512 MiB | 16 GiB (32L) | — |


Released under the Apache 2.0 License.