Grouped Query Attention (GQA)

Reduces KV cache size by sharing keys and values across multiple query heads.

Formula

For h query heads, g key-value groups (g << h):

GQA (Q, K, V) = Concat ({head}_{1}, \dots, {head}_{h}) W^{O}

where heads are grouped into g blocks, each sharing the same K, V projection:

{head}_{i} = Attention (Q_{i} W_{i}^{Q}, K W^{K}, V W^{V})

Ratio reduction: standard MHA has h KV heads; GQA has g, achieving h/g compression.

Why this variant

GQA exists because MHA's KV cache grew linearly with every query head — unaffordable past 7B model scale at 32K context — while MQA's single shared K/V head degraded quality noticeably on reasoning benchmarks. GQA interpolates between the two and lands on the empirically correct ratio (typically 4–8 KV heads per 32 query heads). It was formalized in Ainslie et al., 2023, shipped in production in Llama 2 and Mistral 7B, and refined in Llama 3 (2024) and Llama 4 (2025–2026).

hwLedger accounting gotcha. num_key_value_heads lives at the top level of config.json for HuggingFace-style configs but inside llama.attention.head_count_kv for GGUF metadata. hwledger-arch reads both; if you are writing a new classifier path, do not assume the HF name.

Memory footprint (32K context, 7B model)

Mistral 7B → Mistral-7B-Instruct-v0.2 with GQA (8 KV heads instead of 32):

KV cache reduction: 32/8 = 4x smaller
Old: 268 MB/layer × 32 layers = 8.6 GB
New: 67 MB/layer × 32 layers = 2.1 GB
Decode latency: minimal (same attention computation on fewer K, V vectors)

Which models use it

Mistral 7B Instruct v0.2 (8 KV heads, 32K context)
LLaMA 2-Chat (upgrade path over LLaMA 2)
Llama-3 (70B variant, 8 KV heads)
Phi-3 (Microsoft, 4K context, 32 heads, 8 KV groups)

hwLedger variant

AttentionKind::GQA { num_kv_heads } — stores explicit KV head count. Enables dynamic planning: fewer KV heads = smaller cache = longer context or batch size.

Worked example: 32K context

Model: Mistral-7B-Instruct-v0.2

Query heads: 32, KV heads: 8
KV cache per layer: 32K tokens × (256 + 256) × 2 bytes × 8 = 131 MB/layer
Full cache (32 layers): 4.2 GB
Speedup vs MHA: ~15-20% decode (reduced KV computation)

GQA vs MHA baseline (32K context, FP16)

Model	kv_heads	q_heads	KV/layer	Full cache (layers)	vs MHA
Llama-3-8B (GQA)	8	32	64 MiB	2.0 GiB (32L)	4× smaller
Llama-3-70B (GQA)	8	64	64 MiB	5.0 GiB (80L)	8× smaller
Llama-2-7B MHA baseline	32	32	256 MiB	8.0 GiB (32L)	1×

2026 citations

Ainslie et al., 2023: "GQA: Training Generalized Multi-Query Transformers" — introduces GQA framework
Chia et al., 2024: "Mistral 7B — production GQA deployment

Multi-Query Attention (MQA) — extreme compression
Llama-3 Analysis
Architecture Dispatch

Grouped Query Attention (GQA) ​

Formula ​

Why this variant ​

Memory footprint (32K context, 7B model) ​

Which models use it ​

hwLedger variant ​

Worked example: 32K context ​

GQA vs MHA baseline (32K context, FP16) ​

2026 citations ​

Related ​