Multi-Head Latent Attention (MLA)
Compresses KV cache by projecting to a low-rank latent space before multi-head operation.
Formula
Project the hidden state h_t jointly into a latent vector of dimension d_latent << d_model:

  c_t = W_down · h_t        (c_t has d_latent elements; this is the only tensor cached)

Then reconstruct per-head keys and values from the latent and apply standard multi-head attention:

  k_t = W_up_K · c_t,   v_t = W_up_V · c_t
Benefit: the KV cache holds one d_latent-sized vector per token instead of two d_model-sized tensors (K and V).
Why this variant
MLA was the DeepSeek team's answer to the specific problem that even GQA's 8× compression left long-context (>100K) inference infeasible on commodity hardware for models in the 200B+ parameter class. By projecting into a latent space before splitting into heads, MLA stores a single kv_lora_rank-sized tensor per token instead of per-head K and V tensors — a 10–16× reduction over GQA at equivalent quality. It was introduced in DeepSeek-V2 (2024) and productionized in DeepSeek-V3 (2024–2025) and DeepSeek-R1 (2025).
hwLedger accounting gotcha. MLA's KV footprint is kv_lora_rank * bytes per token per layer (plus a small decoupled-RoPE key of qk_rope_head_dim elements per token) — NOT 2 * num_kv_heads * head_dim * bytes. A naive reuse of the GQA formula overstates memory by an order of magnitude or more for DeepSeek-V3. AttentionKind::MLA { latent_dim } carries the latent dim explicitly; the planner will refuse to produce a result if latent_dim is missing rather than silently fall back to GQA math.
Memory footprint (32K context, 7B model)
Illustrative 7B-class dense config (d_latent = 256, d_model = 4096, 32 layers, FP16):
- KV cache per layer: 32,768 × 256 × 2 bytes = 16 MiB/layer
- Full cache: 512 MiB for 32 layers
- Savings: ~32× vs standard MHA, which caches 2 × d_model = 8,192 values per token
Which models use it
- DeepSeek-V2 (128K context; latent KV plus a small decoupled RoPE key)
- DeepSeek-V3 and DeepSeek-R1 (same MLA design, kv_lora_rank = 512)
MLA is production-proven for ultra-long context models (100K+).
hwLedger variant
AttentionKind::MLA { latent_dim } — stores the latent dimension for dynamic planning, enabling the longest context windows on memory-constrained hardware.
Worked example: 32K context
DeepSeek-V2 (236B-parameter mixture-of-experts, 21B active; kv_lora_rank = 512, 60 layers):
- KV cache, all layers, FP16: 32,768 × 512 × 2 bytes × 60 ≈ 1.9 GiB per sequence
- Decode batch of 64 sequences: ~120 GiB of KV cache
- An MHA cache of the same shape (2 × 128 heads × 128 dims per token) would be ~64× larger
MLA vs MHA baseline (32K context, FP16)
| Model | kv_lora_rank | layers | KV/layer | Full cache | vs MHA baseline |
|---|---|---|---|---|---|
| DeepSeek-V2 MLA | 512 | 60 | 32 MiB | ~1.9 GiB | ~64× smaller |
| DeepSeek-V3 MLA | 512 | 61 | 32 MiB | ~1.9 GiB | ~64× smaller |
| DeepSeek-V3 as-if-MHA (hypothetical) | — | 61 | 2 GiB | ~122 GiB | 1× |
The hypothetical MHA row caches full K and V for 128 heads of dim 128 (65,536 values per token); the MLA rows cache a single 512-element latent per token. The 64-element decoupled RoPE key is omitted, so real MLA caches are slightly larger (closer to ~57× smaller).
Citations
- DeepSeek-V2 Technical Report (2024) — production MLA with mixture-of-experts