
Multi-Head Latent Attention (MLA)

Compresses KV cache by projecting to a low-rank latent space before multi-head operation.

Formula

Project the hidden state X down to a single shared latent of dimension d_latent << d_model:

c_KV = X · W_DKV ∈ R^(batch × context × d_latent)

Per-head keys and values are recovered by up-projection, K = c_KV · W_UK and V = c_KV · W_UV; only c_KV is cached.

Then apply standard multi-head attention on latent space:

MLA(Q, K, V) = Concat(head_1, …, head_h) · W_O

Benefit: KV cache is d_latent-sized instead of d_model-sized.
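That benefit is easy to make concrete. A minimal sketch (not hwLedger code; function names are illustrative), assuming FP16 (2 bytes/element) and a single cached latent per token per layer:

```rust
/// Standard MHA caches full K and V: 2 * d_model elements per token per layer.
fn mha_kv_bytes_per_token(d_model: usize, bytes_per_elem: usize) -> usize {
    2 * d_model * bytes_per_elem
}

/// MLA caches one shared latent: d_latent elements per token per layer.
fn mla_kv_bytes_per_token(d_latent: usize, bytes_per_elem: usize) -> usize {
    d_latent * bytes_per_elem
}
```

With d_model = 4096 and d_latent = 256, this gives 16,384 vs 512 bytes per token per layer.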

Why this variant

MLA was the DeepSeek team's answer to a specific problem: even GQA's 8× compression left long-context (>100K-token) inference infeasible on commodity hardware for models in the 200B+ parameter class. By projecting into a latent space before splitting into heads, MLA stores a single kv_lora_rank-sized tensor per token instead of per-head K and V tensors, a 10–16× reduction over GQA at comparable quality. It was introduced in DeepSeek-V2 (2024) and productionized in DeepSeek-V3 (2024) and DeepSeek-R1 (2025).

hwLedger accounting gotcha. MLA's KV footprint is kv_lora_rank * bytes per token per layer (a single shared latent), NOT 2 * num_kv_heads * head_dim * bytes. A naive reuse of the GQA formula overstates memory by ~10× for DeepSeek-V3. AttentionKind::MLA { latent_dim } carries the latent dimension explicitly; the planner refuses to produce a result if latent_dim is missing rather than silently falling back to GQA math.
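A sketch of that planner rule. Only AttentionKind::MLA { latent_dim } is named by hwLedger; the other variants and the helper function are illustrative assumptions:

```rust
/// Illustrative attention-variant descriptor; only the MLA { latent_dim }
/// shape comes from hwLedger, the rest of this enum is an assumption.
#[allow(non_camel_case_types)]
enum AttentionKind {
    MHA { num_heads: usize, head_dim: usize },
    GQA { num_kv_heads: usize, head_dim: usize },
    MLA { latent_dim: Option<usize> },
}

/// KV-cache bytes per token per layer. Returns None for MLA when the
/// latent dim is unknown, instead of silently reusing the GQA formula.
fn kv_bytes_per_token(kind: &AttentionKind, bytes_per_elem: usize) -> Option<usize> {
    match kind {
        AttentionKind::MHA { num_heads, head_dim } => {
            Some(2 * num_heads * head_dim * bytes_per_elem)
        }
        AttentionKind::GQA { num_kv_heads, head_dim } => {
            Some(2 * num_kv_heads * head_dim * bytes_per_elem)
        }
        // MLA caches one shared latent tensor, not per-head K and V.
        AttentionKind::MLA { latent_dim } => latent_dim.map(|d| d * bytes_per_elem),
    }
}
```

At FP16, kv_bytes_per_token(&AttentionKind::MLA { latent_dim: Some(512) }, 2) yields Some(1024) bytes per token per layer; with latent_dim: None it yields None rather than a GQA-style guess.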

Memory footprint (32K context, 7B-class model)

A 7B-class configuration with MLA (d_latent = 256 vs d_model = 4096, 32 layers, FP16):

  • KV cache per layer: 32,000 tokens × 256 × 2 bytes ≈ 16.4 MB/layer
  • Full cache: ≈524 MB across 32 layers
  • Savings: ≈32× vs standard MHA, which caches 2 × 4096 values per token per layer
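The bullet arithmetic above can be reproduced directly. A sketch assuming the conventions used here (32K = 32,000 tokens, FP16, one cached latent per token per layer):

```rust
/// Returns (per-layer MB, full-cache MB) for an MLA KV cache.
fn mla_cache_mb(context_tokens: u64, d_latent: u64, bytes_per_elem: u64, layers: u64) -> (f64, f64) {
    let per_layer = (context_tokens * d_latent * bytes_per_elem) as f64 / 1e6;
    (per_layer, per_layer * layers as f64)
}
```

mla_cache_mb(32_000, 256, 2, 32) returns (16.384, 524.288), matching the ≈16.4 MB/layer and ≈524 MB figures.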

Which models use it

  • DeepSeek-V2 (128K context, latent-only KV)
  • DeepSeek-V3 and DeepSeek-R1 (same MLA design, kv_lora_rank = 512)

MLA is production-proven for ultra-long context models (100K+).

hwLedger variant

AttentionKind::MLA { latent_dim } — stores latent dimension for dynamic planning. Enables longest context windows on memory-constrained hardware.

Worked example: 32K context

DeepSeek-V2 (236B-parameter mixture-of-experts, d_latent = 256, 60 layers):

  • KV cache, all layers: 32,000 tokens × 256 × 2 bytes (FP16) × 60 layers ≈ 983 MB
  • Decode batch: 64 sequences decoded simultaneously
  • Total memory with model weights: ~50 GB (vs 100+ GB with standard attention)

MLA vs MHA baseline (32K context, FP16)

| Model | kv_lora_rank | Layers | KV/layer | Full cache | vs MHA baseline |
|---|---|---|---|---|---|
| DeepSeek-V2 MLA | 256 | 60 | 16 MiB | 960 MiB | ~16× smaller |
| DeepSeek-V3 MLA | 512 | 61 | 32 MiB | ~1.9 GiB | ~8× smaller |
| DeepSeek-V3 as-if-MHA (hypothetical) | — | 61 | ~256 MiB | ~15 GiB | — |
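The table's absolute figures follow from its conventions (32K = 32,768 tokens, FP16, one cached latent per token per layer). A sketch:

```rust
/// Full KV cache in MiB: context * kv_lora_rank * 2 bytes (FP16) * layers.
fn mla_full_cache_mib(context_tokens: u64, kv_lora_rank: u64, layers: u64) -> f64 {
    (context_tokens * kv_lora_rank * 2 * layers) as f64 / (1024.0 * 1024.0)
}
```

mla_full_cache_mib(32_768, 256, 60) gives 960 MiB (the DeepSeek-V2 row), and mla_full_cache_mib(32_768, 512, 61) gives 1952 MiB ≈ 1.9 GiB (the DeepSeek-V3 row).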


Released under the Apache 2.0 License.