Multi-Head Latent Attention (MLA)
Compresses KV cache by projecting to a low-rank latent space before multi-head operation.
Formula
Project the hidden state h_t jointly into a latent vector of dimension d_latent << d_model:

  c_t = W_down · h_t        (c_t has d_latent elements; this is the only tensor cached)

Then reconstruct per-head keys and values from the latent and apply standard multi-head attention:

  k_t = W_up_K · c_t,   v_t = W_up_V · c_t
Benefit: the KV cache holds one d_latent-sized vector per token instead of two d_model-sized tensors (K and V).
Why this variant
MLA was the DeepSeek team's answer to the specific problem that even GQA's 8× compression left long-context (>100K) inference infeasible on commodity hardware for models in the 200B+ parameter class. By projecting into a latent space before splitting into heads, MLA stores a single kv_lora_rank-sized tensor per token instead of per-head K and V tensors — a 10–16× reduction over GQA at equivalent quality. It was introduced in DeepSeek-V2 (2024) and productionized in DeepSeek-V3 (2024–2025) and DeepSeek-R1 (2025).
hwLedger accounting gotcha. MLA's KV footprint is kv_lora_rank * bytes per token per layer (plus a small decoupled-RoPE key of qk_rope_head_dim elements per token) — NOT 2 * num_kv_heads * head_dim * bytes. A naive reuse of the GQA formula overstates memory by an order of magnitude or more for DeepSeek-V3. AttentionKind::MLA { latent_dim } carries the latent dim explicitly; the planner will refuse to produce a result if latent_dim is missing rather than silently fall back to GQA math.
Memory footprint (32K context, 7B model)
Illustrative 7B-class dense config (d_latent = 256, d_model = 4096, 32 layers, FP16):
- KV cache per layer: 32,768 × 256 × 2 bytes = 16 MiB/layer
- Full cache: 512 MiB for 32 layers
- Savings: ~32× vs standard MHA, which caches 2 × d_model = 8,192 values per token
Which models use it
- DeepSeek-V2 (128K context; latent KV plus a small decoupled RoPE key)
- DeepSeek-V3 and DeepSeek-R1 (same MLA design, kv_lora_rank = 512)
MLA is production-proven for ultra-long context models (100K+).
hwLedger variant
AttentionKind::MLA { latent_dim } — stores the latent dimension for dynamic planning, enabling the longest context windows on memory-constrained hardware.
Worked example: 32K context
DeepSeek-V2 (236B-parameter mixture-of-experts, 21B active; kv_lora_rank = 512, 60 layers):
- KV cache, all layers, FP16: 32,768 × 512 × 2 bytes × 60 ≈ 1.9 GiB per sequence
- Decode batch of 64 sequences: ~120 GiB of KV cache
- An MHA cache of the same shape (2 × 128 heads × 128 dims per token) would be ~64× larger
MLA vs MHA baseline (32K context, FP16)
| Model | kv_lora_rank | layers | KV/layer | Full cache | vs MHA baseline |
|---|---|---|---|---|---|
| DeepSeek-V2 MLA | 512 | 60 | 32 MiB | ~1.9 GiB | ~64× smaller |
| DeepSeek-V3 MLA | 512 | 61 | 32 MiB | ~1.9 GiB | ~64× smaller |
| DeepSeek-V3 as-if-MHA (hypothetical) | — | 61 | 2 GiB | ~122 GiB | 1× |
The hypothetical MHA row caches full K and V for 128 heads of dim 128 (65,536 values per token); the MLA rows cache a single 512-element latent per token. The 64-element decoupled RoPE key is omitted, so real MLA caches are slightly larger (closer to ~57× smaller).
Citations
- DeepSeek-V2 Technical Report (2024) — production MLA with mixture-of-experts