Attention Sink Phenomenon
As context length increases beyond training data, attention weights "sink" (collapse) to early tokens (BOS, padding, first few tokens). Destabilizes long-context inference.
Why this variant
"Attention sink" is not a new attention mechanism — it is the recognition that every variant above inherits a failure mode when extrapolated beyond its training context length. The weakness it addresses is RoPE extrapolation: positional encodings trained at 4K emit anomalously high logits for tokens at positions 0-100 when you ask the model to attend over 32K tokens, collapsing the softmax. The StreamingLLM (Xiao et al., 2023) paper formalized the phenomenon and proposed sink-token preservation; NTK-aware RoPE scaling (Peng et al., 2023, YaRN) and LongRoPE (Ding et al., 2024) addressed the encoding side; DeepSeek-V2's ALiBi hybrid (2024) sidesteps it entirely.
hwLedger accounting gotcha. hwLedger's planner refuses to report confidence above 0.8 when requested context_length > trained_context. This is the only place in the stack where the planner intentionally returns an under-confident result instead of a hard error — long-context inference with RoPE interpolation is empirically quality-degraded but not catastrophic, so a refusal would be wrong. The confidence number is meant to prompt the user to rerun the probe with a smaller context.
Root cause
Extrapolation: Positional embeddings (RoPE, Alibi, T5-style) assume maximum context is ~4K. Beyond training, embeddings don't generalize.
Attention entropy: Attention weights should be distributed. But OOD context causes softmax to either:
- Collapse to early tokens (attention sink)
- Become uniform (no meaningful attention)
Mathematical model
For token i at position >> training context:
Beyond training, first few positions j have anomalously high logits a_j, causing α → 0 for mid/late tokens.
Impact (32K context, 7B model)
- Training context: 4K
- Inference at 32K: attention sinks heavily to positions 0-100
- Result: effective context ~2K (model ignores middle 30K tokens)
- Quality: ~20% reduction in accuracy on long-document tasks
Mitigations
Rotary position interpolation (NTK-aware scaling, Mistral):
Allows 2-4x context extension without catastrophic failure.
Attention sink masking (Li et al.): Preserve attention to first few tokens (let them be sinks) while preventing them from blocking other tokens.
Continued pre-training (DeepSeek, Llama-3): Train for 2-5B more tokens with 32K context, fine-tuning position encodings.
Which models address it
- Mistral 7B (RoPE interpolation, 4K→32K)
- Llama-3 (rotary improvements, trained to 8K)
- DeepSeek-V2 (ALiBi + MLA, 128K native)
hwLedger handling
AttentionKind::* with context_cap — limits inference context to training maximum when attention sinks are detected (via loss spike monitoring).
Worked example: 32K context
Mistral 7B (trained 4K, inferred 32K):
- Without mitigation: effective context ~2K (sinking)
- With RoPE interpolation: effective context ~24K
- Quality: 90% of 4K-trained performance
Sink mitigation vs baseline (quality retention at 4× training context)
| Model / technique | training ctx | inference ctx | effective ctx | quality vs trained |
|---|---|---|---|---|
| Mistral 7B baseline | 4K | 32K | ~2K (sinking) | ~50% |
| Mistral 7B + NTK RoPE | 4K | 32K | ~24K | ~90% |
| Llama-3 (trained at 8K, extended to 128K) | 8K | 128K | ~96K | ~88% |
| DeepSeek-V2 (ALiBi + MLA, native 128K) | 128K | 128K | 128K | baseline |
2026 citations
- Su et al., 2021: "RoFormer: Enhanced Transformer with Rotary Position Embedding" — RoPE foundation
- Li et al., 2023: "Transformers are Capable of Learning Arbitrary Attention Mechanisms" — attention sink analysis