Skip to content

Sliding Window Attention

Restricts attention to a local context window (e.g., 4K tokens) while allowing information to flow globally via stacking.

Formula

Each token attends only to the last w tokens (sliding window width):

Attention(Qi,K,V)=softmax(QiK[iw:i]dk)V[iw:i]

Global information: repeated layers allow exponential reach. Layer l has receptive field ~2^l.

Why this variant

Sliding window attention addresses MHA's O(n²) per-layer attention matrix, which is the dominant cost for long-context prefill. By capping each token's attention span to a fixed window (e.g., 4K), prefill becomes linear in sequence length while layer stacking preserves long-range dependency reach exponentially. Introduced in Beltagy et al., 2020, Longformer and BigBird (Zaheer et al., 2020); productionized in Mistral 7B (Chia et al., 2023) and refined in Mistral-NeMo (2024) and Gemma 2/3 (Google, 2024–2025).

hwLedger accounting gotcha. A sliding-window model's KV cache does not grow beyond window_size tokens — but prefill memory still peaks at min(context, window) × num_layers. hwLedger's planner reports both decode-steady-state and prefill-peak; reviewers who look only at the steady-state number miss the 4× spike during initial prompt processing.

Memory footprint (32K context, 7B model)

Mistral 7B (4K window, 32 layers):

  • KV cache per layer: 4K × 4096 × 2 (FP16) = 32 MB/layer
  • Full cache: 1 GB for 32 layers
  • Savings: 8x vs 32K full-attention
  • Effective context: 2^32 due to layer stacking

Which models use it

  • Mistral 7B (4K window, 32K context via rotary embeddings)
  • Llama-2-70B Chat (variants, 4K sliding window)
  • Bloom (1.5K window by design)

Allows long-context inference on limited hardware.

hwLedger variant

AttentionKind::SlidingWindow { window_size } — dynamic window tuning. Smaller window = shorter context but more memory-efficient.

Worked example: 32K context

Mistral 7B with 4K sliding window:

  • Inference latency: O(4K) per token (not 32K)
  • KV cache: only last 4K tokens stored = 31 MB/layer
  • Batch 16: 16 × 32 layers × 32 MB = 16 GB
  • Trade-off: excellent for long documents, worse for cross-document reasoning

Sliding-window vs MHA baseline (FP16)

Modelwindowkv_headseffective KV/layerFull cache (32L)
Mistral 7B SWA409688 MiB256 MiB
Mistral-NeMo 12B SWA4096816 MiB640 MiB
Llama-2-7B MHA baseline (32K)32256 MiB8.0 GiB

2026 citations

Released under the Apache 2.0 License.