LLM Inference Optimization Part 3 — Sparse Attention in Practice
Sliding Window, Sink Attention, DeepSeek DSA, IndexCache, and Nvidia DMS. From dynamic token selection to Needle-in-a-Haystack evaluation.

In Part 2, we covered KV Cache quantization, compression, and PagedAttention. Those techniques reduce how much data is stored; Part 3 shifts focus to reducing the computation itself with Sparse Attention.
The key question: "Do we really need every token?"
In most cases, the answer is no. In a 128K context, the current token typically needs to attend to only 5–20% of the preceding tokens.
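The sparsity patterns covered in this part can be pictured as attention masks. As a minimal sketch (my own illustration, not code from any of the systems discussed here), the following combines a causal sliding window with a few StreamingLLM-style "sink" tokens that every query keeps attending to; the function name and parameters are hypothetical:

```python
import numpy as np

def sparse_attention_mask(seq_len, window=4, n_sink=1):
    """Boolean mask: True where query i may attend to key j.

    Combines a causal sliding window (each query sees at most
    `window` recent keys) with `n_sink` initial "sink" tokens
    that are always visible, as in StreamingLLM-style methods.
    """
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i                  # never attend to the future
    in_window = (i - j) < window     # recent tokens only
    is_sink = j < n_sink             # first tokens always kept
    return causal & (in_window | is_sink)

mask = sparse_attention_mask(seq_len=8, window=3, n_sink=1)
# Fraction of the full causal score matrix actually computed:
density = mask.sum() / np.tril(np.ones((8, 8))).sum()
```

Even at this toy size the mask is noticeably sparser than full causal attention, and the gap widens linearly with context length because each row's budget stays fixed.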
The Problem with Full Attention
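A back-of-the-envelope calculation shows the scale of the problem: the number of attention score entries grows quadratically with context length under full causal attention, but only linearly once a fixed window caps how many keys each query sees. The function below is a rough sketch of that arithmetic (counts of score-matrix entries per head per layer, ignoring constants), not a measurement from any particular implementation:

```python
def score_entries(seq_len, window=None):
    """Count attention score entries for one head and one layer.

    With window=None this is full causal attention (a triangular
    score matrix); with a window, each query sees at most `window`
    keys, so the count grows linearly in seq_len.
    """
    if window is None:
        return seq_len * (seq_len + 1) // 2
    return sum(min(i + 1, window) for i in range(seq_len))

full = score_entries(128_000)                  # quadratic: ~8.2e9
sparse = score_entries(128_000, window=4096)   # linear in seq_len
ratio = sparse / full                          # roughly 6%
```

At 128K tokens with a 4K window, the sparse variant computes on the order of 6% of the full score matrix, which is exactly the 5–20% regime mentioned above.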
Related Posts

LLM Inference Optimization Part 4 — Production Serving
Production deployment with vLLM and TGI. Continuous Batching, Speculative Decoding, memory budget design, and throughput benchmarks.

LLM Inference Optimization Part 2 — KV Cache Optimization
KV Cache quantization (int8/int4), PCA compression (KVTC), and PagedAttention (vLLM). Hands-on memory reduction code and scenario-based configuration guide.

LLM Inference Optimization Part 1 — Attention Mechanism Deep Dive
Build Self-Attention from scratch. Compare MHA → GQA → MQA evolution in code. KV Cache mechanics and Prefill vs Decode analysis.