LLM Inference Optimization Part 2 — KV Cache Optimization
KV Cache quantization (int8/int4), PCA compression (KVTC), and PagedAttention (vLLM). Hands-on memory reduction code and scenario-based configuration guide.

In Part 1, we covered the structure of Attention and how the KV Cache works. In this part, we look at practical techniques for optimizing the KV Cache itself, with code.
Even when model weights are reduced through quantization, the KV Cache is almost always left in fp16. As context length grows, it is common for the KV Cache to consume more than half of total VRAM. We cover three approaches to this problem.
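To see how quickly this adds up, the cache size follows directly from the model shape: K and V are each `batch × seq_len × n_kv_heads × head_dim` per layer. The sketch below uses a hypothetical Llama-7B-like configuration (32 layers, 32 KV heads, head dim 128) purely for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # Factor of 2 accounts for storing both K and V for every layer
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 7B-class config: 32 layers, 32 KV heads, head_dim 128
fp16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8, bytes_per_elem=2)
int8 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8, bytes_per_elem=1)
print(f"fp16: {fp16 / 2**30:.1f} GiB, int8: {int8 / 2**30:.1f} GiB")
# → fp16: 16.0 GiB, int8: 8.0 GiB
```

At batch 8 and a 4K context, the fp16 cache alone already rivals the 7B model's weights, which is why cache-side optimization matters.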
1. KV Cache Quantization
How It Works
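As a preview of the core idea, here is a minimal per-token absmax int8 quantize/dequantize sketch in NumPy. The helper names, shapes, and per-token (per-row) scaling scheme are illustrative assumptions, not the API of any particular library:

```python
import numpy as np

def quantize_int8(x):
    # Per-token absmax scaling: one fp32 scale per row (token)
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Reconstruct an fp32 approximation before the attention matmul
    return q.astype(np.float32) * scale

k = np.random.randn(16, 128).astype(np.float32)  # 16 cached tokens, head_dim 128
q, s = quantize_int8(k)
k_hat = dequantize_int8(q, s)
print("max abs error:", np.abs(k - k_hat).max())  # bounded by scale / 2 per token
```

Storing `q` (int8) plus one scale per token halves cache memory versus fp16, at the cost of a small reconstruction error on each K/V read.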