LLM Inference Optimization Part 2 — KV Cache Optimization
KV Cache quantization (int8/int4), PCA compression (KVTC), and PagedAttention (vLLM). Hands-on memory reduction code and scenario-based configuration guide.

In Part 1, we covered the structure of Attention and how the KV Cache works. In this part, we look at practical techniques for optimizing the KV Cache itself, with code.
Even when model weights are reduced through quantization, the KV Cache is almost always left in fp16. As context length grows, it is common for the KV Cache to consume more than half of total VRAM. We cover three approaches to this problem.
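To see how quickly this adds up, the cache size follows directly from the model shape: K and V are each `batch × seq_len × n_kv_heads × head_dim` per layer. The sketch below uses a hypothetical Llama-7B-like configuration (32 layers, 32 KV heads, head dim 128) purely for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # Factor of 2 accounts for storing both K and V for every layer
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 7B-class config: 32 layers, 32 KV heads, head_dim 128
fp16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8, bytes_per_elem=2)
int8 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8, bytes_per_elem=1)
print(f"fp16: {fp16 / 2**30:.1f} GiB, int8: {int8 / 2**30:.1f} GiB")
# → fp16: 16.0 GiB, int8: 8.0 GiB
```

At batch 8 and a 4K context, the fp16 cache alone already rivals the 7B model's weights, which is why cache-side optimization matters.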
1. KV Cache Quantization
How It Works
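As a preview of the core idea, here is a minimal per-token absmax int8 quantize/dequantize sketch in NumPy. The helper names, shapes, and per-token (per-row) scaling scheme are illustrative assumptions, not the API of any particular library:

```python
import numpy as np

def quantize_int8(x):
    # Per-token absmax scaling: one fp32 scale per row (token)
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Reconstruct an fp32 approximation before the attention matmul
    return q.astype(np.float32) * scale

k = np.random.randn(16, 128).astype(np.float32)  # 16 cached tokens, head_dim 128
q, s = quantize_int8(k)
k_hat = dequantize_int8(q, s)
print("max abs error:", np.abs(k - k_hat).max())  # bounded by scale / 2 per token
```

Storing `q` (int8) plus one scale per token halves cache memory versus fp16, at the cost of a small reconstruction error on each K/V read.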