LLM Inference Optimization Part 3 — Sparse Attention in Practice
Sliding Window, Sink Attention, DeepSeek DSA, IndexCache, and Nvidia DMS. From dynamic token selection to Needle-in-a-Haystack evaluation.

In Part 2, we covered KV Cache quantization, compression, and PagedAttention. Those techniques reduce how much data is stored; Part 3 shifts focus to reducing the computation itself with Sparse Attention.
The key question: "Do we really need every token?"
In most cases, the answer is no. In a 128K context, the current token typically needs to attend to only 5–20% of the preceding tokens.
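The sparsity patterns covered in this part can be pictured as attention masks. As a minimal sketch (my own illustration, not code from any of the systems discussed here), the following combines a causal sliding window with a few StreamingLLM-style "sink" tokens that every query keeps attending to; the function name and parameters are hypothetical:

```python
import numpy as np

def sparse_attention_mask(seq_len, window=4, n_sink=1):
    """Boolean mask: True where query i may attend to key j.

    Combines a causal sliding window (each query sees at most
    `window` recent keys) with `n_sink` initial "sink" tokens
    that are always visible, as in StreamingLLM-style methods.
    """
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i                  # never attend to the future
    in_window = (i - j) < window     # recent tokens only
    is_sink = j < n_sink             # first tokens always kept
    return causal & (in_window | is_sink)

mask = sparse_attention_mask(seq_len=8, window=3, n_sink=1)
# Fraction of the full causal score matrix actually computed:
density = mask.sum() / np.tril(np.ones((8, 8))).sum()
```

Even at this toy size the mask is noticeably sparser than full causal attention, and the gap widens linearly with context length because each row's budget stays fixed.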
The Problem with Full Attention
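A back-of-the-envelope calculation shows the scale of the problem: the number of attention score entries grows quadratically with context length under full causal attention, but only linearly once a fixed window caps how many keys each query sees. The function below is a rough sketch of that arithmetic (counts of score-matrix entries per head per layer, ignoring constants), not a measurement from any particular implementation:

```python
def score_entries(seq_len, window=None):
    """Count attention score entries for one head and one layer.

    With window=None this is full causal attention (a triangular
    score matrix); with a window, each query sees at most `window`
    keys, so the count grows linearly in seq_len.
    """
    if window is None:
        return seq_len * (seq_len + 1) // 2
    return sum(min(i + 1, window) for i in range(seq_len))

full = score_entries(128_000)                  # quadratic: ~8.2e9
sparse = score_entries(128_000, window=4096)   # linear in seq_len
ratio = sparse / full                          # roughly 6%
```

At 128K tokens with a 4K window, the sparse variant computes on the order of 6% of the full score matrix, which is exactly the 5–20% regime mentioned above.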
Related Posts

LLM Inference Optimization Part 4 — Production Serving
Production deployment with vLLM and TGI. Continuous Batching, Speculative Decoding, memory budget design, and throughput benchmarks.

LLM Inference Optimization Part 2 — KV Cache Optimization
KV Cache quantization (int8/int4), PCA compression (KVTC), and PagedAttention (vLLM). Hands-on memory reduction code and scenario-based configuration guide.

LLM Inference Optimization Part 1 — Attention Mechanism Deep Dive
Build Self-Attention from scratch. Compare MHA → GQA → MQA evolution in code. KV Cache mechanics and Prefill vs Decode analysis.