Karpathy's microgpt.py Dissected: Understanding GPT's Essence in 150 Lines
A line-by-line dissection of microgpt.py -- a pure Python GPT implementation with zero dependencies. Training, inference, and autograd in 150 lines.

Andrej Karpathy has released new code, and this time it is even more extreme than nanoGPT: a 150-line script that trains and runs inference on a GPT, in pure Python with no external libraries.
No PyTorch. No NumPy. Just three imports: os, math, random.
The comment at the top of the code says it all:
"This file is the complete algorithm. Everything else is just efficiency."
In this post, we dissect microgpt.py line by line. Follow along with the code, and you will see that the algorithm behind GPT is a surprisingly simple composition of mathematical operations.
Overall Structure
microgpt.py breaks down into six parts:
| Part | Lines | Role |
|---|---|---|
| Data & Tokenizer | ~10 | Load name dataset, character-level tokenization |
| Value Class (Autograd) | ~35 | Scalar automatic differentiation engine |
| Parameter Initialization | ~15 | Weight matrix creation (4,192 parameters) |
| Model Architecture | ~40 | Embedding + Attention + MLP + RMSNorm |
| Training Loop | ~20 | Cross-entropy loss + Adam optimizer |
| Inference | ~15 | Name generation via temperature sampling |
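The Value class in the table is the heart of the script: a scalar automatic differentiation engine. The sketch below is not Karpathy's exact code, but a minimal illustration of the same idea, as an illustration: each arithmetic operation records how to push gradients back to its inputs, and backward() replays those closures in reverse topological order.

```python
import math

class Value:
    """Minimal scalar autograd node (a sketch in the spirit of
    microgpt.py's Value class, not the original code)."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule
        # from the output node back to the leaves.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# z = x*y + x, so dz/dx = y + 1 and dz/dy = x
x, y = Value(2.0), Value(3.0)
z = x * y + x
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

Every tensor operation in a full framework reduces to compositions of scalar rules like these; the rest really is just efficiency.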
Total parameters: 4,192. Compared to GPT-2 Small's 124M, that is roughly 30,000x smaller. But the algorithm is identical.
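The inference row in the table refers to temperature sampling. As a generic sketch (again not the article's exact code): divide the logits by a temperature, softmax them, and draw from the resulting distribution; low temperatures sharpen toward the argmax, high temperatures flatten toward uniform.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample a token index from raw logits via temperature-scaled
    softmax (illustrative sketch, not microgpt.py's exact code)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling with one uniform draw.
    r = rng.random()
    cdf = 0.0
    for i, p in enumerate(probs):
        cdf += p
        if r < cdf:
            return i
    return len(probs) - 1

random.seed(0)
print(sample_with_temperature([2.0, 1.0, 0.1], temperature=0.5))
```

In a character-level name generator, each sampled index maps back to a character, and sampling repeats until an end-of-sequence token appears.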