Karpathy's microgpt.py Dissected: Understanding GPT's Essence in 150 Lines

Andrej Karpathy has released new code. This time, it is even more extreme than nanoGPT. A 150-line script that trains and runs inference on a GPT, using pure Python with no external libraries.
No PyTorch. No NumPy. Just three imports: os, math, random.
The comment at the top of the code says it all:
"This file is the complete algorithm. Everything else is just efficiency."
In this post, we dissect microgpt.py line by line. Follow along with the code, and you will see that the algorithm behind GPT is a surprisingly simple composition of mathematical operations.
Overall Structure
microgpt.py breaks down into roughly six parts: the data and tokenizer, the autograd engine (the Value class), parameter initialization, the model architecture, the training loop, and inference.
Total parameters: 4,192. Compared to GPT-2 Small's 124M, that is roughly 30,000x smaller. But the algorithm is identical.
1. Data and Tokenizer
import os
import math
import random

random.seed(42)

if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')

docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)

The dataset is names.txt from Karpathy's makemore project. It contains roughly 32,000 English names.
uchars = sorted(set(''.join(docs)))  # ['a', 'b', ..., 'z'] -> 26 characters
BOS = len(uchars)                    # BOS = 26
vocab_size = len(uchars) + 1         # 27

The tokenizer operates at the character level. It maps a=0, b=1, ..., z=25, with index 26 reserved for the BOS (Beginning of Sequence) token.
An interesting design choice here: the BOS token also doubles as the EOS (End of Sequence) token:
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
# "emma" -> [26, 4, 12, 12, 0, 26]
#            BOS  e   m   m  a  EOS

A single special token handles both start and end. The model distinguishes between the two roles by looking at context (position, preceding characters). This is in fact the same approach as GPT-2's <|endoftext|> token.
2. The Autograd Engine: Value Class
This is the heart of microgpt.py. It reimplements PyTorch's autograd in pure Python.
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # forward-pass value (a scalar)
        self.grad = 0                    # gradient of the loss w.r.t. this node
        self._children = children        # child nodes in the computation graph
        self._local_grads = local_grads  # local gradients w.r.t. each child

Each Value object wraps a single scalar. Whenever an operation is performed, it automatically builds the computation graph.
__slots__ is a Python optimization. Instead of the usual __dict__, it stores attributes in a fixed-size array, saving 50-80 bytes per object. Since tens of thousands of Value objects are created in a single training step, this saving adds up.
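To see the effect concretely, here is a small standalone sketch (not from microgpt.py) comparing a plain class to a __slots__ class; exact byte counts vary by Python version.

import sys

class Plain:
    def __init__(self):
        self.data, self.grad = 0.0, 0.0

class Slotted:
    __slots__ = ('data', 'grad')
    def __init__(self):
        self.data, self.grad = 0.0, 0.0

p, s = Plain(), Slotted()
print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))  # instance plus its per-instance __dict__
print(sys.getsizeof(s))                              # slotted instance: no __dict__ at all
# Slotted instances also reject attributes outside __slots__ with an AttributeError.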
Supported operations:
def __add__(self, other):  # a + b -> local grads: (1, 1)
    other = other if isinstance(other, Value) else Value(other)
    return Value(self.data + other.data, (self, other), (1, 1))

def __mul__(self, other):  # a * b -> local grads: (b, a)
    other = other if isinstance(other, Value) else Value(other)
    return Value(self.data * other.data, (self, other), (other.data, self.data))

def __pow__(self, other):  # a**k -> local grad: k * a^(k-1)
    return Value(self.data**other, (self,), (other * self.data**(other-1),))

def log(self):  # log(a) -> local grad: 1/a
    return Value(math.log(self.data), (self,), (1/self.data,))

def exp(self):  # exp(a) -> local grad: exp(a)
    return Value(math.exp(self.data), (self,), (math.exp(self.data),))

def relu(self):  # max(0, a) -> local grad: 1 if a > 0 else 0
    return Value(max(0, self.data), (self,), (float(self.data > 0),))

These six primitive operations are sufficient to express every computation in GPT:
- linear layer: multiplication (mul) + addition (add)
- RMSNorm: squaring (pow) + mean (add, mul) + reciprocal (pow)
- softmax: exp + division (mul, pow)
- cross-entropy loss: log + negation (mul)
- ReLU: relu
The remaining operators (__neg__, __sub__, __truediv__, etc.) are all composed from these primitives.
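A minimal sketch of how those derived operators can be built purely from the primitives, in the same style micrograd uses (the exact method bodies in microgpt.py may differ slightly):

# Derived operators, expressed in terms of the six primitives above.
# These belong inside the Value class.
def __neg__(self):             # -a  ==  a * -1
    return self * -1

def __sub__(self, other):      # a - b  ==  a + (-b)
    return self + (-other)

def __truediv__(self, other):  # a / b  ==  a * b**-1
    return self * other**-1

def __radd__(self, other):     # lets sum(...) start from 0 + Value
    return self + other

def __rmul__(self, other):     # lets a plain number multiply a Value from the left
    return self * other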
Backpropagation:
def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = 1
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad

Backpropagation proceeds in two stages:
- Traverse the computation graph via DFS to produce a topological sort.
- Walk through it in reverse order, applying the chain rule:
child.grad += local_grad * v.grad
If you are wondering "what exactly is a topological sort and why is it needed?" or "what is the chain rule, really?" -- there is a companion post that explains the math behind these 15 lines from the ground up: Backpropagation From Scratch: Chain Rule, Computation Graphs, and Topological Sort.
The key insight: local_grad is a plain float stored during the forward pass. Backpropagation itself does not create new Value objects. This means second-order derivatives (Hessians) are not possible, but it saves significant memory. PyTorch's default behavior is identical.
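As a quick sanity check of the engine (assuming the Value class above plus the derived operators), here is a tiny graph whose gradients are easy to verify by hand:

a, b, c = Value(2.0), Value(-3.0), Value(10.0)
f = (a * b + c).relu()         # forward: 2*(-3) + 10 = 4, relu(4) = 4
f.backward()
print(a.grad, b.grad, c.grad)  # -3.0, 2.0, 1.0 -- matches df/da = b, df/db = a, df/dc = 1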
3. Parameter Initialization
n_embd = 16      # embedding dimension
n_head = 4       # number of attention heads
n_layer = 1      # number of layers
block_size = 16  # maximum sequence length
head_dim = n_embd // n_head  # dimension per head = 4

matrix = lambda nout, nin, std=0.08: [
    [Value(random.gauss(0, std)) for _ in range(nin)]
    for _ in range(nout)
]

All weights are created via the matrix function. Each element of the 2D list is a Value object. Initial values are sampled from a Gaussian distribution with mean 0 and standard deviation 0.08.
state_dict = {
    'wte': matrix(vocab_size, n_embd),      # token embedding: 27 x 16
    'wpe': matrix(block_size, n_embd),      # position embedding: 16 x 16
    'lm_head': matrix(vocab_size, n_embd),  # output head: 27 x 16
}
for i in range(n_layer):  # a single layer
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)      # 16 x 16
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)      # 16 x 16
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)      # 16 x 16
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)      # 16 x 16
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)  # 64 x 16
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)  # 16 x 64

Parameter count breakdown: wte 432 + wpe 256 + lm_head 432 + the four attention projections (4 x 256 = 1,024) + mlp_fc1 1,024 + mlp_fc2 1,024 = 4,192 scalar Value objects. Each one exists as an independent Python object on the heap. GPT-2 Small's 124M parameters are stored as contiguous float16/float32 tensors in GPU memory. Same algorithm, dramatically different implementation.
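You can confirm the total directly from state_dict (assumes the code above):

n_params = sum(len(row) for mat in state_dict.values() for row in mat)
print(n_params)  # 4192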
4. Model Architecture
First, three utility functions:
def linear(x, w):
    # matrix-vector product; x: list[Value], w: list[list[Value]]
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

def softmax(logits):
    max_val = max(val.data for val in logits)  # subtract the max for numerical stability
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)  # mean of squares
    scale = (ms + 1e-5) ** -0.5             # 1 / sqrt(ms + eps)
    return [xi * scale for xi in x]

linear is a matrix-vector product. It is equivalent to PyTorch's F.linear(x, W), but spelled out as scalar operations. For a 16-dimensional input producing a 16-dimensional output, 16 x (16 multiplications + 15 additions) = 496 Value objects are created.
rmsnorm implements RMSNorm from Zhang & Sennrich (2019). Unlike LayerNorm, it does not subtract the mean (no re-centering) and has no learnable parameters (gamma, beta). LLaMA, Gemma, and other recent models use RMSNorm but include a learnable gain parameter. microgpt.py omits even that.
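For contrast, here is a plain-float sketch of the LayerNorm that GPT-2 uses next to the RMSNorm above; it is illustrative only and not part of microgpt.py (gamma and beta are the learnable gain and bias that RMSNorm drops).

def layernorm_floats(x, gamma, beta, eps=1e-5):
    # re-centers (subtracts the mean) and applies a learnable gain and bias
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [g * (xi - mean) / (var + eps) ** 0.5 + b
            for xi, g, b in zip(x, gamma, beta)]

def rmsnorm_floats(x, eps=1e-5):
    # only rescales by the root mean square; no centering, nothing learnable
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi * (ms + eps) ** -0.5 for xi in x]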
The GPT function:
def gpt(token_id, pos_id, keys, values):
    # 1. Embeddings
    tok_emb = state_dict['wte'][token_id]
    pos_emb = state_dict['wpe'][pos_id]
    x = [t + p for t, p in zip(tok_emb, pos_emb)]
    x = rmsnorm(x)  # normalization right after the embeddings (not in GPT-2)

    for li in range(n_layer):
        # 2. Multi-head attention
        x_residual = x
        x = rmsnorm(x)  # pre-norm
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        # KV-cache: record the current token's K and V
        keys[li].append(k)
        values[li].append(v)
        x_attn = []
        for h in range(n_head):  # 4 heads, 4 dimensions each
            hs = h * head_dim
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]    # all past K
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]  # all past V
            # scaled dot-product attention
            attn_logits = [
                sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
                for t in range(len(k_h))
            ]
            attn_weights = softmax(attn_logits)
            head_out = [
                sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
                for j in range(head_dim)
            ]
            x_attn.extend(head_out)
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
        x = [a + b for a, b in zip(x, x_residual)]  # residual connection

        # 3. MLP
        x_residual = x
        x = rmsnorm(x)
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])  # 16 -> 64
        x = [xi.relu() for xi in x]                      # ReLU (GPT-2 uses GELU)
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])  # 64 -> 16
        x = [a + b for a, b in zip(x, x_residual)]       # residual connection

    # 4. Output logits
    logits = linear(x, state_dict['lm_head'])
    return logits  # 27-dimensional vector

This function takes a single token and returns logits, i.e. unnormalized scores over the 27 possible next tokens. It is structurally identical to GPT-2's forward pass.
Points worth noting:
KV-cache emerges naturally. The gpt() function processes one token at a time, appending K and V to lists. During attention computation, it references all previous K and V entries. This is exactly the same mechanism used for KV-cache in production LLM inference.
Causal masking is implicit. At position t, the keys list contains only entries from positions 0, 1, ..., t. No separate mask matrix is needed.
The KV-cache is used during training as well. Since all K and V values are Value objects, loss.backward() correctly backpropagates through the cached values. Mathematically, this is equivalent to processing the entire sequence in parallel.
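A quick illustration (assuming the code above): feeding a name through gpt() one token at a time and watching the cache grow shows why no mask matrix is needed.

keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
for pos_id, token_id in enumerate([BOS] + [uchars.index(ch) for ch in "emma"]):
    logits = gpt(token_id, pos_id, keys, values)
    print(f"position {pos_id}: attends over {len(keys[0])} cached position(s)")
# position 0: 1, position 1: 2, ... -- the "causal mask" is simply the current cache length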
5. Comparison with GPT-2
The absence of a final norm is the most practically significant difference. Both GPT-2 and LLaMA apply normalization right before the output projection. It is not an issue at this scale, but could affect training stability in larger models.
The lack of weight tying is also notable. wte (432 parameters) and lm_head (432 parameters) are separate. In GPT-2, these share the same weights. microgpt.py keeps them separate for simplicity.
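For reference, GPT-2-style weight tying would be a one-line, hypothetical change (not in microgpt.py): point lm_head at the same matrix as wte, so the embedding and the output head share their 432 Values.

# Hypothetical: tie the output head to the token embedding, GPT-2 style.
# A real implementation would also deduplicate the shared Values in the parameter list.
state_dict['lm_head'] = state_dict['wte']  # both names now reference the same Value objects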
6. Training Loop
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
m = [0.0] * len(params) # 1차 모멘트 (momentum)
v = [0.0] * len(params) # 2차 모멘트 (adaptive learning rate)
num_steps = 1000
for step in range(num_steps):
doc = docs[step % len(docs)]
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
n = min(block_size, len(tokens) - 1)
keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
losses = []
for pos_id in range(n):
token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
logits = gpt(token_id, pos_id, keys, values)
probs = softmax(logits)
loss_t = -probs[target_id].log()
losses.append(loss_t)
loss = (1 / n) * sum(losses)
loss.backward()
lr_t = learning_rate * (1 - step / num_steps) # Linear decay
for i, p in enumerate(params):
m[i] = beta1 * m[i] + (1 - beta1) * p.grad
v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
m_hat = m[i] / (1 - beta1 ** (step + 1))
v_hat = v[i] / (1 - beta2 ** (step + 1))
p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
p.grad = 0Batch size is 1. Each step processes a single name. No mini-batching, no gradient accumulation.
Cross-entropy loss is implemented manually: softmax(logits) followed by -log(p[target]). This is less numerically stable than the fused log_softmax used in production code, which computes log-probabilities directly via the log-sum-exp trick instead of exponentiating and then taking the log, but it is not a problem at this scale.
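A plain-float sketch of the difference (illustrative, not part of microgpt.py):

import math

def cross_entropy_naive(logits, target):
    exps = [math.exp(l) for l in logits]  # exp() can overflow for large logits
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[target])       # log() of a tiny probability loses precision

def cross_entropy_stable(logits, target):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # log-sum-exp trick
    return log_z - logits[target]         # equals -log_softmax(logits)[target]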
The Adam optimizer follows the original Kingma & Ba (2015) paper. Bias correction (m_hat, v_hat) is included. beta1=0.85 is slightly lower than the standard 0.9, allowing the optimizer to respond more quickly to recent gradients in the noisy batch-size-1 setting.
The learning rate schedule is linear decay, starting from 0.01 and decreasing linearly to 0 over 1000 steps. Production models typically use warmup + cosine decay.
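For comparison, a typical warmup + cosine schedule looks roughly like this (warmup_steps and min_lr are illustrative values, not from microgpt.py):

import math

def lr_at(step, max_lr=0.01, min_lr=0.0, warmup_steps=100, num_steps=1000):
    if step < warmup_steps:  # linear warmup from 0 to max_lr
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (num_steps - warmup_steps)  # 0 -> 1
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))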
Here are the actual results from training for 500 steps. Loss starts at 3.2 and converges to around 2.4. Each step takes about 0.23 seconds -- slow because it is pure Python, but training clearly progresses.

7. Inference
temperature = 0.5

for sample_idx in range(20):
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    token_id = BOS
    sample = []
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")

A temperature of 0.5 sharpens the distribution (more "conservative" than 1.0). Dividing logits by the temperature before applying softmax amplifies the probability of higher-ranked tokens.
random.choices performs weighted random sampling. It serves the same role as PyTorch's torch.multinomial.
When the BOS token (=26) is generated, the sequence terminates. This is where the BOS token's dual role as EOS becomes apparent.
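A toy illustration with plain floats (fixed logits, three temperatures) shows the effect:

import math

def softmax_at(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    return [round(e / sum(exps), 3) for e in exps]

logits = [2.0, 1.0, 0.0]
print(softmax_at(logits, 1.0))  # [0.665, 0.245, 0.09]   -- baseline
print(softmax_at(logits, 0.5))  # [0.867, 0.117, 0.016]  -- sharper, more conservative
print(softmax_at(logits, 1.5))  # [0.563, 0.289, 0.148]  -- flatter, more adventurous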
Names generated at various temperatures:
- T=0.3: jamel, aneya, jailen, raryen, adara, kayri, alya, maya, arire, ara
- T=0.5: kolla, liylen, mavan, aikili, eara, karer, shane, mara, alema, amora
- T=0.8: maimel, risonen, faxuela, elyna, jaielev, coelenaimeea, harir
- T=1.0: taisar, luus, vasol, yynuev, fhazel, majazh, buryn
- T=1.5: rusonnodra, wnienln, ravelo, nnh, siclka, raarlr
Lower temperatures produce "safe" names (short, regular), while higher temperatures yield "experimental" ones (long, irregular).
8. Hidden Insights
Looking closely at microgpt.py reveals lessons that go beyond the code itself.
"Everything is scalar operations"
Everything that happens in GPT's forward pass -- embedding lookup, attention computation, MLP transformation -- ultimately reduces to scalar additions and multiplications. PyTorch's tensor operations simply bundle tens of thousands of these scalar operations and execute them in parallel.
When training on the name "emma" (5 positions), the forward pass creates roughly 39,700 Value objects. For "christopher" (12 positions), that climbs to about 99,500. Each object is 72-120 bytes, so a single step loads ~9MB of Python objects onto the heap. PyTorch processes the same computation in 1.89ms. microgpt.py takes 211.7ms. That is a 112x difference -- on CPU alone. With a GPU, the gap would be 10,000x or more.
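To reproduce a count like this yourself, one ad-hoc approach (assuming the code above; not part of microgpt.py) is to temporarily wrap Value.__init__ and run a single forward pass:

counter = {'n': 0}
_orig_init = Value.__init__

def _counting_init(self, data, children=(), local_grads=()):
    counter['n'] += 1
    _orig_init(self, data, children, local_grads)

Value.__init__ = _counting_init
keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
for pos_id, token_id in enumerate([BOS] + [uchars.index(ch) for ch in "emma"]):
    gpt(token_id, pos_id, keys, values)
Value.__init__ = _orig_init
print(counter['n'])  # tens of thousands of scalar nodes for one 5-position forward pass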
After training, examining the distribution of each weight matrix reveals that meaningful structure has formed beyond the initialization (std=0.08). The lm_head (output head) has the widest distribution, because it needs a diverse range of values to distinguish between each token.

"KV-cache is a discovery, not an invention"
In microgpt.py, KV-cache is not a deliberately added optimization. It emerges naturally from processing tokens one at a time -- storing previous tokens' K and V just makes sense. KV-cache in production LLMs follows the same principle: do not recompute what has already been computed.
"Causal masking is a consequence, not a constraint"
Without any explicit mask matrix, only past tokens accumulate in the KV-cache, so causal attention arises naturally. Masking is a structural consequence of autoregressive generation, not an artificially imposed constraint.
Visualizing the attention patterns after training reveals that each of the 4 heads has learned a distinct pattern. Some heads focus on the immediately preceding character, while others reference the BOS token (sequence start). Even with just 4-dimensional heads, meaningful attention patterns emerge.

"Autograd is simpler than you think"
Autograd is just the chain rule applied over a computation graph: store local gradients during the forward pass, then multiply them through during the backward pass. This is the core of PyTorch, JAX, and TensorFlow alike. microgpt.py's backward() fits in about 15 lines.
9. "Everything else is just efficiency"
What is in this code (= the algorithm):
- Token embedding + position embedding
- Multi-head causal self-attention (QKV projection + scaled dot-product)
- Feedforward network (expansion + nonlinearity + compression)
- Residual connection
- Normalization (RMSNorm)
- Autoregressive next-token prediction + cross-entropy loss
- Adam optimizer
What is not in this code (= efficiency optimizations):
- Tensor operations and GPU kernels (the same scalar math, batched and parallelized)
- Mini-batching and gradient accumulation
- Fused, numerically stable kernels such as log_softmax
- Learning-rate warmup and cosine decay
None of these change the algorithm. The conceptual leap from microgpt.py to LLaMA-3 405B is zero. The engineering leap is enormous.
This is the real message of this code. GPT is a specific connection pattern of differentiable arithmetic operations (attention + MLP + residual), trained via gradient descent. It can be expressed in 150 lines of Python. Everything else is a matter of scale.
From micrograd to microgpt: Karpathy's Educational Philosophy
This code is the culmination of Karpathy's lineage of educational projects:
- micrograd (2020): Just the autograd engine. Backpropagation for neural networks, built from scratch.
- makemore (2022): Character-level language model. Gradual progression from bigram to Transformer.
- nanoGPT (2023): A trainable GPT-2. Depends on PyTorch, but with minimal code.
- microgpt (2025): Everything in one file. Autograd + model + training + inference in pure Python.
microgpt.py is the answer to the question "What if PyTorch didn't exist?" The answer: "You'd run the same algorithm, just 30,000x slower."
Key Takeaways
- GPT's complete algorithm -- embeddings, attention, MLP, residuals, normalization, cross-entropy, Adam -- fits in roughly 150 lines of pure Python.
- Autograd is the chain rule on a computation graph: store local gradients on the way forward, multiply them through on the way back.
- KV-cache and causal masking are not add-ons; they fall out naturally from processing tokens one at a time.
- Everything PyTorch and GPUs add is efficiency, not algorithm: the same math, batched, fused, and parallelized.
References
- Karpathy, "microgpt.py." GitHub Gist, 2025.
- Karpathy, "micrograd: A tiny scalar-valued autograd engine." GitHub, 2020.
- Karpathy, "nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs." GitHub, 2023.
- Radford et al., "Language Models are Unsupervised Multitask Learners." OpenAI, 2019.
- Kingma & Ba, "Adam: A Method for Stochastic Optimization." ICLR, 2015.
- Zhang & Sennrich, "Root Mean Square Layer Normalization." NeurIPS, 2019.