Backpropagation From Scratch: Chain Rule, Computation Graphs, and Topological Sort

The backward() function in microgpt.py is 15 lines long. But these 15 lines are a complete implementation of the core algorithm that underpins all of deep learning -- backpropagation.
This post connects two questions, "what is the chain rule?" and "why do we need topological sort?", starting from high school calculus and working all the way up to the backward() function in microgpt.py.
The Central Question of Deep Learning
Training a neural network means this:
- Feed an input and compute the output (forward pass)
- Measure how far the output is from the correct answer (loss)
- Compute how much each parameter contributed to the loss (gradient)
- Adjust each parameter slightly in the direction that reduces the loss (update)
Computing the gradients (the third step) is the hard part. Whether there are 4,192 parameters (microgpt.py) or 70 billion (LLaMA), for each one you need to answer: "if I nudge this parameter slightly, how much does the loss change?"
The algorithm that does this efficiently is backpropagation. And the mathematical foundation of backpropagation is the chain rule.
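Before any machinery, the four steps fit in a few lines for a toy one-parameter model (illustrative numbers only, not microgpt.py):
# The four-step loop on the smallest possible "model": one parameter w,
# learning to map input x = 2.0 to target 6.0.
w, x, target, lr = 0.0, 2.0, 6.0, 0.1
for step in range(50):
    y = w * x                       # 1. forward pass
    loss = (y - target) ** 2        # 2. loss
    grad_w = 2 * (y - target) * x   # 3. dloss/dw via the chain rule
    w -= lr * grad_w                # 4. update
print(w)  # converges to 3.0, since 3.0 * 2.0 = 6.0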
Chain Rule: Differentiating Composite Functions
We start from high school calculus.
The derivative of f(x) = x^2 is f'(x) = 2x. When x is 3, f'(3) = 6. This means "if you increase x slightly from 3, f increases by roughly 6 times that amount."
But what happens when functions are nested?
Given y = f(g(x)), we want to know how much y changes when x changes.
The chain rule says:
dy/dx = (dy/dg) * (dg/dx)
In words: multiply "the rate at which y changes with respect to g" by "the rate at which g changes with respect to x" to get "the rate at which y changes with respect to x."
A concrete example:
g(x) = 3x + 1
f(g) = g^2
y = f(g(x)) = (3x + 1)^2
Applying the chain rule:
dg/dx = 3 (g changes with slope 3 relative to x)
df/dg = 2g (f changes with slope 2g relative to g)
dy/dx = df/dg * dg/dx = 2g * 3 = 2(3x+1) * 3 = 6(3x+1)
When x = 1: g = 4, dy/dx = 6 * 4 = 24.
Verification: y(1) = 16, y(1.001) = (4.003)^2 = 16.024009. Change = 0.024009, rate of change = 24.009. Correct.
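A quick numerical check in Python (a throwaway snippet, not part of microgpt.py) reproduces the same number:
# Finite-difference check of dy/dx = 6(3x + 1) at x = 1.
def y(x):
    return (3 * x + 1) ** 2

x, eps = 1.0, 1e-3
numeric = (y(x + eps) - y(x)) / eps  # finite-difference estimate
analytic = 6 * (3 * x + 1)           # chain-rule result
print(numeric, analytic)             # ~24.009 vs 24.0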
Why the Chain Rule Is Central to Deep Learning
A neural network is a chain of composite functions. Tracing the forward pass of microgpt.py:
input token
-> embedding lookup (matrix indexing)
-> add positional embedding
-> RMSNorm
-> Q, K, V linear transformations
-> attention score computation
-> Softmax
-> weighted sum
-> output linear transformation
-> residual connection
-> MLP (linear -> ReLU -> linear)
-> lm_head linear transformation
-> Softmax
-> -log(probability of the correct token)
= loss
The loss is a deeply nested composite function of the parameters. To compute the gradient of a specific parameter w (dloss/dw), the chain rule must be applied repeatedly along this long chain.
Doing this by hand? You would need to re-trace the entire chain for each parameter. With 4,192 parameters, that means 4,192 times.
The key insight of backpropagation is: a single backward traversal can compute the gradients of all parameters simultaneously.
Computation Graphs: Visualizing Composite Functions
To apply the chain rule automatically, the computation must first be recorded as a graph.
The Value class in microgpt.py does exactly this. For example:
a = Value(2.0)
b = Value(3.0)
c = a * b # c = 6.0, children: (a, b), local_grads: (3.0, 2.0)
d = c + a # d = 8.0, children: (c, a), local_grads: (1, 1)
This produces the following graph:
a(2.0) ---*---> c(6.0) ---+---> d(8.0)
b(3.0) --/                |
a(2.0) -------------------/
Each node records two things:
- Children: which values it was computed from
- Local gradients (local_grads): the partial derivative with respect to each child
For multiplication c = a * b:
dc/da = b = 3.0 (nudge a slightly, and c changes by 3 times that amount)
dc/db = a = 2.0 (nudge b slightly, and c changes by 2 times that amount)
For addition d = c + a:
dd/dc = 1 (increase c by 1, d increases by 1)
dd/da = 1 (increase a by 1, d increases by 1)
These local gradients are computed immediately during the forward pass. Rather than applying a separate differentiation formula later, the gradient is stored at the moment the operation is performed.
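As a rough sketch of how this recording can work (simplified in the micrograd style; microgpt.py's actual Value class supports many more operations), multiplication and addition might look like this:
class Value:
    """A scalar that remembers how it was computed (simplified sketch)."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # the Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __mul__(self, other):
        # c = a * b: dc/da = b.data, dc/db = a.data
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __add__(self, other):
        # d = c + a: both local gradients are 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))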
Backpropagation: Walking the Graph Backwards
Now let us compute the gradient of d with respect to a (dd/da).
By the chain rule:
dd/da = dd/dc * dc/da + dd/da(direct)
= 1 * 3 + 1
= 4
Why are there two terms? Because a influences d through two paths:
- a -> c -> d (through multiplication): contribution = 1 * 3 = 3
- a -> d (direct addition): contribution = 1
The contributions from both paths are summed (gradient accumulation). This is why backpropagation uses child.grad += local_grad * v.grad with += rather than =.
Verification: d(2.0, 3.0) = 2*3 + 2 = 8. d(2.001, 3.0) = 2.001*3 + 2.001 = 8.004. Rate of change = 4. Correct.
Tracing the backpropagation order:
Step 0: d.grad = 1 (starting point: dd/dd = 1)
Step 1: propagate to d's children
c.grad += 1 * 1 = 1 (local_grad=1, d.grad=1)
a.grad += 1 * 1 = 1 (local_grad=1, d.grad=1)
Step 2: propagate to c's children
a.grad += 3 * 1 = 3 (local_grad=3, c.grad=1)
b.grad += 2 * 1 = 2 (local_grad=2, c.grad=1)
Result: a.grad = 1 + 3 = 4, b.grad = 2
This is all there is to backpropagation. Starting from the output, each node's gradient is propagated to its children by multiplying with the local gradient.
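Using the Value sketch from the previous section, the same trace can be reproduced by hand (the general backward() method that automates this comes next):
a, b = Value(2.0), Value(3.0)
c = a * b
d = c + a

d.grad = 1.0                          # Step 0: dd/dd = 1
for child, lg in zip(d._children, d._local_grads):
    child.grad += lg * d.grad         # Step 1: c.grad = 1, a.grad = 1
for child, lg in zip(c._children, c._local_grads):
    child.grad += lg * c.grad         # Step 2: a.grad += 3, b.grad += 2

print(a.grad, b.grad)                 # 4.0 2.0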
Topological Sort: Why It Is Necessary
In the example above, for backpropagation to work correctly, d must be processed before c. If we propagate to c's children (a, b) before c.grad is finalized, we propagate incorrect gradients.
The simple rule: to propagate a node's gradient to its children, all parent nodes pointing to that node must have already been processed.
This is what topological sort guarantees.
Topological sort means: given a directed acyclic graph (DAG), ordering the nodes in a linear sequence such that for every edge u -> v, u appears before v.
The implementation in microgpt.py:
def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)  # visit the children first
            topo.append(v)         # append self only after all children are done
    build_topo(self)
    # topo = [a, b, c, d] (children first, parents last)
    # reversed(topo) = [d, c, b, a] (parents first, children last)
This is a depth-first search (DFS) post-order traversal. The algorithm:
- Mark the current node as visited
- Recursively visit all child nodes first
- After all children are done, append the current node to the list
Result: topo = [a, b, c, d]. Children first, parents last.
Reversing this gives: [d, c, b, a]. Parents first, children last. Propagating gradients in this order guarantees that every node's gradient is finalized before being passed to its children.
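Run on the d = a*b + a graph from earlier, the same DFS post-order produces exactly this ordering (a standalone sketch reusing the Value class from before):
def build_topo(root):
    topo, visited = [], set()
    def visit(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                visit(child)
            topo.append(v)
    visit(root)
    return topo

a, b = Value(2.0), Value(3.0)
d = a * b + a
order = build_topo(d)
print([v.data for v in order])            # [2.0, 3.0, 6.0, 8.0]  (children first)
print([v.data for v in reversed(order)])  # [8.0, 6.0, 3.0, 2.0]  (parents first)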
What Happens Without Topological Sort
What if we run backpropagation in an arbitrary order without topological sort?
Wrong order: [c, d, a, b]
Step 1: process c first
c.grad = 0 (hasn't received anything from d yet!)
a.grad += 3 * 0 = 0 (wrong value)
b.grad += 2 * 0 = 0 (wrong value)
Step 2: process d
c.grad += 1 * 1 = 1 (c.grad is set only now)
a.grad += 1 * 1 = 1
Result: a.grad = 0 + 1 = 1 (correct answer is 4), b.grad = 0 (correct answer is 2)
Because c was processed before d, gradients were propagated to c's children while c.grad was still 0. The contribution from the d -> c path was completely lost.
Topological sort prevents this problem at its root. It guarantees the order "process children only after all parents have been processed."
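The failure is easy to reproduce with the Value sketch from earlier:
a, b = Value(2.0), Value(3.0)
c = a * b
d = c + a
d.grad = 1.0  # seed

for v in [c, d, a, b]:  # wrong order: c is handled while c.grad is still 0
    for child, lg in zip(v._children, v._local_grads):
        child.grad += lg * v.grad

print(a.grad, b.grad)  # 1.0 0.0 (should be 4.0 and 2.0)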
A More Complex Example: Actual Computation in microgpt.py
Here is a simplified view of microgpt.py learning the name "ab":
tokens: [BOS, a, b, BOS]
Position 0: input=BOS, target=a
emb = wte[26] + wpe[0]            # embedding addition
x = rmsnorm(emb)                  # normalization
q, k, v = linear(x, Wq/Wk/Wv)     # QKV projections
attn_out = attention(q, k, v)     # attention (over itself only)
x = attn_out + emb                # residual connection
mlp_out = relu(linear(x, W1))     # first half of the MLP
x = linear(mlp_out, W2) + x       # second half of the MLP + residual
logits = linear(x, lm_head)       # output logits
probs = softmax(logits)           # probability distribution
loss_0 = -log(probs[a])           # negative log of the probability of a
This computation creates roughly 7,000 Value objects. Across this computation graph of 7,000 nodes, the chain rule must be applied from loss_0 back to every parameter.
To compute the gradient of wte[26][0] (the first embedding dimension of the BOS token):
dloss/d(wte[26][0])
= dloss/dprobs * dprobs/dlogits * dlogits/dx * dx/d(attn_out+emb)
  * d(attn_out)/dq * dq/dx * dx/d(emb) * demb/d(wte[26][0])
Computing this by hand makes mistakes inevitable. Automating it with topological sort and reverse traversal computes the gradients of all parameters exactly in a single pass through 7,000 nodes.
Time complexity: O(number of nodes). If the forward pass creates N nodes, the backward pass is also O(N). A single backward traversal suffices regardless of the number of parameters.
The Same Thing Happens in PyTorch
The Value class in microgpt.py operates on the same principle as PyTorch's autograd. The only difference is scale.
microgpt.py:
a = Value(2.0)
b = Value(3.0)
c = a * b
c.backward() # a.grad = 3.0, b.grad = 2.0
PyTorch:
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b
c.backward() # a.grad = 3.0, b.grad = 2.0
The results are identical. Where PyTorch differs:
- Tensors: operates on multidimensional arrays instead of scalars. A single matrix multiplication replaces thousands of scalar multiplications.
- GPU kernels: tensor operations run in parallel across thousands of GPU cores.
- Fused kernels: multiple operations are combined into a single kernel to reduce memory round-trips.
- Graph optimizations: unnecessary intermediate nodes are eliminated or the computation order is rearranged.
But the core algorithm is the same: build computation graph -> topological sort -> apply chain rule in reverse.
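One detail carries over directly: PyTorch's .grad also accumulates across backward() calls, mirroring the += in the Value class, which is why training loops zero the gradients each step. A quick check:
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

(a * b).backward()
print(a.grad)  # tensor(3.)

(a * b).backward()  # a fresh graph over the same leaves
print(a.grad)  # tensor(6.), i.e. 3 + 3, accumulated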
Why Backpropagation? (Are There No Alternatives?)
There are other ways to compute gradients.
Numerical differentiation: nudge each parameter by a tiny amount and observe the change in loss.
dloss/dw ≈ (loss(w + epsilon) - loss(w)) / epsilon
The problem: if there are N parameters, you need N+1 forward passes. For microgpt.py's 4,192 parameters, that means 4,193 forward passes. For GPT-2's 124M parameters, 124 million forward passes. Infeasible.
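In code, the cost is plain to see (a sketch; loss_fn is a hypothetical stand-in for a full forward pass that returns the scalar loss):
def numerical_grads(loss_fn, params, eps=1e-5):
    base = loss_fn(params)                            # 1 forward pass
    grads = []
    for i in range(len(params)):                      # + N more forward passes
        bumped = list(params)
        bumped[i] += eps
        grads.append((loss_fn(bumped) - base) / eps)
    return grads

# e.g. for loss(w) = w0^2 + 3*w1 at w = [2, 5]:
print(numerical_grads(lambda w: w[0] ** 2 + 3 * w[1], [2.0, 5.0]))  # ~[4.0, 3.0]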
Forward-mode AD: applies the chain rule in the forward direction. Computes the gradient of all intermediate values with respect to a single input parameter in one forward pass.
The problem: you can only compute gradients with respect to one parameter at a time. N parameters means N forward passes.
Reverse-mode AD (backpropagation): applies the chain rule in the reverse direction. Computes the gradients of all parameters with respect to a single output (loss) in one backward pass.
In deep learning, there is one output (loss) and many inputs (parameters), so the reverse direction is overwhelmingly more efficient. One forward pass + one backward pass = all gradients.
This is why backpropagation is the cornerstone of deep learning. Whether there are 10 parameters or 10 billion, one forward pass + one backward pass is all it takes.
Revisiting backward() in microgpt.py
Now we return to the original code:
def backward(self):
    # 1. topological sort: children before parents
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    # 2. seed: dloss/dloss = 1
    self.grad = 1
    # 3. reverse traversal: apply the chain rule in parent -> child order
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad
What each part means:
build_topo: creates the topological sort via DFS post-order traversal. Children are added to the list first, parents last.
self.grad = 1: the starting point of backpropagation. It begins from the self-evident fact that "the gradient of loss with respect to loss is 1."
reversed(topo): reverses the topological sort to traverse in parent -> child order. This guarantees that every node's gradient is finalized before being passed to its children.
child.grad += local_grad * v.grad: the core of the chain rule. local_grad is "the local gradient of this operation" (precomputed during the forward pass), and v.grad is "the accumulated gradient from loss to this node" (propagated during the backward pass). Multiplying these gives "the gradient contribution from loss to the child." The += accumulates contributions from multiple paths.
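Assuming this backward() is attached as a method of the Value sketch from earlier, the hand-worked example now runs end to end:
a, b = Value(2.0), Value(3.0)
d = a * b + a
d.backward()
print(a.grad, b.grad)  # 4.0 2.0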
15 lines. This is everything that makes learning in deep learning possible.
Key Takeaways
- The chain rule turns the derivative of a deeply nested function into a product of local derivatives.
- During the forward pass, each Value records its children and the local gradient with respect to each of them.
- Backpropagation walks the graph in reverse, multiplying each node's accumulated gradient by the local gradients and summing contributions from multiple paths (hence +=).
- Topological sort (DFS post-order, then reversed) guarantees a node's gradient is complete before it is propagated to its children.
- Reverse-mode AD delivers the gradients of all parameters with one forward pass and one backward pass, regardless of parameter count.
References
- Karpathy, "microgpt.py." GitHub Gist, 2025.
- Karpathy, "micrograd: A tiny scalar-valued autograd engine." GitHub, 2020.
- Rumelhart, Hinton, Williams, "Learning representations by back-propagating errors." Nature, 1986.
- Baydin et al., "Automatic Differentiation in Machine Learning: a Survey." JMLR, 2018.