AI ResearchKR

Backpropagation From Scratch: Chain Rule, Computation Graphs, and Topological Sort

How microgpt.py's 15-line backward() works. From high school calculus to chain rule, computation graphs, topological sort, and backpropagation.

The backward() function in microgpt.py is 15 lines long. But these 15 lines are a complete implementation of the core algorithm that underpins all of deep learning -- backpropagation.

This post traces the path from high school calculus to the backward() function in microgpt.py, answering two questions along the way: "what is the chain rule?" and "why do we need topological sort?"
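The actual code from microgpt.py is not reproduced here, but a scalar autograd engine in the micrograd style gives the shape of what such a backward() does: build a topological order of the computation graph, then apply the chain rule in reverse. Everything below (the `Value` class, its attribute names) is an illustrative sketch, not the post's code.

```python
# Sketch of a micrograd-style autograd node. Each Value records the
# nodes it was computed from and the local derivative with respect
# to each of them; backward() replays the graph in reverse.
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # nodes this value depends on
        self._local_grads = local_grads  # d(self)/d(child) per child

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Topological sort via DFS: a node is appended only after all
        # nodes it depends on, so reversing the list guarantees every
        # node's grad is final before it is propagated to its children.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)

        self.grad = 1.0  # d(loss)/d(loss) = 1
        for v in reversed(topo):
            for child, local in zip(v._children, v._local_grads):
                child.grad += v.grad * local  # chain rule
```

A quick check: for `loss = x*y + x` with `x = 2`, `y = 3`, calling `loss.backward()` gives `x.grad = y + 1 = 4` and `y.grad = x = 2`.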

The Central Question of Deep Learning

Training a neural network boils down to four steps:

  1. Feed an input and compute the output (forward pass)
  2. Measure how far the output is from the correct answer (loss)
  3. Compute how much each parameter contributed to the loss (gradient)
  4. Adjust each parameter slightly in the direction that reduces the loss (update)

Step 3 is the hard part. Whether there are 4,192 parameters (microgpt.py) or 70 billion (LLaMA), you need to compute "if I nudge this parameter slightly, how much does the loss change?" for each one.
