Karpathy 2022 — Micrograd: Building Backpropagation from Scratch
Source type: Lecture transcript (YouTube: Neural Networks: Zero to Hero)
Author: andrej-karpathy
Year: 2022
Raw file: raw/articles/Neural_network.docx
Summary
andrej-karpathy (previously at OpenAI and Tesla) walks through building micrograd, a scalar-valued automatic-differentiation engine he released on GitHub approximately two years prior to the lecture. The lecture starts from a blank Jupyter notebook and ends with a fully functional multi-layer-perceptron trained by backpropagation.
The central claim is that micrograd is all you need to understand neural network training. Everything else — PyTorch, JAX — is efficiency on top of the same mathematics.
Key Ideas
The Value Object and Expression Graphs
The building block of micrograd is a Value object that wraps a scalar number. When you add, multiply, or apply functions to Value objects, micrograd silently builds a computational-graph tracking every operation and its children. This graph records not just the result of the computation (forward pass) but also enough information to run backpropagation in reverse.
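A minimal sketch of how such a graph-recording object might look. The attribute names `_prev` and `_op` follow micrograd's conventions as best recalled, and this stripped-down class is an illustration rather than the library's actual source:

```python
class Value:
    """Wraps one scalar and remembers how it was produced."""

    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self._prev = set(_children)  # the Values this one was computed from
        self._op = _op               # the operation that produced this Value

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), "+")

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), "*")

    def __repr__(self):
        return f"Value(data={self.data})"


a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b + c               # the forward pass builds the graph as a side effect
print(d)                    # Value(data=4.0)
print(d._op, len(d._prev))  # + 2
```

Every result keeps references to its operands, so the whole expression can later be walked backward from `d`.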
Forward and Backward Pass
During the forward pass, the expression graph is evaluated from inputs to output to produce a result (e.g., a loss value g). During the backward pass, calling .backward() on g triggers recursive application of the chain-rule through the graph. Each node accumulates .grad, the derivative of the final output with respect to that node. For inputs a and b, .grad gives the rate at which g changes in response to a small nudge to that input.
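The "nudge" reading of .grad can be checked numerically with a finite difference. In the sketch below, the expression `g` and the helper `numerical_grad` are hypothetical names chosen for illustration; they are not part of micrograd:

```python
def g(a, b):
    # Hypothetical example expression; any differentiable function works.
    return a * b + b ** 2


def numerical_grad(f, args, i, h=1e-6):
    """Estimate df/d(args[i]) by nudging one input by a small step h."""
    bumped = list(args)
    bumped[i] += h
    return (f(*bumped) - f(*args)) / h


a, b = 2.0, -3.0
dg_da = numerical_grad(g, (a, b), 0)  # analytic value: b = -3
dg_db = numerical_grad(g, (a, b), 1)  # analytic value: a + 2b = -4
print(dg_da, dg_db)
```

Backpropagation computes the same quantities exactly and for all inputs in a single backward sweep, instead of one extra forward evaluation per input.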
Chain Rule and Local Gradients
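Each operation stores its own local gradient: the derivative of its output with respect to each of its inputs. Backpropagation multiplies each local gradient by the incoming gradient from downstream (out.grad) and accumulates the result into the corresponding input’s .grad. Accumulating rather than overwriting matters because a node used more than once must sum the contributions from every use. The update for each input is: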
input.grad += local_derivative × out.grad
Key local derivatives implemented in micrograd:
- Addition: both inputs receive out.grad × 1
- Multiplication: each input receives out.grad × other.data
- tanh: input receives out.grad × (1 - t²), where t = tanh(x)
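The three rules above can be sketched as closures attached to each result node. This is a simplified illustration in the spirit of micrograd, not its exact source; the `_backward()` calls are issued by hand, from the output back to the inputs, to make the order explicit:

```python
import math


class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # leaf nodes have nothing to propagate
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += 1.0 * out.grad   # local derivative of a sum is 1
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            self.grad += (1 - t ** 2) * out.grad  # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out


x, w, b = Value(0.5), Value(-2.0), Value(1.5)
xw = x * w       # -1.0
n = xw + b       # 0.5
o = n.tanh()

o.grad = 1.0     # d(o)/d(o) = 1 seeds the backward pass
o._backward()    # fills n.grad
n._backward()    # fills xw.grad and b.grad
xw._backward()   # fills x.grad and w.grad
print(x.grad, w.grad, b.grad)
```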
Topological Sort for Correct Ordering
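A naive implementation can call _backward() on a node before that node's own .grad has been fully accumulated, so it propagates an incomplete gradient. The fix is a topological sort: a depth-first traversal starting at the root (loss node) that orders the graph so that, during the backward pass, every node is processed only after all nodes that depend on it have been processed. Micrograd builds this list by recursively visiting children before appending self, then walks the list in reverse and calls _backward() on each node.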
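A minimal sketch of that ordering, assuming a stripped-down Value class with only + and *. The helper name `build` is an assumption made for this example. The input `a` feeds two different nodes, so its gradient is only correct if both consumers are processed before `a` itself:

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()

        def build(v):
            # Depth-first: append a node only after all of its children.
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0
        # Reversed order: the root first, each node after all of its consumers.
        for node in reversed(topo):
            node._backward()


a, b = Value(-2.0), Value(3.0)
d = a * b        # -6.0
e = a + b        #  1.0
f = d * e        # -6.0
f.backward()
print(a.grad, b.grad)  # -3.0 -8.0
```

By hand: df/da = b·e + d = 3·1 + (-6) = -3 and df/db = a·e + d = (-2)·1 + (-6) = -8, which matches the accumulated values.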
Neural Networks as Mathematical Expressions
A key insight of the lecture: neural networks are just mathematical expressions. They take input data and weights as inputs, compute predictions through layered matrix multiplications and nonlinearities, and produce a loss. Backpropagation is indifferent to whether the expression is a neural network or any other computation; it simply propagates gradients backward through an arbitrary graph.
Scalars vs. Tensors
Micrograd operates on individual scalars for pedagogical clarity. Production libraries (PyTorch, JAX) package scalars into tensors (multi-dimensional arrays) so that hardware parallelism can be exploited. The mathematics is identical; tensors are purely an efficiency mechanism.
The MLP Architecture
Built on top of micrograd’s autograd engine is a minimal multi-layer-perceptron library (nn.py, ~50 lines):
- Neuron: dot product of inputs and weights, plus bias, passed through tanh
- Layer: list of neurons operating in parallel
- MLP: sequence of layers with the output of each feeding the next
The entire neural network library sits atop ~100 lines of autograd code, demonstrating that the conceptual surface area of deep learning is small.
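The Neuron, Layer and MLP structure described above can be sketched as follows. The class names come from the summary; the constructor signatures, the `parameters()` helper, and the compact Value class bundled here to keep the example self-contained are assumptions made for illustration and may differ from micrograd's nn.py:

```python
import math
import random


class Value:
    """Compact scalar autograd node; this sketch supports only +, * and tanh."""

    def __init__(self, data, _children=()):
        self.data, self.grad = data, 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


class Neuron:
    def __init__(self, n_in):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_in)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        act = self.b
        for wi, xi in zip(self.w, x):
            act = act + wi * xi  # dot product of weights and inputs, plus bias
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]


class Layer:
    def __init__(self, n_in, n_out):
        self.neurons = [Neuron(n_in) for _ in range(n_out)]

    def __call__(self, x):
        return [n(x) for n in self.neurons]  # every neuron sees the same input

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]


class MLP:
    def __init__(self, n_in, n_outs):
        sizes = [n_in] + n_outs
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(n_outs))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)  # the output of each layer feeds the next
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]


random.seed(0)
model = MLP(3, [4, 4, 1])          # 3 inputs, two hidden layers of 4, 1 output
x = [Value(2.0), Value(3.0), Value(-1.0)]
out = model(x)[0]                  # forward pass
out.backward()                     # fills .grad on every weight and bias
n_params = len(model.parameters())
print(n_params)                    # 41
```

A training loop would typically compute a scalar loss from `out`, call `backward()` on the loss instead, and then nudge each parameter's `data` against its `grad`.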
Concepts Introduced
| Concept | Description |
|---|---|
| backpropagation | Recursive application of chain rule through a computational graph |
| automatic-differentiation | Software technique for computing exact derivatives of arbitrary programs |
| computational-graph | DAG of operations whose edges represent data dependencies |
| chain-rule | Calculus rule for composing derivatives through function composition |
| loss-function | Scalar metric measuring prediction error; the root node of the backward pass |
| gradient-descent | Iterative weight update by subtracting gradient scaled by a learning rate |
| multi-layer-perceptron | Stack of fully connected layers with nonlinear activations; the simplest deep network |
| neural-network | Mathematical expression mapping inputs to outputs via learnable weights |
Entities Mentioned
| Entity | Role |
|---|---|
| andrej-karpathy | Author / lecturer; creator of micrograd |
| micrograd | Scalar autograd engine built in the lecture |
| pytorch | Production deep learning library cited as the “production version” of micrograd |