Backpropagation
Backpropagation (short for “backward propagation of errors”) is the algorithm at the mathematical core of every modern deep learning library. It computes the gradient of a loss-function with respect to every weight in a neural-network by traversing the computational-graph in reverse and applying the chain-rule at each node.
What It Does
Given a scalar output L (the loss), backpropagation answers: “By how much does L change if I nudge weight w by a tiny amount?” The answer is ∂L/∂w, also written w.grad. Once these gradients are known, gradient-descent can update every weight in the direction that reduces L.
The Algorithm
Step 1 — Forward Pass
Evaluate the network’s mathematical expression from inputs to loss, recording every intermediate value in the computational-graph.
Step 2 — Seed the Root
Set L.grad = 1.0. This initialises the chain: the derivative of L with respect to itself is 1.
Step 3 — Topological Sort
Order all nodes so that every node appears before the nodes that depend on it. This guarantees each node’s incoming gradient is fully accumulated before it propagates further backward.
Step 4 — Apply Local Gradients
For each node out = f(a, b) in reverse topological order:
a.grad += (∂out/∂a) × out.grad
b.grad += (∂out/∂b) × out.grad
∂out/∂a is the local gradient — the derivative of this one operation in isolation. out.grad is the already-computed gradient flowing in from downstream. Their product, added to a.grad, implements the chain-rule.
Common Local Gradients
| Operation | Local gradient for input a |
|---|---|
out = a + b | 1 |
out = a × b | b |
out = a^n | n × a^(n-1) |
out = tanh(a) | 1 − tanh(a)² |
out = exp(a) | exp(a) |
Why Topological Sort Matters
If node C depends on A and B, C’s gradient must be computed before A’s and B’s. Topological sort (depth-first, appending self after all children) produces an ordering where the gradient is always fully formed before it is used.
Backpropagation vs. Autograd
Backpropagation is the mathematical algorithm; automatic-differentiation is the software technique that implements it. Libraries like micrograd, pytorch, and JAX record operations during the forward pass and replay them in reverse to execute backpropagation automatically.
Relationship to Neural Networks
neural-networks are just computational-graphs with a specific structure (layers of matrix multiplications and nonlinearities). Backpropagation applies to any differentiable computation — the network architecture does not change the algorithm. This generality is what makes autograd engines reusable across architectures.
Sources
- karpathy-2022-micrograd-backpropagation — primary source; step-by-step implementation in micrograd