Backpropagation

Backpropagation (short for “backward propagation of errors”) is the algorithm at the mathematical core of every modern deep learning library. It computes the gradient of a loss-function with respect to every weight in a neural-network by traversing the computational-graph in reverse and applying the chain-rule at each node.

What It Does

Given a scalar output L (the loss), backpropagation answers: “By how much does L change if I nudge weight w by a tiny amount?” The answer is ∂L/∂w, also written w.grad. Once these gradients are known, gradient-descent can update every weight in the direction that reduces L.

The Algorithm

Step 1 — Forward Pass

Evaluate the network’s mathematical expression from inputs to loss, recording every intermediate value in the computational-graph.

Step 2 — Seed the Root

Set L.grad = 1.0. This initialises the chain: the derivative of L with respect to itself is 1.

Step 3 — Topological Sort

Order all nodes so that every node appears before the nodes that depend on it. This guarantees each node’s incoming gradient is fully accumulated before it propagates further backward.

Step 4 — Apply Local Gradients

For each node out = f(a, b) in reverse topological order:

a.grad += (∂out/∂a) × out.grad
b.grad += (∂out/∂b) × out.grad

∂out/∂a is the local gradient — the derivative of this one operation in isolation. out.grad is the already-computed gradient flowing in from downstream. Their product, added to a.grad, implements the chain-rule.

Common Local Gradients

Operation	Local gradient for input `a`
`out = a + b`	1
`out = a × b`	`b`
`out = a^n`	`n × a^(n-1)`
`out = tanh(a)`	`1 − tanh(a)²`
`out = exp(a)`	`exp(a)`

Why Topological Sort Matters

If node C depends on A and B, C’s gradient must be computed before A’s and B’s. Topological sort (depth-first, appending self after all children) produces an ordering where the gradient is always fully formed before it is used.