Karpathy 2022 — Micrograd: Building Backpropagation from Scratch

Source type: Lecture transcript (YouTube: Neural Networks: Zero to Hero)
Author: andrej-karpathy
Year: 2022
Raw file: raw/articles/Neural_network.docx


Summary

andrej-karpathy, formerly of OpenAI and Tesla, walks through building micrograd, a scalar-valued automatic-differentiation engine he released on GitHub approximately two years before the lecture. The lecture starts from a blank Jupyter notebook and ends with a working multi-layer-perceptron trained by backpropagation.

The central claim is that micrograd is all you need to understand neural network training. Everything else — PyTorch, JAX — is efficiency on top of the same mathematics.


Key Ideas

The Value Object and Expression Graphs

The building block of micrograd is a Value object that wraps a scalar number. When you add, multiply, or apply functions to Value objects, micrograd silently builds a computational-graph: each resulting Value records the operation that produced it and the operand Values (its children). The graph therefore holds not just the result of the computation (forward pass) but also the information needed to propagate gradients backward through it.
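As a minimal sketch of this idea (not micrograd's actual source), the class below wraps a scalar and records, for each result, the operands and the operation that produced it. The attribute names `_prev` and `_op` are assumptions chosen for illustration, and gradients are deliberately left out.

```python
class Value:
    """Wraps a scalar and remembers how it was produced."""

    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self._prev = set(_children)  # the Value objects this one was computed from
        self._op = _op               # the operation that produced this node ("" for leaves)

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), "+")

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), "*")


a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c  # creates two new graph nodes: (a * b), then (a * b) + c

print(d, d._op, len(d._prev))
```

Walking `_prev` from `d` back to the leaves recovers the whole expression, which is what a backward pass needs.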

Forward and Backward Pass

During the forward pass, the expression graph is evaluated from inputs to output to produce a result (e.g., a loss value g). During the backward pass, calling .backward() on g applies the chain-rule recursively through the graph, from g back to the inputs. Each node accumulates .grad, the derivative of the final output with respect to that node. For inputs a and b, .grad is the local sensitivity of g: nudging a by a small amount h changes g by approximately a.grad × h.
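One way to see what .grad means is to estimate it numerically with a small nudge. The sketch below uses plain floats and a made-up expression g = a * b + c (the expression and the step size h are assumptions for illustration); the finite-difference slopes should closely match the analytic derivatives.

```python
def g(a, b, c):
    # A small example expression; analytically dg/da = b and dg/db = a.
    return a * b + c


a, b, c = 2.0, -3.0, 10.0
h = 1e-6  # size of the nudge

# Slope of g when only one input is nudged by h.
dg_da = (g(a + h, b, c) - g(a, b, c)) / h
dg_db = (g(a, b + h, c) - g(a, b, c)) / h

print(dg_da, dg_db)  # approximately -3.0 and 2.0
```

Backpropagation computes the same quantities exactly, and for every node at once, rather than one nudge per input.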

Chain Rule and Local Gradients

Each operation stores its own local gradient: the derivative of its output with respect to each of its inputs. Backpropagation multiplies each local gradient by the gradient arriving from downstream (out.grad) and accumulates the result into the corresponding input’s .grad. The update is:

input.grad += local_derivative × out.grad

Key local derivatives implemented in micrograd:

  • Addition: both inputs receive out.grad × 1
  • Multiplication: each input receives out.grad × other.data
  • tanh: input receives out.grad × (1 - t²) where t = tanh(x)
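The three rules above can be sketched as closures attached to each output node. This is an illustrative reduction rather than micrograd's exact code; here `_backward()` is called by hand, from the output back toward the inputs, to make the order of propagation explicit.

```python
import math


class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # Local derivative of a sum is 1 for both inputs.
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # Local derivative with respect to one factor is the other factor.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            # d/dx tanh(x) = 1 - tanh(x)^2
            self.grad += (1 - t ** 2) * out.grad

        out._backward = _backward
        return out


x = Value(0.5)
w = Value(-2.0)
b = Value(1.0)
xw = x * w
n = xw + b
o = n.tanh()

o.grad = 1.0     # d(o)/d(o) = 1 seeds the backward pass
o._backward()    # fills n.grad
n._backward()    # fills xw.grad and b.grad
xw._backward()   # fills x.grad and w.grad

print(x.grad, w.grad, b.grad)
```

The `+=` matters: a node that feeds into several operations receives one contribution per use, and those contributions must add up.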

Topological Sort for Correct Ordering

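A naive implementation may call _backward() on a node before that node's own .grad has been fully accumulated, which propagates an incomplete gradient. The fix is a topological sort: a depth-first traversal starting at the root (the loss node) that builds a list in which every node appears after all of its children. Micrograd builds this list by recursively visiting children before appending self, then calls _backward() on the nodes in reverse order, so each node is processed only after every node that depends on it.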
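A compact sketch of this ordering follows, assuming a reduced Value class that supports only + and *. The example expression c = a * a + a reuses a three times, so a.grad is only correct if every contribution is accumulated before a is reached in the reversed list.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()

        def build_topo(v):
            # Depth-first: append a node only after all of its children.
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)

        build_topo(self)
        self.grad = 1.0
        # Reverse order: the root first, each node after everything that uses it.
        for node in reversed(topo):
            node._backward()


a = Value(3.0)
b = a * a   # a is used twice here
c = b + a   # and once more here
c.backward()

print(a.grad)  # dc/da = 2a + 1
```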

Neural Networks as Mathematical Expressions

A key insight of the lecture: neural networks are just mathematical expressions. They take input data and weights as inputs, compute predictions through layered matrix multiplications and nonlinearities, and produce a loss. Backpropagation is indifferent to whether the expression is a neural network or any other computation; it simply propagates gradients backward through an arbitrary graph.

Scalars vs. Tensors

Micrograd operates on individual scalars for pedagogical clarity. Production libraries (PyTorch, JAX) package scalars into tensors (multi-dimensional arrays) so that hardware parallelism can be exploited. The mathematics is identical; tensors are purely an efficiency mechanism.

The MLP Architecture

Built on top of micrograd’s autograd engine is a minimal multi-layer-perceptron library (nn.py, ~50 lines):

  • Neuron: dot product of inputs and weights, plus bias, passed through tanh
  • Layer: list of neurons operating in parallel
  • MLP: sequence of layers with the output of each feeding the next

The entire neural network library sits atop ~100 lines of autograd code, demonstrating that the conceptual surface area of deep learning is small.
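To make the three classes concrete, here is a self-contained sketch in the spirit of the lecture, not a copy of nn.py. The Value class is a reduced stand-in for the autograd engine, and details such as uniform initialisation in [-1, 1], the `parameters()` helper, and the 3-4-4-1 layer sizes are assumptions chosen for illustration.

```python
import math
import random


class Value:
    """Reduced scalar autograd node: just enough for the classes below."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        act = self.b  # bias plus the dot product of weights and inputs
        for wi, xi in zip(self.w, x):
            act = act + wi * xi
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]


class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        return [n(x) for n in self.neurons]

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]


class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)  # output of each layer feeds the next
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]


random.seed(0)
model = MLP(3, [4, 4, 1])
x = [Value(2.0), Value(3.0), Value(-1.0)]
out = model(x)[0]
out.backward()  # every weight and bias now holds d(out)/d(parameter)
print(out.data, len(model.parameters()))
```

Because the network output is itself a Value, a single `backward()` call fills in gradients for all 41 parameters of this 3-4-4-1 network.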


Concepts Introduced

| Concept | Description |
| --- | --- |
| backpropagation | Recursive application of chain rule through a computational graph |
| automatic-differentiation | Software technique for computing exact derivatives of arbitrary programs |
| computational-graph | DAG of operations whose edges represent data dependencies |
| chain-rule | Calculus rule for composing derivatives through function composition |
| loss-function | Scalar metric measuring prediction error; the root node of the backward pass |
| gradient-descent | Iterative weight update by subtracting gradient scaled by a learning rate |
| multi-layer-perceptron | Stack of fully connected layers with nonlinear activations; the simplest deep network |
| neural-network | Mathematical expression mapping inputs to outputs via learnable weights |
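
The gradient-descent entry can be illustrated with a one-parameter example. This is a hedged sketch in plain Python with a hand-derived gradient; the data, learning rate, and step count are made up for illustration and do not come from the lecture.

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x, so the ideal weight is 2.0

w = 0.0    # initial guess
lr = 0.05  # learning rate

for step in range(100):
    # loss = sum((w*x - y)^2), so d(loss)/dw = sum(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad  # step against the gradient to reduce the loss

print(w)  # converges toward 2.0
```

In micrograd the gradient would come from calling .backward() on the loss instead of a hand-written formula; the update rule stays the same.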

Entities Mentioned

| Entity | Role |
| --- | --- |
| andrej-karpathy | Author / lecturer; creator of micrograd |
| micrograd | Scalar autograd engine built in the lecture |
| pytorch | Production deep learning library cited as the “production version” of micrograd |