Karpathy 2022 — Micrograd: Building Backpropagation from Scratch

Source type: Lecture transcript (YouTube: Neural Networks: Zero to Hero)
Author: andrej-karpathy
Year: 2022
Raw file: raw/articles/Neural_network.docx

Summary

andrej-karpathy, formerly of Tesla and OpenAI, walks through building micrograd, a scalar-valued automatic-differentiation engine he released on GitHub roughly two years before the lecture. The lecture starts from a blank Jupyter notebook and ends with a working multi-layer-perceptron trained by backpropagation.

The central claim is that micrograd is all you need to understand neural network training. Everything else — PyTorch, JAX — is efficiency on top of the same mathematics.

Key Ideas

The Value Object and Expression Graphs

The building block of micrograd is a Value object that wraps a scalar number. When you add, multiply, or apply functions to Value objects, micrograd silently builds a computational-graph that records every operation together with its operands (the node's children). This graph stores not only the result of the computation (the forward pass) but also enough information to run backpropagation through it in reverse.
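A minimal sketch of the idea, covering the forward pass only. The attribute names `_prev` and `_op` follow the convention used in micrograd, but this is an illustrative reduction, not the library's full class:

```python
class Value:
    """Wraps a scalar and remembers how it was produced (forward pass only)."""

    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self._prev = set(_children)  # the Value objects this node was computed from
        self._op = _op               # the operation that produced this node

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), "+")

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), "*")


a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c  # builds two new nodes: (a * b), then (a * b) + c

print(d, d._op, d._prev)
```

Nothing in the calling code mentions a graph; it is assembled as a side effect of ordinary arithmetic, and each result keeps references to the nodes it came from.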

Forward and Backward Pass

During the forward pass, the expression graph is evaluated from the inputs toward the output, producing a final value (e.g., a loss value g). During the backward pass, calling .backward() on g applies the chain-rule recursively through the graph, from g back to the inputs. Each node accumulates .grad, the derivative of the final output with respect to that node. For inputs a and b, .grad is the rate at which g changes when that input is nudged by a small amount.
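One way to build intuition for what .grad holds is to nudge an input by a small h and measure how the output moves. The sketch below uses plain floats and a made-up function g (an assumption for illustration, not an expression from the lecture) whose derivatives are easy to check by hand:

```python
def g(a, b):
    # Hypothetical example expression: dg/da = b, dg/db = a + 2b
    return a * b + b ** 2


a, b = 2.0, -3.0
h = 1e-6  # size of the nudge

# Finite-difference estimates of how g responds to a nudge in each input
dg_da = (g(a + h, b) - g(a, b)) / h
dg_db = (g(a, b + h) - g(a, b)) / h

print(dg_da)  # close to b = -3
print(dg_db)  # close to a + 2b = -4
```

Backpropagation computes these same quantities exactly and for every node in a single backward sweep, instead of re-evaluating the expression once per input.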

Chain Rule and Local Gradients

Each operation stores its own local gradient: the derivative of its output with respect to each of its inputs. Backpropagation multiplies each local gradient by the incoming gradient from downstream (out.grad) and accumulates the result into the corresponding input’s .grad. The process is:

input.grad += local_derivative × out.grad

Key local derivatives implemented in micrograd:

Addition: both inputs receive out.grad × 1
Multiplication: each input receives out.grad × other.data
tanh: input receives out.grad × (1 - t²) where t = tanh(x)
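The three rules above can be sketched as closures attached to each output node. This is a reduced illustration in the spirit of micrograd rather than its exact source, and the `_backward()` calls at the end are ordered by hand for this one small expression:

```python
import math


class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0                 # d(final output) / d(this node)
        self._prev = set(_children)
        self._backward = lambda: None   # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # Addition: local derivative is 1 for both inputs
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # Multiplication: local derivative is the other operand's value
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            # tanh: local derivative is 1 - tanh(x)^2
            self.grad += (1 - t ** 2) * out.grad

        out._backward = _backward
        return out


x, w, b = Value(0.5), Value(-2.0), Value(1.0)
xw = x * w
n = xw + b       # n.data == 0.0
o = n.tanh()     # o.data == 0.0

o.grad = 1.0     # d(o)/d(o) = 1 seeds the backward pass
o._backward()    # fills n.grad
n._backward()    # fills xw.grad and b.grad
xw._backward()   # fills x.grad and w.grad

print(x.grad, w.grad, b.grad)  # -2.0 0.5 1.0
```

The `+=` matters: a node that feeds into several operations receives one contribution from each of them, and those contributions must be summed rather than overwritten.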

Topological Sort for Correct Ordering

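A naive implementation may call _backward() on a node before that node's gradient has been fully accumulated. The fix is a topological sort: a depth-first traversal starting at the root (the loss node) produces a list in which every node appears after all of its children. Micrograd builds this list by recursively visiting a node's children before appending the node itself, then walks the list in reverse, so each node's _backward() runs only after every node that depends on it has been processed.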
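A sketch of how the ordering might be implemented, assuming a reduced Value class with only addition and multiplication. The helper name `build_topo` is illustrative. The example deliberately reuses the node a so that the ordering matters:

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()

        def build_topo(v):
            # Depth-first: append a node only after all of its children
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)

        build_topo(self)
        self.grad = 1.0
        # Reverse order: the root first, leaves last
        for node in reversed(topo):
            node._backward()


a = Value(3.0)
b = a + a   # a is used twice here
c = b * a   # and once more: c = 2a * a = 2a^2, so dc/da = 4a
c.backward()

print(c.data, b.grad, a.grad)  # 18.0 3.0 12.0
```

If b's `_backward()` ran before c's, b.grad would still be zero at that point and a would miss the contribution that flows through b.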

Neural Networks as Mathematical Expressions

A key insight of the lecture: neural networks are just mathematical expressions. They take input data and weights as inputs, compute predictions through layered matrix multiplications and nonlinearities, and produce a loss. Backpropagation is indifferent to whether the expression is a neural network or any other computation; it simply propagates gradients backward through an arbitrary graph.

Scalars vs. Tensors

Micrograd operates on individual scalars for pedagogical clarity. Production libraries (PyTorch, JAX) package scalars into tensors (multi-dimensional arrays) so that hardware parallelism can be exploited. The mathematics is identical; tensors are purely an efficiency mechanism.

The MLP Architecture

Built on top of micrograd’s autograd engine is a minimal multi-layer-perceptron library (nn.py, ~50 lines):

Neuron: dot product of inputs and weights, plus bias, passed through tanh
Layer: list of neurons operating in parallel
MLP: sequence of layers with the output of each feeding the next

The entire neural network library sits atop ~100 lines of autograd code, demonstrating that the conceptual surface area of deep learning is small.
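The Neuron, Layer, and MLP classes described above might look roughly like the following. This is a compact sketch, not the contents of nn.py: the Value class is cut down to the three operations the network needs, the weight initialisation range and the `parameters()` helper are assumptions, and the layer sizes and input are arbitrary example values.

```python
import math
import random


class Value:
    """Compact scalar autograd node (sketch; supports only +, * and tanh)."""

    def __init__(self, data, _children=()):
        self.data, self.grad = data, 0.0
        self._prev, self._backward = set(_children), lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


class Neuron:
    def __init__(self, n_in):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_in)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # Dot product of weights and inputs, plus bias, passed through tanh
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]


class Layer:
    def __init__(self, n_in, n_out):
        self.neurons = [Neuron(n_in) for _ in range(n_out)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]


class MLP:
    def __init__(self, n_in, n_outs):
        sizes = [n_in] + n_outs
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(n_outs))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)  # the output of each layer feeds the next
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]


random.seed(0)
model = MLP(3, [4, 4, 1])        # 3 inputs, two hidden layers of 4, 1 output
out = model([2.0, 3.0, -1.0])    # forward pass builds the whole expression graph
out.backward()                   # fills .grad on every weight and bias
print(len(model.parameters()))   # 41
```

The network classes contain no differentiation logic of their own; they only compose Value operations, and the gradients come from the graph those operations build.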

Concepts Introduced

Concept	Description
backpropagation	Recursive application of chain rule through a computational graph
automatic-differentiation	Software technique for computing exact derivatives of arbitrary programs
computational-graph	DAG of operations whose edges represent data dependencies
chain-rule	Calculus rule for composing derivatives through function composition
loss-function	Scalar metric measuring prediction error; the root node of the backward pass
gradient-descent	Iterative weight update by subtracting gradient scaled by a learning rate
multi-layer-perceptron	Stack of fully connected layers with nonlinear activations; the simplest deep network
neural-network	Mathematical expression mapping inputs to outputs via learnable weights
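To make the gradient-descent entry concrete, the sketch below fits a single weight to one made-up training example. The data point, learning rate, and step count are assumptions chosen for illustration, and the Value class is again a reduced stand-in for micrograd's:

```python
class Value:
    def __init__(self, data, _children=()):
        self.data, self.grad = data, 0.0
        self._prev, self._backward = set(_children), lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


w = Value(0.0)          # the single learnable weight
x, y = 2.0, 6.0         # hypothetical training example; the ideal weight is 3.0
learning_rate = 0.05

for step in range(50):
    err = w * x + (-y)              # prediction minus target
    loss = err * err                # squared error: the root of the backward pass
    w.grad = 0.0                    # reset, because gradients accumulate with +=
    loss.backward()
    w.data -= learning_rate * w.grad  # step against the gradient

print(w.data)  # approaches 3.0
```

Resetting w.grad before each backward pass is needed because `_backward()` accumulates into .grad; without the reset, each step would add the new gradient to the stale ones from earlier iterations.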

Entities Mentioned

Entity	Role
andrej-karpathy	Author / lecturer; creator of micrograd
micrograd	Scalar autograd engine built in the lecture
pytorch	Production deep learning library cited as the “production version” of micrograd
