Gradient Descent
Gradient descent is the optimisation algorithm used to train neural networks. After backpropagation computes ∂L/∂w (the rate at which the loss function L changes with respect to each weight w), gradient descent updates every weight by subtracting a small multiple of the gradient:
w ← w − lr × ∂L/∂w
where lr is the learning rate — a hyperparameter controlling step size.
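As a minimal sketch of the update rule, the loop below applies it to a one-dimensional toy loss. The loss L(w) = (w − 3)², the starting point, and the function names are illustrative assumptions, not part of the source.

```python
def loss(w: float) -> float:
    """Toy loss with its minimum at w = 3 (assumed for illustration)."""
    return (w - 3.0) ** 2


def grad(w: float) -> float:
    """Analytic derivative dL/dw of the toy loss."""
    return 2.0 * (w - 3.0)


def gradient_descent(w: float, lr: float, steps: int) -> float:
    """Apply w <- w - lr * dL/dw for a fixed number of steps."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


w_final = gradient_descent(w=0.0, lr=0.1, steps=100)
print(w_final, loss(w_final))  # w_final ends up very close to 3
```

In a real network the gradient comes from backpropagation rather than a hand-written derivative, but the update line is the same.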
Why It Works
The gradient ∂L/∂w points in the direction of steepest increase of L, so subtracting it moves w toward lower loss, provided the step is small enough. Repeating this over many steps descends the high-dimensional loss landscape toward a (typically local) minimum.
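One way to make this precise is a first-order Taylor expansion. The sketch below assumes L is differentiable and that lr is small enough for higher-order terms to be negligible; g is shorthand introduced here for the gradient.

```latex
L(w - \mathrm{lr}\, g) \;\approx\; L(w) - \mathrm{lr}\, g^{\top} g
  \;=\; L(w) - \mathrm{lr}\, \lVert g \rVert^{2},
\qquad g = \frac{\partial L}{\partial w}
```

Since ‖g‖² is never negative, the approximation says the loss falls on each step whenever lr > 0 and the gradient is non-zero.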
Variants
| Variant | Description |
|---|---|
| Batch gradient descent | Gradient over the full dataset; exact but expensive per step |
| Stochastic GD (SGD) | Gradient over a single sample; cheap per step but noisy |
| Mini-batch SGD | Gradient over a small batch (e.g., 32–512 samples); standard in practice |
| Adam | Adaptive per-parameter learning rates with momentum; dominant in deep learning |
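The first three variants differ only in how many samples feed each gradient estimate. The sketch below illustrates this with a hand-derived gradient for a linear model on a synthetic dataset; the data, the model, and the names `batch_gradient` and `train` are assumptions made for the example.

```python
import random

random.seed(0)  # fixed seed so runs are repeatable

# Synthetic, noise-free data from y = 2x + 1 (hypothetical toy dataset).
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x + 1.0) for x in xs]


def batch_gradient(w, b, batch):
    """Gradient of mean squared error over `batch` with respect to (w, b)."""
    n = len(batch)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def train(batch_size, lr=0.1, epochs=200):
    """batch_size=len(data) is batch GD, 1 is SGD, anything between is mini-batch."""
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        random.shuffle(samples)  # new sample order each epoch
        for start in range(0, len(samples), batch_size):
            gw, gb = batch_gradient(w, b, samples[start:start + batch_size])
            w, b = w - lr * gw, b - lr * gb
    return w, b


w, b = train(batch_size=32)
print(w, b)  # should approach the true values 2 and 1
```

Adam is not shown here; it keeps extra per-parameter state and is usually taken from a library rather than written by hand.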
Learning Rate
Too high a learning rate causes the optimiser to overshoot and diverge. Too low causes slow convergence. Learning rate schedules (warm-up, cosine decay) are standard in large-model training.
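Both failure modes can be seen on the toy loss L(w) = w², where each update multiplies w by (1 − 2·lr). The sketch below also includes one possible warm-up plus cosine-decay schedule. The function names, the specific learning rates, and the exact shape of the schedule are illustrative assumptions; libraries differ in the details.

```python
import math


def run(lr, steps=20, w=1.0):
    """Gradient descent on L(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


print(abs(run(lr=1.5)))    # too high: |w| doubles on every step
print(abs(run(lr=0.001)))  # too low: w has barely moved from 1.0
print(abs(run(lr=0.3)))    # moderate: w is close to the minimum at 0


def warmup_cosine(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at a few points of a hypothetical 100-step run.
print([round(warmup_cosine(s, 100, 10, 1e-3), 6) for s in (0, 9, 55, 100)])
```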
Relationship to Backpropagation
Backpropagation computes the gradients; gradient descent uses them. Together with the forward pass that computes the loss, the two form the training loop for neural networks. Automatic differentiation in libraries like PyTorch makes both steps nearly invisible: call loss.backward(), then optimizer.step().
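A minimal PyTorch sketch of that loop, assuming a toy linear-regression problem (the data, model size, learning rate, and step count are all illustrative choices, not from the source):

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: y = 2x + 1.
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
y = 2.0 * x + 1.0

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for _ in range(500):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # backpropagation: fills .grad on each parameter
    optimizer.step()             # gradient descent: p <- p - lr * p.grad

print(loss.item())
```

The zero_grad() call matters because PyTorch accumulates gradients across backward() calls by default; without it, each step would use the sum of all gradients computed so far.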
Sources
- karpathy-2022-micrograd-backpropagation — implements gradient descent manually on the Value.grad attributes