Gradient Descent

Gradient descent is the optimisation algorithm used to train neural networks. After backpropagation computes ∂L/∂w (the rate at which the loss function L changes with respect to each weight w), gradient descent updates every weight by subtracting a small multiple of the gradient:

w ← w − lr × ∂L/∂w

where lr is the learning rate — a hyperparameter controlling step size.
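
The update rule can be sketched in a few lines of Python. This is a minimal illustration, assuming a toy one-weight loss L(w) = (w − 3)² picked only because its derivative is easy to write by hand; the names `loss`, `grad`, and `gradient_step` are hypothetical and do not come from any library.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (an assumption for illustration)."""
    return (w - 3.0) ** 2


def grad(w):
    """Analytic derivative dL/dw of the toy loss."""
    return 2.0 * (w - 3.0)


def gradient_step(w, lr):
    """Apply one update: w <- w - lr * dL/dw."""
    return w - lr * grad(w)


w = 0.0   # arbitrary starting weight
lr = 0.1  # learning rate
for _ in range(50):
    w = gradient_step(w, lr)
# After the loop, w sits very close to the minimiser w = 3.
```

In a real network the derivative is not written by hand; backpropagation supplies it for every weight at once.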

Why It Works

The gradient ∂L/∂w points in the direction of steepest increase of L. Subtracting a sufficiently small multiple of it therefore moves w toward lower loss. Repeating this over many steps descends the high-dimensional loss landscape toward a (local) minimum.
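
A quick numerical check can make this concrete. The sketch below assumes a toy two-weight loss L = w0² + 10·w1² (chosen for illustration, not taken from the source) and compares a small step against the gradient with a small step along it. It also compares the actual change in loss with the first-order prediction −step × ‖g‖².

```python
def loss(w):
    """Toy two-weight loss (an assumption): L = w0^2 + 10 * w1^2."""
    return w[0] ** 2 + 10.0 * w[1] ** 2


def grad(w):
    """Analytic gradient [dL/dw0, dL/dw1] of the toy loss."""
    return [2.0 * w[0], 20.0 * w[1]]


w = [1.0, 1.0]
g = grad(w)
step = 0.001

downhill = [wi - step * gi for wi, gi in zip(w, g)]  # move against the gradient
uphill = [wi + step * gi for wi, gi in zip(w, g)]    # move along the gradient

loss_here = loss(w)
loss_down = loss(downhill)
loss_up = loss(uphill)

# First-order (linear) prediction of the change in loss for the downhill step.
predicted_change = -step * sum(gi * gi for gi in g)
actual_change = loss_down - loss_here
```

The downhill step lowers the loss and the uphill step raises it. The prediction and the actual change agree closely only because the step is small, which is the reason the update uses a small learning rate.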

Variants

Variant	Description
Batch gradient descent	Gradient over the full dataset; accurate but slow
Stochastic GD (SGD)	Gradient over a single sample; fast but noisy
Mini-batch SGD	Gradient over a small batch (e.g., 32–512 samples); standard in practice
Adam	Adaptive per-parameter learning rates with momentum; dominant in deep learning
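
The first three rows differ only in how many samples feed each gradient estimate. The sketch below is one way to show that, assuming a noise-free line y = 2x + 1 as hypothetical data and a mean-squared-error loss; `make_data`, `batch_gradient`, and `train` are illustrative names. Adam is left out because it changes the update rule itself, not the batch size.

```python
import random


def make_data(n=200, seed=0):
    """Hypothetical dataset: points on the noise-free line y = 2x + 1."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 for x in xs]
    return xs, ys


def batch_gradient(w, b, xs, ys):
    """Gradient of mean squared error over the given samples only."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def train(xs, ys, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y = w*x + b; batch_size selects the gradient descent variant."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            bx = [xs[i] for i in batch]
            by = [ys[i] for i in batch]
            gw, gb = batch_gradient(w, b, bx, by)
            w -= lr * gw
            b -= lr * gb
    return w, b


xs, ys = make_data()
w_full, b_full = train(xs, ys, batch_size=len(xs))  # batch gradient descent
w_sgd, b_sgd = train(xs, ys, batch_size=1)          # stochastic GD
w_mini, b_mini = train(xs, ys, batch_size=32)       # mini-batch SGD
```

All three runs recover roughly w = 2 and b = 1 on this easy problem. What differs is the number of updates per pass over the data: one for the full batch, 200 for single samples, and 7 for batches of 32.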

Learning Rate

A learning rate that is too high makes the optimiser overshoot the minimum, so the loss oscillates or diverges. One that is too low makes convergence slow. Learning rate schedules (warm-up, cosine decay) are standard in large-model training.
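
Both effects show up even on a toy problem. The sketch below assumes the loss L(w) = w², for which each update multiplies w by (1 − 2·lr). It also includes one common shape for a warm-up plus cosine-decay schedule. The function names and the specific step counts are illustrative assumptions, not a reference implementation.

```python
import math


def run(lr, steps=50, w=1.0):
    """Gradient descent on the toy loss L(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


w_too_high = run(lr=1.1)   # each step multiplies w by -1.2, so |w| grows
w_too_low = run(lr=0.001)  # each step multiplies w by 0.998, so progress is slow
w_moderate = run(lr=0.3)   # each step multiplies w by 0.4, so w shrinks quickly


def warmup_cosine(step, total_steps, warmup_steps, peak_lr):
    """Linear warm-up to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Learning rate for each of 100 steps: ramps up for 10 steps, then decays.
schedule = [
    warmup_cosine(s, total_steps=100, warmup_steps=10, peak_lr=0.1)
    for s in range(100)
]
```

The schedule starts small while early gradients are unreliable, reaches its peak at the end of warm-up, and then shrinks so that late updates become fine adjustments.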

Relationship to Backpropagation

Backpropagation computes the gradients; gradient descent uses them. Together they form the training loop for neural networks. Automatic differentiation in libraries like PyTorch makes both steps nearly invisible: call .backward(), then optimizer.step().
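
To show where the two steps separate, here is a minimal scalar autodiff sketch in the spirit of the Value objects mentioned under Sources. It is not that project's code: the `Value` class below is a stripped-down, hypothetical version that supports only addition and multiplication, and the one-weight fitting problem is an assumption for illustration.

```python
class Value:
    """Scalar that records how it was computed so gradients can flow backward."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        """Backpropagation: apply the chain rule in reverse topological order."""
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


# Fit a single weight w so that w * x matches 6, with loss = (w*x - 6)^2.
w = Value(0.0)
x, neg_target = Value(2.0), Value(-6.0)
lr = 0.05
for _ in range(100):
    err = w * x + neg_target  # forward pass
    loss = err * err
    w.grad = 0.0              # clear the gradient left over from the last step
    loss.backward()           # backpropagation fills in w.grad
    w.data -= lr * w.grad     # gradient descent consumes w.grad
```

The last three lines of the loop correspond roughly to optimizer.zero_grad(), loss.backward(), and optimizer.step() in PyTorch. Only w is updated here; the gradients that accumulate on x and neg_target are ignored.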

Sources

karpathy-2022-micrograd-backpropagation — implements gradient descent manually on the Value.grad attributes
