Gradient Descent

Gradient descent is the optimisation algorithm used to train neural networks. After backpropagation computes ∂L/∂w (the rate at which the loss function L changes with respect to each weight w), gradient descent updates every weight by subtracting a small fraction of the gradient:

w ← w − lr × ∂L/∂w

where lr is the learning rate — a hyperparameter controlling step size.
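
As a minimal sketch of the update rule above, the following plain-Python function applies one step to a list of weights. The names gd_step, weights, and grads are illustrative and do not come from any library, and the gradients are assumed to have been computed already.

```python
def gd_step(weights, grads, lr):
    """Return new weights after one gradient descent step: w <- w - lr * dL/dw."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Hypothetical example: L(w) = w0^2 + w1^2, so dL/dw = [2*w0, 2*w1].
weights = [1.0, -2.0]
grads = [2.0 * w for w in weights]
new_weights = gd_step(weights, grads, lr=0.1)  # each weight moves toward 0
```

With lr = 0.1 each weight shrinks by 20% per step here, so both move toward the minimum at the origin.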


Why It Works

The gradient ∂L/∂w points in the direction of steepest increase of L, so subtracting a small multiple of it moves w toward lower loss. Repeating this over many steps descends the high-dimensional loss landscape toward a minimum, which for a non-convex loss need not be the global one.
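
One way to make this precise is a first-order Taylor expansion of the loss around the current weights. The sketch below writes g for the gradient ∂L/∂w, and assumes L is differentiable and lr is small enough for the higher-order term to be negligible.

```latex
\begin{aligned}
L(w - \mathrm{lr}\, g) &= L(w) - \mathrm{lr}\, g^\top g + O(\mathrm{lr}^2) \\
                       &= L(w) - \mathrm{lr}\, \lVert g \rVert^2 + O(\mathrm{lr}^2)
\end{aligned}
```

Since ‖g‖² is never negative, the loss does not increase to first order, and it strictly decreases whenever g ≠ 0.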


Variants

| Variant | Description |
| --- | --- |
| Batch gradient descent | Gradient over the full dataset; accurate but slow |
| Stochastic GD (SGD) | Gradient over a single sample; fast but noisy |
| Mini-batch SGD | Gradient over a small batch (e.g., 32–512 samples); standard in practice |
| Adam | Adaptive per-parameter learning rates with momentum; dominant in deep learning |
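
The first three variants differ only in how many samples contribute to each gradient estimate. The sketch below illustrates this on a toy linear regression in plain Python. The function minibatch_sgd, its default values, and the synthetic data are all assumptions made for illustration, and Adam is not shown.

```python
import random


def minibatch_sgd(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x + b - y)^2), estimated from this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Synthetic, noise-free data from y = 2x + 1 (made up for illustration).
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Setting batch_size=len(xs) turns this into batch gradient descent, and batch_size=1 gives single-sample SGD.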

Learning Rate

A learning rate that is too high causes the optimiser to overshoot and possibly diverge; one that is too low makes convergence slow. Learning rate schedules (such as warm-up followed by cosine decay) are standard in large-model training.
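
A schedule of this kind can be sketched as a function from step number to learning rate. The version below assumes a linear warm-up followed by cosine decay to zero. The name lr_at_step and all default values are hypothetical, and real training setups vary in the exact shape.

```python
import math


def lr_at_step(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Linear warm-up to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        # Ramp up linearly so early updates stay small.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The returned value would replace the fixed lr in the update rule at each step.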


Relationship to Backpropagation

Backpropagation computes the gradients; gradient descent uses them. Together they form the complete training loop for neural networks. Automatic differentiation in libraries like PyTorch makes both steps nearly invisible: call loss.backward(), then optimizer.step().
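
The division of labour can be seen in a minimal PyTorch loop. This is a sketch on made-up data, assuming PyTorch is installed. The model, learning rate, and step count are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)

# Synthetic data from y = 2x + 1 (made up for illustration).
x = torch.linspace(0, 1, 32).unsqueeze(1)
y = 2 * x + 1

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

losses = []
for _ in range(200):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # backpropagation: fills .grad on each parameter
    optimizer.step()             # gradient descent: updates parameters from .grad
    losses.append(loss.item())
```

The zero_grad() call is needed because PyTorch accumulates gradients across backward() calls by default.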

