Loss Function

A loss function (also called a cost function or objective function) measures how wrong a neural network’s predictions are. It reduces all prediction errors to a single scalar L. Training is the process of minimising L over the dataset by adjusting the network’s weights via gradient descent.


Role in Training

The loss is the root node of the computational graph. Backpropagation begins here:

  1. Set L.grad = 1.0
  2. Propagate gradients backward through every operation to every weight

Because the loss is a scalar, one backward pass yields ∂L/∂w for every parameter simultaneously, a key property of reverse-mode automatic differentiation.
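The two steps above can be sketched with a very small reverse-mode autodiff engine. This is an illustrative sketch only: the Value class, its fields, and the one-sample squared-error example are hypothetical names chosen here, not part of any particular library.

```python
class Value:
    """A scalar node in a computational graph (hypothetical minimal class)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Topologically order the graph so that each node is processed
        # only after every node that consumes it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                  # step 1: dL/dL = 1
        for node in reversed(order):     # step 2: chain rule, output to inputs
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


# One-sample squared-error loss for the toy model y_hat = w * x + b.
w, b = Value(2.0), Value(0.5)
x, y = Value(3.0), Value(5.0)
err = w * x + b - y      # 1.5
loss = err * err         # 2.25
loss.backward()          # a single backward pass fills in every .grad

# w.grad is 2 * err * x = 9.0 and b.grad is 2 * err = 3.0
print(loss.data, w.grad, b.grad)
```

A single call to backward() populates the gradient of every weight at once, which is possible because the starting point is one scalar.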


Common Loss Functions

Loss                        Use Case                            Formula
Mean Squared Error (MSE)    Regression                          Σ(ŷ − y)² / n
Cross-Entropy               Classification                      −Σ y log(ŷ)
Binary Cross-Entropy        Binary classification               −[y log(ŷ) + (1−y) log(1−ŷ)]
Next-Token Prediction       Language modelling (pretraining)    Cross-entropy over vocabulary
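As a rough illustration, the formulas in the table can be written in plain Python. The function names are chosen here for the example, and the sketch assumes predicted probabilities lie strictly between 0 and 1 (no clipping is applied).

```python
import math


def mse(y_pred, y_true):
    """Mean squared error: sum of (y_hat - y)^2 divided by n."""
    n = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n


def cross_entropy(probs, target):
    """-sum(y * log(y_hat)) for one example.

    probs is the predicted distribution, target is a one-hot vector.
    Terms with y = 0 contribute nothing, so they are skipped.
    """
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)


def binary_cross_entropy(p, y):
    """-[y log(p) + (1 - y) log(1 - p)] for one example, y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


print(mse([2.5, 0.0], [3.0, -1.0]))                # 0.625
print(cross_entropy([0.7, 0.2, 0.1], [1, 0, 0]))   # equals -log(0.7)
print(binary_cross_entropy(0.9, 1))                # equals -log(0.9)
```

Next-token prediction uses the same cross_entropy computation: probs is the predicted distribution over the vocabulary and target is the one-hot encoding of the token that actually came next.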

Relationship to Gradient Descent

Gradient descent uses the gradients from backpropagation to update the weights:

w ← w − lr × ∂L/∂w

Here lr is the learning rate, a small positive step size. A lower loss after the update means the network’s predictions improved on that data. Iterating this over many batches of data is the core of the training algorithm.
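The update rule can be tried on a toy problem. The sketch below assumes a one-weight model ŷ = w·x, an MSE loss with a hand-derived gradient, and arbitrary illustrative values for the data, the learning rate, and the step count. A real network would obtain the gradient from backpropagation instead.

```python
def mse_and_grad(w, xs, ys):
    """Return the MSE loss of y_hat = w * x and its derivative dL/dw."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad


# Toy data generated by y = 2x, so the best weight is w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0      # initial weight
lr = 0.05    # learning rate (illustrative value)
losses = []

for step in range(100):
    loss, grad = mse_and_grad(w, xs, ys)
    losses.append(loss)
    w = w - lr * grad    # w <- w - lr * dL/dw

print(w, losses[0], losses[-1])
```

With this learning rate the loss shrinks at every step and w approaches 2. A learning rate that is too large would make the updates overshoot and the loss grow instead.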

