Loss Function
A loss function (also called a cost function or objective function) measures how wrong a neural network’s predictions are. It reduces all prediction errors to a single scalar L. Training is the process of minimising L over the dataset by adjusting the network’s weights via gradient descent.
Role in Training
The loss is the root node of the computational graph. Backpropagation begins here:
- Set L.grad = 1.0
- Propagate gradients backward through every operation to every weight
Because the loss is scalar, one backward pass yields ∂L/∂w for every parameter simultaneously, a key property of reverse-mode automatic differentiation.
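The two steps above can be sketched with a tiny scalar autodiff class. This is a minimal illustration loosely in the style of the micrograd source cited below, not its actual code: the `Value` class, its `_parents` and `_backward` fields, and the example numbers are all hypothetical and chosen only for this sketch.

```python
class Value:
    """Hypothetical scalar node that records how it was computed."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0                   # will hold dL/d(this node)
        self._parents = parents
        self._backward = lambda: None     # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __sub__(self, other):
        out = Value(self.data - other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad -= out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Build a topological order so each node is processed after its consumers.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0                   # step 1: seed the root with dL/dL = 1
        for node in reversed(order):      # step 2: propagate toward the weights
            node._backward()


# One-example squared-error loss for the model pred = w * x + b.
w, b = Value(2.0), Value(1.0)
x, y = Value(3.0), Value(10.0)
diff = (w * x + b) - y
loss = diff * diff
loss.backward()
print(loss.data, w.grad, b.grad)
```

A single call to `loss.backward()` fills in `grad` for both `w` and `b`, which is the "one backward pass for every parameter" property described above.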
Common Loss Functions
| Loss | Use Case | Formula |
|---|---|---|
| Mean Squared Error (MSE) | Regression | Σ(ŷ − y)² / n |
| Cross-Entropy | Classification | −Σ y log(ŷ) |
| Binary Cross-Entropy | Binary classification | −[y log(ŷ) + (1−y)log(1−ŷ)] |
| Next-Token Prediction | Language modelling (pretraining) | Cross-entropy over vocabulary |
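The formulas in the table can be written out in plain Python as a rough sketch. The function names are illustrative, and the small `EPS` added inside each logarithm is an assumption made here to avoid `log(0)`; it is not part of the formulas themselves. Predictions for the cross-entropy variants are assumed to already be probabilities.

```python
import math

EPS = 1e-12  # assumption: guards against log(0); not part of the table's formulas


def mse(preds, targets):
    """Mean squared error: sum((y_hat - y)^2) / n."""
    n = len(preds)
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / n


def cross_entropy(probs, one_hot):
    """Cross-entropy: -sum(y * log(y_hat)) over the classes."""
    return -sum(t * math.log(p + EPS) for p, t in zip(probs, one_hot))


def binary_cross_entropy(prob, label):
    """Binary cross-entropy for one example with label 0 or 1."""
    return -(label * math.log(prob + EPS)
             + (1 - label) * math.log(1 - prob + EPS))


def next_token_loss(vocab_probs, target_id):
    """Cross-entropy over the vocabulary when the target is a single token id."""
    return -math.log(vocab_probs[target_id] + EPS)


print(mse([1.0, 2.0], [1.0, 4.0]))
print(cross_entropy([0.5, 0.5], [1, 0]))
print(binary_cross_entropy(0.5, 1))
print(next_token_loss([0.25, 0.5, 0.25], 1))
```

With a one-hot target, `cross_entropy` and `next_token_loss` compute the same quantity: the negative log-probability assigned to the correct class or token.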
Relationship to Gradient Descent
Gradient descent uses the gradients from backpropagation to update weights:
w ← w − lr × ∂L/∂w
Here lr is the learning rate, which sets the size of each step. A lower loss after the update means the network’s predictions improved on that data. Iterating this over many batches of data is, in essence, the entire training algorithm.
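As a minimal sketch of the update rule, the loop below fits a single weight `w` in the model `y = w * x` using MSE. The data, learning rate, and step count are made-up values for illustration. The gradient is written out by hand rather than obtained from backpropagation, and the whole tiny dataset is used as one batch on every step.

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # toy data generated by y = 2 * x

w = 0.0                # initial weight
lr = 0.05              # learning rate (illustrative value)
losses = []

for step in range(50):
    n = len(xs)
    # Forward pass: mean squared error of the current predictions.
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    # Hand-derived gradient: dL/dw = mean(2 * (w * x - y) * x).
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    # Gradient-descent update: w <- w - lr * dL/dw.
    w = w - lr * grad
    losses.append(loss)

print(w, losses[0], losses[-1])
```

On this toy problem the recorded loss shrinks at every step and `w` approaches 2, the slope that generated the data. With a learning rate that is too large, the same loop can overshoot and the loss can grow instead.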
Sources
- karpathy-2022-micrograd-backpropagation — uses loss as the starting point of backpropagation in micrograd