Neural Network
A neural network is a mathematical expression — a computational-graph — that maps input data to outputs through a sequence of operations parameterised by learnable weights. Training a neural network means finding weights that minimise a loss-function measuring prediction error, which is done by backpropagation and gradient-descent.
Key Insight
Neural networks are not magic; they are structured mathematical expressions. The same backpropagation algorithm that computes gradients through any expression graph computes gradients through a neural network, because a network is just a particular kind of graph. andrej-karpathy’s micrograd demonstrates this: a 100-line autograd engine suffices to train a complete network.
Components
Neuron
The basic unit. A neuron computes:
output = activation( Σ(wᵢ × xᵢ) + b )
where xᵢ are inputs, wᵢ are weights, b is a bias, and activation is a nonlinear function (e.g., tanh, ReLU, sigmoid). Without the nonlinearity, stacked layers collapse to a single linear transformation.
Layer
A collection of neurons operating in parallel on the same inputs, producing a vector of outputs. Each neuron in a layer has its own independent weights.
Multi-Layer Perceptron (MLP)
The simplest deep network: a multi-layer-perceptron is a sequence of layers where the output of one layer feeds the input of the next. See multi-layer-perceptron for details.
Training Pipeline
- Forward pass — evaluate the expression from input to loss
- Backward pass — run backpropagation to compute gradients
- Weight update — adjust weights with gradient-descent
- Repeat until loss converges
Scalars vs. Tensors
Pedagogically, networks can be implemented with scalar operations (as in micrograd). In practice, scalars are grouped into tensors (multi-dimensional arrays), enabling parallel hardware execution. pytorch and JAX do this transparently; the mathematics is unchanged.
Relationship to LLMs
large-language-models and the transformer-architecture are specialised neural networks at enormous scale. The same forward/backward training loop applies; the architecture adds attention mechanisms and positional encodings on top of the same fundamental computation.
Sources
- karpathy-2022-micrograd-backpropagation — builds a neural network from scratch using micrograd
- raschka-2024-build-llm-from-scratch — builds a GPT (large transformer neural network) in pytorch