Neural Network

A neural network is a mathematical expression, represented as a computational graph, that maps input data to outputs through a sequence of operations parameterised by learnable weights. Training a neural network means finding weights that minimise a loss function measuring prediction error: backpropagation computes the gradient of the loss with respect to every weight, and gradient descent uses those gradients to update the weights.


Key Insight

Neural networks are not magic; they are structured mathematical expressions. The same backpropagation algorithm that computes gradients through any expression graph computes gradients through a neural network, because a network is just a particular kind of graph. Andrej Karpathy’s micrograd demonstrates this: an autograd engine of roughly 100 lines suffices to train a complete network.
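
To make this concrete, here is a minimal sketch of a scalar autograd engine in the spirit of micrograd. It is not micrograd's actual code: the `Value` class, its attribute names, and the choice to support only addition, multiplication, and tanh are assumptions made for illustration.

```python
import math

class Value:
    """A scalar node in an expression graph (hypothetical, micrograd-inspired)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            # d tanh(z)/dz = 1 - tanh(z)^2
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph so that a node's gradient is complete
        # before it is propagated to the node's parents.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

# A one-input neuron is just an expression: y = tanh(w * x + b)
x, w, b = Value(0.5), Value(-2.0), Value(1.0)
y = (w * x + b).tanh()
y.backward()  # fills in x.grad, w.grad, and b.grad
```

Nothing in `backward` knows that the expression is a neuron; it only walks the graph and applies the chain rule at each node, which is the point of the insight above.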


Components

Neuron

The basic unit. A neuron computes:

output = activation( Σ(wᵢ × xᵢ) + b )

where xᵢ are inputs, wᵢ are weights, b is a bias, and activation is a nonlinear function (e.g., tanh, ReLU, sigmoid). Without the nonlinearity, stacked layers collapse to a single linear (strictly, affine) transformation, because a composition of linear maps is itself linear.
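
As a rough sketch, the formula translates almost directly into plain Python. The function name `neuron` and the example numbers are assumptions for illustration; the second half checks the collapse of stacked linear units on one hand-picked example rather than proving it in general.

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(pre_activation)

# Two inputs: 0.5*1.0 + (-0.5)*2.0 + 1.0 = 0.5, so the output is tanh(0.5).
out = neuron([1.0, 2.0], [0.5, -0.5], 1.0)

# With an identity "activation", two stacked neurons collapse into one:
# 2*(3*x + 1) - 4 equals 6*x - 2 for every x.
def identity(z):
    return z

def stacked(x):
    hidden = neuron([x], [3.0], 1.0, identity)
    return neuron([hidden], [2.0], -4.0, identity)

def collapsed(x):
    return neuron([x], [6.0], -2.0, identity)
```

Swapping `identity` for `math.tanh` in `stacked` breaks the equality with `collapsed`, which is why the nonlinearity is what gives depth its expressive power.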

Layer

A collection of neurons operating in parallel on the same inputs, producing a vector of outputs. Each neuron in a layer has its own independent weights.
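
A layer can be sketched as one neuron computation per row of a weight table, all reading the same input vector. The names `layer`, `weight_rows`, and `biases` are assumptions for this example, and tanh is an arbitrary choice of activation.

```python
import math

def layer(inputs, weight_rows, biases, activation=math.tanh):
    """Apply one neuron per row of weights to the same input vector."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    ]

# Three neurons, each with its own weights and bias, all reading the same two inputs.
inputs = [1.0, -1.0]
weight_rows = [[0.5, 0.5], [1.0, 0.0], [0.0, -2.0]]
biases = [0.0, 0.5, -1.0]
outputs = layer(inputs, weight_rows, biases)  # a vector with one entry per neuron
```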

Multi-Layer Perceptron (MLP)

The simplest deep network: a multi-layer perceptron is a sequence of layers in which the output of each layer is the input to the next. See multi-layer-perceptron for details.
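
The chaining of layers can be sketched as a simple loop. Everything here is illustrative: `init_mlp` and `mlp` are hypothetical helper names, the uniform random initialisation in [-1, 1] is an assumption, and tanh is applied after every layer, including the last.

```python
import math
import random

def layer(inputs, weight_rows, biases, activation=math.tanh):
    """One neuron per row of weights, all reading the same inputs."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    ]

def init_mlp(sizes, seed=0):
    """Create random (weight_rows, biases) pairs for consecutive layer sizes."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weight_rows = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
        params.append((weight_rows, biases))
    return params

def mlp(inputs, params):
    """Feed the output of each layer into the next."""
    activations = inputs
    for weight_rows, biases in params:
        activations = layer(activations, weight_rows, biases)
    return activations

# 3 inputs -> two hidden layers of 4 neurons -> 1 output
params = init_mlp([3, 4, 4, 1])
prediction = mlp([2.0, 3.0, -1.0], params)
```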


Training Pipeline

  1. Forward pass: evaluate the expression from input to loss
  2. Backward pass: run backpropagation to compute the gradient of the loss with respect to each weight
  3. Weight update: adjust the weights with gradient descent, stepping against the gradient
  4. Repeat until the loss converges
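
The four steps can be sketched on a deliberately tiny model, a single linear unit y = w·x + b fitted with mean squared error. The toy data, the learning rate of 0.1, and the fixed 500 steps are all assumptions for this example, and the gradients are written out by hand with the chain rule instead of being produced by an autograd engine.

```python
# Toy data generated from y = 2x + 1 (an assumption made for this example).
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0          # learnable weights, starting from zero
learning_rate = 0.1
losses = []

for step in range(500):
    # 1. Forward pass: predictions, then the mean squared error loss.
    preds = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e * e for e in errors) / len(xs)
    losses.append(loss)

    # 2. Backward pass: gradients of the loss with respect to w and b.
    grad_w = sum(2.0 * e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = sum(2.0 * e for e in errors) / len(xs)

    # 3. Weight update: step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

    # 4. Repeat: a fixed step count stands in for a convergence check.
```

After the loop, `w` and `b` should sit very close to 2 and 1, and `losses` should shrink towards zero. A real training run would typically stop on a convergence criterion or a validation metric, not a hard-coded step count.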

Scalars vs. Tensors

For teaching purposes, a network can be implemented entirely with scalar operations (as in micrograd). In practice, scalars are grouped into tensors (multi-dimensional arrays) so that many operations can run in parallel on hardware such as GPUs. PyTorch and JAX operate on tensors natively; the mathematics is unchanged.
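
The equivalence of the two views can be checked in a small sketch. Plain Python lists stand in for tensors here, so nothing actually runs in parallel; `forward_scalar`, `forward_tensor`, and `matmul` are hypothetical helpers, not PyTorch or JAX functions. The point is only that a layer applied to a batch computes the same numbers whether it is written as individual multiply-adds or as one matrix product tanh(X·Wᵀ + b).

```python
import math

def forward_scalar(batch, weight_rows, biases):
    """Scalar view: one multiply-add at a time, as a scalar engine would do it."""
    outputs = []
    for sample in batch:
        sample_out = []
        for row, b in zip(weight_rows, biases):
            total = b
            for w, x in zip(row, sample):
                total += w * x
            sample_out.append(math.tanh(total))
        outputs.append(sample_out)
    return outputs

def matmul(a, b):
    """Plain-Python matrix product; a tensor library would run this in parallel."""
    return [
        [sum(a_ik * b_kj for a_ik, b_kj in zip(a_row, b_col)) for b_col in zip(*b)]
        for a_row in a
    ]

def forward_tensor(batch, weight_rows, biases):
    """Tensor view: the whole batch at once as tanh(X @ W^T + b)."""
    w_transposed = [list(col) for col in zip(*weight_rows)]
    pre_activations = matmul(batch, w_transposed)
    return [[math.tanh(z + b) for z, b in zip(row, biases)] for row in pre_activations]

batch = [[1.0, 2.0], [0.0, -1.0], [0.5, 0.5]]   # 3 samples, 2 features each
weight_rows = [[0.1, -0.2], [0.3, 0.4]]         # 2 neurons, 2 weights each
biases = [0.0, -0.5]

scalar_out = forward_scalar(batch, weight_rows, biases)
tensor_out = forward_tensor(batch, weight_rows, biases)
```

The two results agree up to floating-point rounding, since the additions happen in a slightly different order in each version.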


Relationship to LLMs

Large language models are specialised neural networks, built on the transformer architecture and trained at enormous scale. The same forward/backward training loop applies; the architecture adds attention mechanisms and positional encodings on top of the same fundamental computation.


Sources