Multi-Layer Perceptron (MLP)
A multi-layer perceptron (MLP) is the foundational deep neural-network architecture: a directed sequence of fully connected (dense) layers, each followed by a nonlinear activation function. Inputs flow forward through every layer; backpropagation flows gradients backward through the same path during training.
Architecture
Input → [Layer 1] → [Layer 2] → … → [Layer N] → Output
Each layer contains k neurons operating in parallel. Each neuron computes:
output = tanh( w₁x₁ + w₂x₂ + … + wₙxₙ + b )
The output vector of layer i is the input vector of layer i+1.
Why Nonlinearity Matters
Without an activation function (e.g., tanh, ReLU), stacking layers is equivalent to a single linear transformation — the network cannot learn non-linear patterns. Nonlinear activations are what makes “deep” networks more expressive than “wide” ones.
MLP as a Class Hierarchy
andrej-karpathy’s micrograd implements the MLP in ~50 lines with three classes:
| Class | Responsibility |
|---|---|
Neuron | Single dot product + bias + tanh |
Layer | List of Neurons operating in parallel |
MLP | List of Layers chained sequentially |
This hierarchy maps cleanly to how all neural network libraries (including pytorch) organise models.
Training
- Forward pass through all layers to compute predictions and loss-function
- backpropagation computes
∂loss/∂wfor every weightw - gradient-descent updates
w ← w − lr × w.grad - Reset gradients to zero; repeat
Relationship to Modern Architectures
MLPs are the conceptual ancestor of all deep networks. The transformer-architecture and gpt-architecture use MLPs as the feed-forward sublayer within each transformer block (the FFN / MLP component). MLPs are also used in mixture-of-experts routing networks.
Sources
- karpathy-2022-micrograd-backpropagation — primary; implements Neuron / Layer / MLP on micrograd