Multi-Layer Perceptron (MLP)

A multi-layer perceptron (MLP) is the foundational deep neural-network architecture: a directed sequence of fully connected (dense) layers, each followed by a nonlinear activation function. Inputs flow forward through every layer; backpropagation flows gradients backward through the same path during training.


Architecture

Input → [Layer 1] → [Layer 2] → … → [Layer N] → Output

Each layer contains k neurons operating in parallel. Each neuron computes:

output = tanh( w₁x₁ + w₂x₂ + … + wₙxₙ + b )

The output vector of layer i is the input vector of layer i+1.

Why Nonlinearity Matters

Without an activation function (e.g., tanh, ReLU), stacking layers is equivalent to a single linear transformation — the network cannot learn non-linear patterns. Nonlinear activations are what makes “deep” networks more expressive than “wide” ones.


MLP as a Class Hierarchy

andrej-karpathy’s micrograd implements the MLP in ~50 lines with three classes:

ClassResponsibility
NeuronSingle dot product + bias + tanh
LayerList of Neurons operating in parallel
MLPList of Layers chained sequentially

This hierarchy maps cleanly to how all neural network libraries (including pytorch) organise models.


Training

  1. Forward pass through all layers to compute predictions and loss-function
  2. backpropagation computes ∂loss/∂w for every weight w
  3. gradient-descent updates w ← w − lr × w.grad
  4. Reset gradients to zero; repeat

Relationship to Modern Architectures

MLPs are the conceptual ancestor of all deep networks. The transformer-architecture and gpt-architecture use MLPs as the feed-forward sublayer within each transformer block (the FFN / MLP component). MLPs are also used in mixture-of-experts routing networks.


Sources