Attention Mechanism

Attention mechanisms allow neural networks to selectively focus on different parts of an input sequence when producing each output. In transformer-based large language models (LLMs), self-attention is the central computation: every token attends to every token in the same sequence (itself included), producing context-enriched vectors.


Historical Context

Before attention: Encoder–decoder RNNs (recurrent neural networks) compressed the entire input into a single hidden state vector — a bottleneck that caused loss of context in long sequences.

Bahdanau attention (2014): Modified encoder-decoder RNNs so the decoder could selectively access all encoder hidden states at each decoding step, weighted by learned relevance scores.

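Transformer self-attention (2017): “Attention Is All You Need” (Vaswani et al.) removed recurrence entirely, building the architecture around multi-head attention combined with position-wise feed-forward layers. Because no step depends on the previous step's hidden state, all positions can be processed in parallel during training.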


Self-Attention

Self-attention differs from cross-attention (between two sequences): here, the same sequence acts as both the source and target of attention. Each position can attend to all positions in the same sequence.

Query, Key, Value Projections

Three learned weight matrices project input embeddings into three distinct spaces:

Projection   Matrix   Role
Query (Q)    W_Q      What this token is looking for
Key (K)      W_K      What this token offers to other tokens’ queries
Value (V)    W_V      The actual information this token contributes

All three are learned during training; they are the mechanism through which the model learns what to attend to.
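
The projections above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the sizes (4 tokens, input dimension 8, projection dimension 6) are arbitrary assumptions, and the weight matrices are random stand-ins for values that would normally be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not taken from any real model).
n_tokens, d_in, d_k = 4, 8, 6

# Stand-in for the embeddings of one sequence: one row per token.
x = rng.normal(size=(n_tokens, d_in))

# Random stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d_in, d_k))
W_K = rng.normal(size=(d_in, d_k))
W_V = rng.normal(size=(d_in, d_k))

Q = x @ W_Q  # what each token is looking for
K = x @ W_K  # what each token offers to other tokens' queries
V = x @ W_V  # the information each token contributes

print(Q.shape, K.shape, V.shape)
```

Every token gets its own query, key and value row, so all three results have one row per token.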

Three-Step Computation

Step 1 — Attention scores (raw affinities)

scores = Q × K^T / sqrt(d_k)

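Each dot product measures the similarity between a query and a key. The scores are divided by sqrt(d_k), where d_k is the dimension of the key vectors, because dot products grow in magnitude as d_k grows. Large scores push the softmax into saturated regions where gradients become vanishingly small.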
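
The effect of the scaling factor can be checked empirically. The sketch below assumes query and key components are independent with zero mean and unit variance (an idealisation, not a property of trained weights); the helper name score_std is hypothetical. Under that assumption the spread of raw dot products grows roughly like sqrt(d_k), while the scaled scores stay near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_std(d_k, scaled, n_samples=2000):
    """Empirical standard deviation of q.k for random unit-variance vectors."""
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    dots = np.sum(q * k, axis=1)      # one dot product per sample
    if scaled:
        dots = dots / np.sqrt(d_k)    # the scaling used in Step 1
    return float(dots.std())

for d_k in (4, 64, 1024):
    raw = score_std(d_k, scaled=False)
    fixed = score_std(d_k, scaled=True)
    print(f"d_k={d_k:5d}  raw std={raw:6.2f}  scaled std={fixed:5.2f}")
```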

Step 2 — Attention weights (normalised)

weights = softmax(scores)

Softmax is applied row-wise, so the weights for each query are non-negative and sum to 1, giving a probability-like distribution over positions. A high weight means high relevance.
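
A row-wise softmax might look like the following sketch. Subtracting the row maximum before exponentiating is a common numerical-stability trick that does not change the result; it is an implementation choice assumed here, not something the text above prescribes.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Row-wise softmax; subtracting the max avoids overflow in exp."""
    shifted = scores - np.max(scores, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# Two example score rows: one with a clear winner, one with all scores equal.
scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
weights = softmax(scores)
print(weights)
print(weights.sum(axis=-1))  # each row sums to 1
```

Equal scores produce uniform weights, and the largest score in a row always receives the largest weight.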

Step 3 — Context vector (weighted aggregation)

context = weights × V

The context vector is the attention-weighted sum of all value vectors — an enriched representation of the current token that incorporates information from all positions in the sequence.
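
Putting the three steps together, a single-head, unmasked version could be sketched as below. The function name and the random inputs are illustrative assumptions; in a real model Q, K and V would come from the learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (context, weights) for one sequence and one head, no masking."""
    d_k = Q.shape[-1]
    # Step 1: raw affinities, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Step 2: row-wise softmax (max subtracted for numerical stability)
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Step 3: weighted sum of value vectors
    context = weights @ V
    return context, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))
K = rng.normal(size=(5, 16))
V = rng.normal(size=(5, 16))

context, weights = scaled_dot_product_attention(Q, K, V)
print(context.shape, weights.shape)
```

Because each row of weights is a set of non-negative numbers summing to 1, every context vector is a convex combination of the value vectors.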


Causal (Masked) Self-Attention

GPT-like models are autoregressive: the representation at position i is used to predict the next token, so it must not attend to positions j > i (future tokens that do not yet exist at inference time). The causal mask enforces this:

  • Set attention scores for all positions j > i to −∞ before softmax
  • Softmax of −∞ → 0, so those positions receive zero weight
  • Implemented as a mask over the strictly upper-triangular part of the score matrix (the entries above the diagonal), applied element-wise before softmax
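
The masking steps above can be sketched with NumPy as follows. The function name is hypothetical, and the all-zero score matrix is chosen only so the effect of the mask is easy to see.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over scores with positions j > i masked out."""
    n = scores.shape[-1]
    # True strictly above the diagonal, i.e. where key index j > query index i
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum
    shifted = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)  # exp(-inf) evaluates to 0
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # equal affinities everywhere
weights = causal_attention_weights(scores)
print(weights)
```

With equal scores, row i spreads its weight uniformly over positions 0 to i, and every entry above the diagonal is exactly zero.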

Dropout is applied to the attention weight matrix during training as a regularisation measure.
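
As a rough illustration, dropout on an attention weight matrix could be sketched as below. The sketch assumes the common "inverted dropout" convention, in which surviving entries are rescaled by 1 / (1 - p) so the expected value is unchanged; the text above does not specify the variant, and the function name is hypothetical.

```python
import numpy as np

def attention_dropout(weights, p, rng, training=True):
    """Zero each weight with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return weights  # dropout is disabled at inference time
    keep = rng.random(weights.shape) >= p  # True with probability 1 - p
    return weights * keep / (1.0 - p)

rng = np.random.default_rng(0)
weights = np.full((4, 4), 0.25)  # uniform attention weights
dropped = attention_dropout(weights, p=0.5, rng=rng)
print(dropped)
```

After dropout, individual rows generally no longer sum exactly to 1; only the expected value of each entry is preserved.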


Multi-Head Attention

Instead of one set of Q, K, V matrices, multi-head attention runs h parallel attention operations (“heads”), each with its own learned weights:

  1. Project input into h separate Q, K, V triplets (each of dimension d_k = d_model / h)
  2. Compute scaled dot-product attention independently for each head
  3. Concatenate all h context vectors
  4. Apply a final linear projection W_O to project back to d_model
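
The four steps above might be implemented along these lines. This is a sketch under stated assumptions: it uses one d_model × d_model matrix per projection and slices the result into h heads (equivalent to h separate d_model × d_k matrices), omits the causal mask and bias terms for brevity, and uses random weights in place of trained ones.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_Q, W_K, W_V, W_O, h):
    """Unmasked multi-head self-attention for one sequence x of shape (n, d_model)."""
    n, d_model = x.shape
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // h

    def split_heads(t):
        # (n, d_model) -> (h, n, d_k): head i takes columns i*d_k .. (i+1)*d_k - 1
        return t.reshape(n, h, d_k).transpose(1, 0, 2)

    # Step 1: project and split into h triplets
    Q, K, V = split_heads(x @ W_Q), split_heads(x @ W_K), split_heads(x @ W_V)
    # Step 2: scaled dot-product attention, independently per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    context = softmax(scores) @ V                      # (h, n, d_k)
    # Step 3: concatenate the heads back to (n, d_model)
    concat = context.transpose(1, 0, 2).reshape(n, d_model)
    # Step 4: final output projection
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
x = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(x, W_Q, W_K, W_V, W_O, h)
print(out.shape)
```

The output has the same shape as the input, which is what allows the layer to be stacked and combined with residual connections.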

Why multiple heads? Each head can specialise in attending to different types of relationships (syntactic, semantic, positional) at different distances in the sequence. The concatenation combines these diverse representations.

GPT-2 (124M) configuration: 12 heads, embed dim 768, head dim 64 (768 / 12).
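
The arithmetic behind this configuration can be checked directly. The weight count below is a back-of-the-envelope figure for a single attention layer that ignores bias terms, which is a simplifying assumption made here rather than a statement about the actual checkpoint.

```python
d_model = 768   # embedding dimension
n_heads = 12

assert d_model % n_heads == 0
head_dim = d_model // n_heads
print(head_dim)  # 64

# Weight matrices in one attention layer: W_Q, W_K, W_V and W_O,
# each d_model x d_model (bias terms not counted).
attn_weights = 4 * d_model * d_model
print(attn_weights)  # 2359296
```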


Attention in the GPT Transformer Block

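Within each transformer block, causal multi-head self-attention is wrapped in a residual connection and paired with layer normalisation (applied before the attention sub-layer in GPT-2-style pre-LN blocks), and is followed by a feed-forward network that receives the same treatment. The block is repeated N times. See gpt-architecture for the full block structure.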


Complexity

Self-attention is O(n²) in sequence length n, because every pair of tokens interacts. This quadratic cost, together with the memory needed to cache keys and values during generation, motivates more efficient variants in long-context models:

  • Grouped-Query Attention (GQA): shares each K/V head across a group of Q heads; shrinks the KV cache but leaves the attention score computation unchanged
  • Sliding Window Attention: limits each token to attending only within a local window of size w; O(n·w)
  • Multi-Head Latent Attention (MLA): DeepSeek’s KV cache compression; reduces memory bandwidth
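
Two of these savings can be illustrated with simple counting. The sketch below assumes a causal sliding window (each token sees itself and the w - 1 tokens before it) and uses hypothetical model sizes (4096 tokens, 32 layers, 32 query heads, 8 KV heads, head dimension 128, 2 bytes per value) chosen only for illustration; both helper functions are hypothetical names.

```python
import numpy as np

def sliding_window_causal_mask(n, w):
    """True where query i may attend to key j: j <= i and i - j < w."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < w)

def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of the cached keys and values for one sequence."""
    # Factor 2: one key and one value vector per token, layer and KV head
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value

# Sliding window: number of (query, key) pairs actually scored
n, w = 1024, 128
full_pairs = int(np.tril(np.ones((n, n), dtype=bool)).sum())
window_pairs = int(sliding_window_causal_mask(n, w).sum())
print(full_pairs, window_pairs)  # window_pairs is bounded by n * w

# GQA: same number of query heads, fewer K/V heads to cache
mha_cache = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128)
gqa_cache = kv_cache_bytes(4096, n_layers=32, n_kv_heads=8, head_dim=128)
print(mha_cache / gqa_cache)  # 4.0
```

The cache shrinks in proportion to the number of K/V heads, while the number of query-key score computations per query head stays the same.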

Sources: raschka-2024-build-llm-from-scratch | fridman-lambert-raschka-2026-state-of-ai