Build a Large Language Model (From Scratch) — Sebastian Raschka (2024)

Author: sebastian-raschka (ML researcher and educator)
Publisher: manning-publications (Manning Publications Co., LLC)
Year: 2024
ISBN: 9781633437166
Raw file: raw/papers/Build a large language model.md
Note: This is the MEAP (Manning Early Access Program) edition. It includes Chapters 1–5 and Appendices A–D. Chapters 6–7 (fine-tuning for classification and RLHF) are referenced but not yet included.


Overview

Raschka builds a GPT-like language model from the ground up, using PyTorch without relying on high-level LLM libraries. The philosophy: deep understanding requires implementation. The approach follows the actual GPT-2 architecture step by step, producing a working 124M-parameter model by the end of Chapter 5.

The book has three stages:

  1. Implement the architecture — attention mechanism, GPT model structure (Ch 2–4)
  2. Pretrain — next-token prediction on text data (Ch 5)
  3. Fine-tune — instruction following and classification (Ch 6–7, not in MEAP)

Chapter 1 — Understanding Large Language Models

LLMs are deep neural networks trained on massive text corpora to predict the next word. The “large” refers both to parameter count (tens to hundreds of billions) and training data scale.

Two-stage training paradigm:

  1. Pretraining — self-supervised next-word prediction on unlabeled internet text; creates the foundation model
  2. Fine-tuning — supervised training on smaller labeled datasets for specific tasks (instruction following, classification)

GPT architecture key facts:

  • GPT = Generative Pretrained Transformer (Radford et al., OpenAI, 2018)
  • Uses only the decoder portion of the original encoder–decoder transformer
  • GPT-3: 96 transformer layers, 175 billion parameters
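  • ChatGPT: the original model was GPT-3 fine-tuned with the method from OpenAI's InstructGPT paper (instruction data plus RLHF)
  • Modern LLMs (Llama, etc.) still build on the same decoder-only design as GPT-2, with comparatively minor modifications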

Emergent behaviour: GPT models can translate, classify, and summarise despite being trained only on next-word prediction — capabilities that were not explicitly trained and surprised researchers.


Chapter 2 — Working with Text Data

Tokenization

Text must be converted to numbers before an LLM can process it. Steps:

  1. Split text into tokens (words, subwords, or characters)
  2. Build a vocabulary mapping tokens → integer IDs
  3. Encode inputs to token ID sequences; decode outputs back to text

The tokenizer has two methods: encode(text) → [ids] and decode([ids]) → text.

Special Tokens

Token          Purpose
<|unk|>        Unknown word not in vocabulary
<|endoftext|>  Separator between independent documents in training data
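
The encode/decode interface and the two special tokens can be sketched with a small word-level tokenizer. This is an illustrative sketch, not the book's listing: the class name SimpleTokenizer, the splitting regex, and the sample sentences are assumptions made for the example.

```python
import re

class SimpleTokenizer:
    """Word-level tokenizer sketch with <|unk|> and <|endoftext|> support."""

    SPECIALS = {"<|endoftext|>", "<|unk|>"}

    def __init__(self, training_text):
        # Vocabulary = unique tokens in the training text plus the special tokens.
        vocab = sorted(set(self._split(training_text)) | self.SPECIALS)
        self.str_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_str = {i: tok for tok, i in self.str_to_id.items()}

    @staticmethod
    def _split(text):
        # Keep <|endoftext|> intact; split on punctuation and whitespace.
        parts = re.split(r"(<\|endoftext\|>|[,.:;?!()]|\s)", text)
        return [p.strip() for p in parts if p.strip()]

    def encode(self, text):
        # Tokens missing from the vocabulary map to the <|unk|> ID.
        unk = self.str_to_id["<|unk|>"]
        return [self.str_to_id.get(tok, unk) for tok in self._split(text)]

    def decode(self, ids):
        text = " ".join(self.id_to_str[i] for i in ids)
        # Remove the space that join() inserted before punctuation.
        return re.sub(r"\s+([,.:;?!])", r"\1", text)

tok = SimpleTokenizer("The quick brown fox jumps.")
ids = tok.encode("The fox jumps. <|endoftext|> The zebra jumps.")
decoded = tok.decode(ids)  # "zebra" is not in the vocabulary, so it becomes <|unk|>
```

Because decoding maps every unknown word to the same <|unk|> string, the round trip is lossy for out-of-vocabulary text. That limitation is what subword tokenization avoids.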

Byte Pair Encoding (BPE)

BPE is the tokenization algorithm used by GPT-2/GPT-3. It iteratively merges the most frequent adjacent token pairs to build a vocabulary, enabling subword tokenization. Unknown words are broken into known subwords or individual characters, so no <|unk|> token is needed.

GPT-2 BPE vocabulary size: 50,257 tokens.
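
The merge procedure can be illustrated with a toy, character-level trainer. This is a simplified sketch under stated assumptions: GPT-2's real tokenizer works on bytes and ships a fixed, pre-learned merge table, and the function names learn_bpe_merges, merge_pair, and apply_merges are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Return the merged pairs in the order they were learned."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe_merges(["aab"] * 3 + ["ab"] * 2, num_merges=2)
tokens = apply_merges("xab", merges)  # "x" was never seen, so it stays a single character
```

The unseen character falls back to a single-character token instead of an unknown-word placeholder, which is the property the paragraph above describes.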

Data Sampling with a Sliding Window

Training samples are created by sliding a window of context_length tokens over the tokenized text. Each input is a sequence of token IDs; the target is the same window shifted one position to the right, so the target at each position is the token that follows it in the text (next-token prediction).
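
A minimal sketch of the windowing, using plain Python lists. The function name make_samples and the stride parameter (how far the window moves between samples) are assumptions for illustration; integer IDs 0..9 stand in for real token IDs.

```python
def make_samples(token_ids, context_length, stride):
    """Return (inputs, targets) where each target is its input shifted by one token."""
    inputs, targets = [], []
    # Stop early enough that the shifted target window still fits in the data.
    for i in range(0, len(token_ids) - context_length, stride):
        inputs.append(token_ids[i : i + context_length])
        targets.append(token_ids[i + 1 : i + context_length + 1])
    return inputs, targets

inputs, targets = make_samples(list(range(10)), context_length=4, stride=4)
# inputs  -> [[0, 1, 2, 3], [4, 5, 6, 7]]
# targets -> [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With stride equal to context_length the windows do not overlap; a smaller stride yields more, overlapping samples from the same text.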

Token Embeddings and Positional Embeddings

  • Token embedding layer — a learnable lookup table of shape [vocab_size, embed_dim]; converts each token ID to a dense vector
  • Positional embedding layer — a learnable lookup table of shape [context_length, embed_dim]; encodes position information (absolute positional embeddings in GPT-2)
  • Final input = token embedding + positional embedding
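
The two lookup tables and their sum can be sketched in PyTorch as follows. The sizes are deliberately tiny and the token IDs are made up; GPT-2 itself uses the dimensions given in the Chapter 4 configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small illustrative sizes (assumed for the example, not GPT-2's real values).
vocab_size, context_length, embed_dim = 100, 16, 8

tok_emb = nn.Embedding(vocab_size, embed_dim)      # shape [vocab_size, embed_dim]
pos_emb = nn.Embedding(context_length, embed_dim)  # shape [context_length, embed_dim]

token_ids = torch.tensor([[5, 17, 42, 3],
                          [8, 1, 99, 0]])          # [batch=2, seq=4]
positions = torch.arange(token_ids.shape[1])       # tensor([0, 1, 2, 3])

# The positional vectors [seq, embed_dim] broadcast across the batch dimension.
x = tok_emb(token_ids) + pos_emb(positions)        # [2, 4, 8]
```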

Chapter 3 — Coding Attention Mechanisms

Motivation: The Problem with RNNs

Encoder–decoder RNNs (pre-transformer) compressed the entire input into a single hidden state vector — a bottleneck causing loss of context in long sequences. Bahdanau attention (2014) let the decoder selectively access all encoder hidden states, solving this for RNNs. Transformers (2017) replaced the sequential RNN entirely with a pure attention mechanism.

Self-Attention

Self-attention allows each position in a sequence to attend to all positions in the same sequence. For each input token, it computes a context vector — an enriched representation that incorporates information from all other tokens.

Three steps:

  1. Compute attention scores (dot products of query with all keys)
  2. Normalize with softmax → attention weights (sum to 1, always positive)
  3. Compute context vector = weighted sum of value vectors

Trainable weight matrices W_Q, W_K, W_V project the input embeddings X into queries (Q), keys (K), and values (V):

  • Q = X × W_Q — what the token is looking for
  • K = X × W_K — what the token offers for others to find
  • V = X × W_V — the actual information the token contributes

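Attention score scaling: divide the scores by sqrt(d_k), where d_k is the key dimension, before the softmax. Large dot products would otherwise push the softmax toward a near one-hot output, where gradients become very small.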
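
The three steps, including the sqrt(d_k) scaling, can be sketched for a single unbatched sequence. Random tensors stand in for trained weights, and the sizes are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
seq_len, d_in, d_k = 4, 8, 6

X = torch.randn(seq_len, d_in)   # one embedding vector per token
W_Q = torch.randn(d_in, d_k)     # stand-ins for trained projection weights
W_K = torch.randn(d_in, d_k)
W_V = torch.randn(d_in, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T                                      # step 1: [seq_len, seq_len]
weights = torch.softmax(scores / d_k**0.5, dim=-1)    # step 2: each row sums to 1
context = weights @ V                                 # step 3: [seq_len, d_k]
```

Row i of weights says how strongly token i attends to every token in the sequence, and row i of context is the resulting weighted sum of value vectors.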

Causal (Masked) Self-Attention

GPT is autoregressive: when predicting token t, it cannot attend to tokens after position t. A causal mask sets attention scores for future positions to −∞ before softmax, which drives their weights to zero. Implemented efficiently with an upper-triangular mask matrix.

Dropout is applied to the attention weights matrix during training.
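
A short sketch of the masking step. It starts from a random score matrix that is assumed to be already scaled, and the dropout rate of 0.1 is only an example value.

```python
import torch

torch.manual_seed(0)
seq_len = 4
scores = torch.randn(seq_len, seq_len)  # assumed: scaled attention scores

# True above the diagonal marks the "future" positions to hide.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(mask, float("-inf"))

# exp(-inf) = 0, so masked positions receive zero weight and each row still sums to 1.
weights = torch.softmax(masked_scores, dim=-1)

# In training mode, dropout zeroes random weights and rescales the rest by 1 / (1 - p).
dropout = torch.nn.Dropout(p=0.1)
dropout.train()
dropped = dropout(weights)
```

Masking before the softmax, instead of zeroing weights afterwards, means no renormalization step is needed.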

Multi-Head Attention

Run multiple self-attention operations (“heads”) in parallel, each with its own Q, K, V weights. Outputs are concatenated and projected back to the model dimension. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

GPT-2 (124M): 12 attention heads, embed dim 768, head dim 64.
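
One common way to implement this is to compute all heads with a single set of projection layers and then reshape, as in the sketch below. It is written for illustration and is not the book's listing; the class and attribute names are assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Causal multi-head self-attention computed with batched matrix products."""

    def __init__(self, embed_dim, num_heads, context_length, dropout=0.0):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must divide evenly across heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)
        mask = torch.triu(torch.ones(context_length, context_length, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def _split_heads(self, z):
        # [batch, seq, embed_dim] -> [batch, heads, seq, head_dim]
        b, t, _ = z.shape
        return z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        b, t, d = x.shape
        q = self._split_heads(self.W_q(x))
        k = self._split_heads(self.W_k(x))
        v = self._split_heads(self.W_v(x))
        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        # Concatenate the heads again: [batch, heads, seq, head_dim] -> [batch, seq, embed_dim]
        context = (weights @ v).transpose(1, 2).contiguous().view(b, t, d)
        return self.out_proj(context)

torch.manual_seed(0)
mha = MultiHeadAttention(embed_dim=768, num_heads=12, context_length=16)
mha.eval()
x = torch.randn(2, 10, 768)
with torch.no_grad():
    out = mha(x)  # same shape as the input: [2, 10, 768]
```

With embed_dim 768 and 12 heads, each head works in a 64-dimensional subspace, matching the GPT-2 (124M) figures above.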


Chapter 4 — Implementing a GPT Model from Scratch

Transformer Block Components

Component                            Purpose
Layer Normalisation (Pre-LayerNorm)  Normalises inputs to each sub-layer; stabilises gradients; applied before multi-head attention and FFN
Multi-Head Attention                 Core self-attention layer
Dropout                              Regularisation; applied after attention, after FFN
Shortcut/Residual Connections        Skip connections from input to output of each sub-layer; prevent vanishing gradients in deep networks
Feed-Forward Network (FFN)           Two linear layers with GELU activation; expand to 4× embed dim then contract; applied token-wise
GELU activation                      Smooth non-linearity used in GPT (vs ReLU); better gradient properties

Full GPT Architecture

Input tokens
  → Token embedding + Positional embedding
  → [Transformer block] × N_layers
       Pre-LayerNorm
       Multi-Head Attention (causal mask)
       Residual add
       Pre-LayerNorm
       Feed-Forward Network (GELU)
       Residual add
  → Final LayerNorm
  → Linear projection (embed_dim → vocab_size)
  → Logits over vocabulary
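
The data flow above can be sketched as a compact PyTorch module. To keep the example short it uses PyTorch's built-in nn.MultiheadAttention for the attention sub-layer, whereas the book implements attention itself; the class names TransformerBlock and TinyGPT and the small test sizes are assumptions.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, drop_rate):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=drop_rate, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),  # expand to 4x
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),  # contract back
        )
        self.drop = nn.Dropout(drop_rate)

    def forward(self, x):
        t = x.shape[1]
        # True entries are positions a token may NOT attend to (the future).
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)                                   # pre-LayerNorm
        h, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + self.drop(h)                                # residual add
        h = self.ffn(self.norm2(x))                         # pre-LayerNorm + FFN
        return x + self.drop(h)                             # residual add

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, context_length, embed_dim, num_layers, num_heads, drop_rate):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(context_length, embed_dim)
        self.drop = nn.Dropout(drop_rate)
        self.blocks = nn.Sequential(
            *[TransformerBlock(embed_dim, num_heads, drop_rate) for _ in range(num_layers)]
        )
        self.final_norm = nn.LayerNorm(embed_dim)
        self.out_head = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, token_ids):
        t = token_ids.shape[1]
        positions = torch.arange(t, device=token_ids.device)
        x = self.drop(self.tok_emb(token_ids) + self.pos_emb(positions))
        x = self.blocks(x)
        return self.out_head(self.final_norm(x))            # logits [batch, seq, vocab]

torch.manual_seed(0)
model = TinyGPT(vocab_size=100, context_length=16, embed_dim=32, num_layers=2, num_heads=4, drop_rate=0.1)
logits = model(torch.randint(0, 100, (2, 8)))
```

The model returns one row of vocabulary logits per input position, which is what both the training loss and text generation consume.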

GPT-2 (124M) Configuration

Parameter                     Value
Vocabulary size               50,257
Context length                1,024 tokens
Embedding dimension           768
Number of transformer layers  12
Number of attention heads     12
Dropout rate                  0.1
Total parameters              ~124M
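
The ~124M total can be reproduced with simple arithmetic over the configuration. The sketch below assumes biases on every linear layer except the output head, scale and shift parameters in each LayerNorm, and an output projection that shares (ties) its weights with the token embedding. The dictionary keys and the function name are hypothetical.

```python
GPT2_SMALL = {
    "vocab_size": 50257,
    "context_length": 1024,
    "embed_dim": 768,
    "num_layers": 12,
    "num_heads": 12,
    "drop_rate": 0.1,
}

def count_parameters(cfg, qkv_bias=True, tie_output_head=True):
    d, v, n = cfg["embed_dim"], cfg["vocab_size"], cfg["num_layers"]
    embeddings = v * d + cfg["context_length"] * d        # token + positional tables
    qkv = 3 * (d * d + (d if qkv_bias else 0))            # query, key, value projections
    attn_out = d * d + d                                  # attention output projection
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)           # expand to 4x, then contract
    layer_norms = 2 * (2 * d)                             # two LayerNorms per block
    block = qkv + attn_out + ffn + layer_norms
    final_norm = 2 * d
    out_head = 0 if tie_output_head else d * v            # a tied head adds no new weights
    return embeddings + n * block + final_norm + out_head

total_tied = count_parameters(GPT2_SMALL)                              # 124,439,808
total_untied = count_parameters(GPT2_SMALL, tie_output_head=False)     # 163,037,184
```

Under these assumptions the token embedding alone accounts for about 38.6M parameters, so the total depends heavily on whether the output head is tied to it.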

Text Generation

Greedy decoding:

  1. Pass input token IDs through the model → logits [batch, seq, vocab]
  2. Take logits for the last position
  3. Apply softmax → probability distribution over vocabulary
  4. Select the token with the highest probability (argmax)
  5. Append to input sequence; repeat

Without training, the model generates incoherent text because weights are random.
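
The five steps can be sketched as a loop. The function name generate_greedy and the tiny untrained stand-in model are assumptions; any module that maps token IDs of shape [batch, seq] to logits of shape [batch, seq, vocab] would work in its place.

```python
import torch
import torch.nn as nn

def generate_greedy(model, token_ids, max_new_tokens, context_length):
    """Append max_new_tokens greedily chosen tokens to token_ids ([batch, seq])."""
    model.eval()
    for _ in range(max_new_tokens):
        # Crop to the most recent tokens the model can see.
        context = token_ids[:, -context_length:]
        with torch.no_grad():
            logits = model(context)                    # [batch, seq, vocab]
        last_logits = logits[:, -1, :]                 # logits for the last position
        probs = torch.softmax(last_logits, dim=-1)
        next_id = torch.argmax(probs, dim=-1, keepdim=True)   # [batch, 1]
        token_ids = torch.cat([token_ids, next_id], dim=1)
    return token_ids

torch.manual_seed(0)
# Hypothetical stand-in "model": untrained, so its output is meaningless.
toy_model = nn.Sequential(nn.Embedding(50, 16), nn.Linear(16, 50))
start = torch.tensor([[1, 2, 3]])
generated = generate_greedy(toy_model, start, max_new_tokens=5, context_length=4)
```

Since softmax preserves the ordering of the logits, taking argmax over the raw logits would select the same token; the softmax step is kept here to mirror the list above.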


Chapter 5 — Pretraining on Unlabeled Data

Loss Function: Cross-Entropy

Cross-entropy loss measures how well the model’s predicted probability distribution matches the target tokens. It equals the negative average log probability of the correct next tokens — lower is better. PyTorch: F.cross_entropy(logits, targets).
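
The equivalence between F.cross_entropy and the negative average log probability of the correct tokens can be checked directly. Random logits and targets stand in for real model output; the batch and sequence dimensions are flattened because F.cross_entropy expects inputs of shape [N, classes] and targets of shape [N].

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq, vocab = 2, 5, 11
logits = torch.randn(batch, seq, vocab)            # stand-in for model output
targets = torch.randint(0, vocab, (batch, seq))    # stand-in for next-token IDs

# Built-in loss on flattened tensors: [batch*seq, vocab] and [batch*seq].
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# Manual version: pick the log probability of each correct token, average, negate.
log_probs = torch.log_softmax(logits, dim=-1)
correct_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
manual_loss = -correct_log_probs.mean()
```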

Training Loop

Standard deep learning loop: forward pass → compute loss → backward pass (backpropagation) → optimizer step (AdamW), which updates the weights. Loss is tracked on separate training and validation splits to detect overfitting.
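
A minimal sketch of such a loop, using a tiny stand-in model and random token data in place of a GPT and a real corpus. The hyperparameters (learning rate, weight decay, step count) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 20
# Hypothetical stand-in model: predicts the next token from the current one only.
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.1)

data = torch.randint(0, vocab_size, (8, 9))   # 8 random sequences of 9 token IDs
train_data, val_data = data[:6], data[6:]     # simple train/validation split

def batch_loss(batch):
    inputs, targets = batch[:, :-1], batch[:, 1:]   # targets = inputs shifted by one
    logits = model(inputs)                          # forward pass
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

train_losses, val_losses = [], []
for step in range(50):
    model.train()
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = batch_loss(train_data)  # forward pass + loss
    loss.backward()                # backward pass
    optimizer.step()               # AdamW weight update

    model.eval()
    with torch.no_grad():          # no gradients needed for evaluation
        val_loss = batch_loss(val_data)
    train_losses.append(loss.item())
    val_losses.append(val_loss.item())
```

Because the data here is random, the model can only memorize the training sequences, so a falling training loss alongside a flat or rising validation loss is the overfitting signature the split is meant to expose.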

Decoding Strategies

Beyond greedy decoding, the book covers:

  • Temperature scaling — divide logits by temperature T before softmax; T < 1 sharpens distribution (more deterministic); T > 1 flattens it (more random/creative)
  • Top-k sampling — restrict sampling to the k most probable tokens; prevents the model from selecting very low-probability tokens
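
Both strategies can be combined in a few lines. The function names and the example logits are assumptions; the logits tensor has shape [batch, vocab], that is, the model output for the last position only.

```python
import torch

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn last-position logits [batch, vocab] into a sampling distribution."""
    if top_k is not None:
        # Keep the k largest logits per row; push everything else to -inf.
        kth_largest = torch.topk(logits, top_k, dim=-1).values[..., -1:]
        logits = logits.masked_fill(logits < kth_largest, float("-inf"))
    return torch.softmax(logits / temperature, dim=-1)

def sample_next(logits, temperature=1.0, top_k=None):
    probs = next_token_probs(logits, temperature, top_k)
    return torch.multinomial(probs, num_samples=1)   # [batch, 1]

logits = torch.tensor([[1.0, 5.0, 3.0, 0.0]])
sharp = next_token_probs(logits, temperature=0.5)    # more peaked than T = 1
flat = next_token_probs(logits, temperature=2.0)     # closer to uniform
top2 = next_token_probs(logits, top_k=2)             # only tokens 1 and 2 keep probability
choice = sample_next(logits, top_k=1)                # top_k=1 reduces to greedy decoding
```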

Loading Pretrained Weights

The book demonstrates loading GPT-2 weights from OpenAI (via HuggingFace), allowing a model implemented from scratch to produce coherent text immediately, confirming the implementation is correct.


Appendix A — Introduction to PyTorch

Covers: tensors, autograd, backpropagation, neural network layers, training loops, model saving/loading. Prerequisite for readers new to PyTorch.


Key Claims

Claim                                                                            Implication
Next-token prediction is sufficient to produce emergent multi-task capabilities  LLM generality requires no task-specific supervision during pretraining
GPT-2 architecture is fundamentally unchanged in modern LLMs                     Mastering the 2019 implementation gives insight into GPT-4, Llama 3, etc.
Hands-on implementation from scratch is the best way to understand LLMs          Reading about attention is not the same as understanding it; implementing it builds that understanding
Loading pretrained weights into a scratch implementation verifies correctness    Implementation validation without training from scratch

Related concepts: attention-mechanism | tokenization | word-embeddings | gpt-architecture | transformer-architecture | large-language-models | pretraining