Build a Large Language Model (From Scratch) — Sebastian Raschka (2024)
Author: sebastian-raschka (ML researcher and educator)
Publisher: manning-publications (Manning Publications Co., LLC)
Year: 2024
ISBN: 9781633437166
Raw file: raw/papers/Build a large language model.md
Note: This is the MEAP (Manning Early Access Program) edition. It includes Chapters 1–5 and Appendices A–D. Chapters 6–7 (fine-tuning for classification and for instruction following) are referenced but not yet included.
Overview
Raschka builds a GPT-like language model from the ground up, using PyTorch without relying on high-level LLM libraries. The philosophy: deep understanding requires implementation. The approach follows the actual GPT-2 architecture step by step, producing a working 124M-parameter model by the end of Chapter 5.
The book has three stages:
- Implement the architecture — attention mechanism, GPT model structure (Ch 2–4)
- Pretrain — next-token prediction on text data (Ch 5)
- Fine-tune — instruction following and classification (Ch 6–7, not in MEAP)
Chapter 1 — Understanding Large Language Models
LLMs are deep neural networks trained on massive text corpora to predict the next word. The “large” refers both to parameter count (tens to hundreds of billions) and training data scale.
Two-stage training paradigm:
- Pretraining — self-supervised next-word prediction on unlabeled internet text; creates the foundation model
- Fine-tuning — supervised training on smaller labeled datasets for specific tasks (instruction following, classification)
GPT architecture key facts:
- GPT = Generative Pretrained Transformer (Radford et al., OpenAI, 2018)
- Uses only the decoder portion of the original encoder–decoder transformer
- GPT-3: 96 transformer layers, 175 billion parameters
- ChatGPT: GPT-3 fine-tuned via InstructGPT (RLHF)
- Modern LLMs (Llama, etc.) still build on the same decoder-only transformer core as GPT-2, with refinements rather than a new architecture
Emergent behaviour: GPT models can translate, classify, and summarise despite being trained only on next-word prediction — capabilities that were not explicitly trained and surprised researchers.
Chapter 2 — Working with Text Data
Tokenization
Text must be converted to numbers before an LLM can process it. Steps:
- Split text into tokens (words, subwords, or characters)
- Build a vocabulary mapping tokens → integer IDs
- Encode inputs to token ID sequences; decode outputs back to text
The tokenizer has two methods: encode(text) → [ids] and decode([ids]) → text.
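The encode/decode interface above can be sketched with a toy word-level tokenizer. This is a minimal illustration, not the book's exact code: the class name `SimpleTokenizer`, the helper `build_vocab`, and the splitting regex are assumptions made for this example.

```python
import re

# Illustrative splitting rule: punctuation and whitespace separate tokens.
SPLIT_PATTERN = r'([,.?!"()\']|\s)'


def build_vocab(text):
    """Map every unique token in `text` to an integer ID (hypothetical helper)."""
    tokens = sorted(set(t for t in re.split(SPLIT_PATTERN, text) if t.strip()))
    tokens += ["<|endoftext|>", "<|unk|>"]  # special tokens go last
    return {tok: i for i, tok in enumerate(tokens)}


class SimpleTokenizer:
    """Toy word-level tokenizer with encode(text) and decode(ids)."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        tokens = [t for t in re.split(SPLIT_PATTERN, text) if t.strip()]
        unk = self.str_to_int["<|unk|>"]
        # Out-of-vocabulary tokens fall back to the <|unk|> ID.
        return [self.str_to_int.get(t, unk) for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that join() placed before punctuation.
        return re.sub(r'\s+([,.?!"()\'])', r"\1", text)


vocab = build_vocab("the cat sat on the mat.")
tok = SimpleTokenizer(vocab)
ids = tok.encode("the dog sat.")
print(ids)              # [5, 7, 4, 0]
print(tok.decode(ids))  # the <|unk|> sat.
```

The word "dog" is not in the toy vocabulary, so it round-trips as `<|unk|>`. This limitation is what subword tokenization avoids.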
Special Tokens
| Token | Purpose |
|---|---|
| <\|unk\|> | Unknown word not in vocabulary |
| <\|endoftext\|> | Separator between independent documents in training data |
Byte Pair Encoding (BPE)
BPE is the tokenization algorithm used by GPT-2/GPT-3. It iteratively merges the most frequent adjacent token pairs to build a vocabulary, enabling subword tokenisation. Unknown words are broken into known subwords or individual characters — no <|unk|> needed.
GPT-2 BPE vocabulary size: 50,257 tokens.
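The merge loop at the heart of BPE can be sketched on a toy corpus. This is a simplified character-level illustration under stated assumptions: the function names and the word-frequency corpus are invented for the example, and it does not reproduce GPT-2's actual tokenizer or its byte-level details.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties resolve to the pair that was counted first.
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: each word is a tuple of symbols mapped to its frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    merges.append(pair)

print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

Each merge adds one new symbol to the vocabulary. Because single characters always remain available, any unseen word can still be spelled out from known pieces.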
Data Sampling with a Sliding Window
Training samples are created by sliding a window of context_length tokens over the text. Each input is a sequence of tokens; the target is the same window shifted by one token, so the target at each position is the next token in the text (next-token prediction).
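A minimal sketch of the windowing, assuming the text is already tokenized into a list of IDs. The function name and the `stride` parameter (how far the window moves between samples) are assumptions for this example.

```python
def sliding_window_samples(token_ids, context_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    samples = []
    # Stop early enough that the shifted target window still fits.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        samples.append((inputs, targets))
    return samples


token_ids = [10, 11, 12, 13, 14, 15, 16, 17, 18]
samples = sliding_window_samples(token_ids, context_length=4, stride=4)
for x, y in samples:
    print(x, "->", y)
# [10, 11, 12, 13] -> [11, 12, 13, 14]
# [14, 15, 16, 17] -> [15, 16, 17, 18]
```

With a stride equal to the context length the inputs do not overlap; a smaller stride yields more, overlapping samples from the same text.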
Token Embeddings and Positional Embeddings
- Token embedding layer: a learnable lookup table of shape [vocab_size, embed_dim]; converts each token ID to a dense vector
- Positional embedding layer: a learnable lookup table of shape [context_length, embed_dim]; encodes position information (absolute positional embeddings in GPT-2)
- Final input = token embedding + positional embedding
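The two lookups and their sum can be sketched in plain Python. The tiny sizes, the random initialization, and the helper names are assumptions for illustration; in a real model both tables are trained parameters.

```python
import random

random.seed(0)
vocab_size, context_length, embed_dim = 10, 4, 3  # toy sizes, not GPT-2's


def make_table(rows, cols):
    """Randomly initialized lookup table standing in for learned weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


token_table = make_table(vocab_size, embed_dim)         # [vocab_size, embed_dim]
position_table = make_table(context_length, embed_dim)  # [context_length, embed_dim]


def embed(token_ids):
    """Add the token vector and the position vector for every position."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


x = embed([7, 2, 7])
print(len(x), len(x[0]))  # 3 3
```

Token 7 appears at positions 0 and 2, yet its two input vectors differ because different position vectors were added. That is how the model can distinguish word order.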
Chapter 3 — Coding Attention Mechanisms
Motivation: The Problem with RNNs
Encoder–decoder RNNs (pre-transformer) compressed the entire input into a single hidden state vector — a bottleneck causing loss of context in long sequences. Bahdanau attention (2014) let the decoder selectively access all encoder hidden states, solving this for RNNs. Transformers (2017) replaced the sequential RNN entirely with a pure attention mechanism.
Self-Attention
Self-attention allows each position in a sequence to attend to all positions in the same sequence. For each input token, it computes a context vector — an enriched representation that incorporates information from all other tokens.
Three steps:
- Compute attention scores (dot products of query with all keys)
- Normalize with softmax → attention weights (sum to 1, always positive)
- Compute context vector = weighted sum of value vectors
Trainable weight matrices W_Q, W_K, W_V project the input embeddings X into query, key, and value spaces:
- Q = X × W_Q: what the token is looking for
- K = X × W_K: what the token offers for others to find
- V = X × W_V: the actual information the token contributes
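Attention score scaling: divide the scores by sqrt(d_k), the square root of the key dimension, before softmax. Large dot products would otherwise push softmax toward a near one-hot output, where gradients become very small.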
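The three steps, including the sqrt(d_k) scaling, can be sketched without any tensor library. This is an illustrative version: the helper names are assumptions, and identity matrices stand in for trained W_Q, W_K, W_V.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # step 1: query-key dot products
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]  # step 2
    return matmul(weights, v), weights  # step 3: weighted sum of values


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, embedding dim 2
identity = [[1.0, 0.0], [0.0, 1.0]]       # placeholder for trained weights
context, weights = self_attention(x, identity, identity, identity)
print([round(w, 3) for w in weights[0]])
```

Each row of `weights` sums to 1, and each row of `context` is a blend of all three value vectors.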
Causal (Masked) Self-Attention
GPT is autoregressive: when computing the output at position t, the model may attend only to positions up to and including t, never to later tokens. A causal mask sets attention scores for future positions to −∞ before softmax, which drives their weights to zero. This is implemented efficiently with an upper-triangular mask matrix.
Dropout is applied to the attention weights matrix during training.
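A small sketch of the masking step, assuming the raw attention scores are already computed. The function name is an assumption, and dropout is omitted.

```python
import math


def causal_attention_weights(scores):
    """Mask future positions with -inf, then apply softmax row by row."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may only see positions j <= i.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.5, 1.0, 2.0], [1.0, 1.0, 3.0], [0.2, 0.4, 0.6]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first row becomes [1.0, 0.0, 0.0]: the first token can only attend to itself. Masking before softmax means the remaining weights still sum to 1 without renormalizing.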
Multi-Head Attention
Run multiple self-attention operations (“heads”) in parallel, each with its own Q, K, V weights. Outputs are concatenated and projected back to the model dimension. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
GPT-2 (124M): 12 attention heads, embed dim 768, head dim 64.
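The head dimension follows from the other two numbers: 768 / 12 = 64. One common way to implement multiple heads is to slice a full-width projection into per-head chunks and concatenate the results afterwards. The sketch below shows only that bookkeeping, with hypothetical helper names.

```python
def split_heads(vector, num_heads):
    """Slice one embed_dim-sized vector into num_heads chunks of head_dim."""
    assert len(vector) % num_heads == 0, "embed_dim must be divisible by num_heads"
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim : (h + 1) * head_dim] for h in range(num_heads)]


def concat_heads(heads):
    """Concatenate per-head outputs back into one embed_dim-sized vector."""
    return [v for head in heads for v in head]


embed_dim, num_heads = 768, 12
token_vector = [float(i) for i in range(embed_dim)]  # placeholder projection output
heads = split_heads(token_vector, num_heads)
print(len(heads), len(heads[0]))  # 12 64
```

Each chunk is processed by its own attention computation. The concatenated result then passes through the output projection.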
Chapter 4 — Implementing a GPT Model from Scratch
Transformer Block Components
| Component | Purpose |
|---|---|
| Layer Normalisation (Pre-LayerNorm) | Normalises inputs to each sub-layer; stabilises gradients; applied before multi-head attention and FFN |
| Multi-Head Attention | Core self-attention layer |
| Dropout | Regularisation; applied after attention, after FFN |
| Shortcut/Residual Connections | Skip connections from input to output of each sub-layer; prevent vanishing gradients in deep networks |
| Feed-Forward Network (FFN) | Two linear layers with GELU activation; expand to 4× embed dim then contract; applied token-wise |
| GELU activation | Smooth non-linearity used in GPT (vs ReLU); better gradient properties |
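Three of the components in the table can be sketched for a single token vector. This is a simplified illustration: the learnable scale and shift of layer normalisation and the dropout step are left out, GELU uses the common tanh approximation, and the function names are assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    """GELU via the tanh approximation."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v**3)))


def pre_ln_sublayer(x, sublayer):
    """Pre-LayerNorm residual pattern: x + sublayer(LayerNorm(x))."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
# A GELU applied element-wise stands in for a real attention or FFN sub-layer.
out = pre_ln_sublayer(x, lambda h: [gelu(v) for v in h])
print([round(v, 3) for v in normed])
```

Unlike ReLU, GELU is smooth around zero and lets small negative values through slightly, so `gelu(-3.0)` is a small negative number rather than exactly 0.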
Full GPT Architecture
Input tokens
→ Token embedding + Positional embedding
→ [Transformer block] × N_layers
    Pre-LayerNorm
    Multi-Head Attention (causal mask)
    Residual add
    Pre-LayerNorm
    Feed-Forward Network (GELU)
    Residual add
→ Final LayerNorm
→ Linear projection (embed_dim → vocab_size)
→ Logits over vocabulary
GPT-2 (124M) Configuration
| Parameter | Value |
|---|---|
| Vocabulary size | 50,257 |
| Context length | 1,024 tokens |
| Embedding dimension | 768 |
| Number of transformer layers | 12 |
| Number of attention heads | 12 |
| Dropout rate | 0.1 |
| Total parameters | ~124M |
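The ~124M total can be checked with a little arithmetic on the configuration above. The sketch assumes bias terms on every linear layer and an output projection that shares its weights with the token embedding (weight tying). Both are assumptions of this example; without weight tying the count is larger. The dictionary keys are illustrative names.

```python
cfg = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_layers": 12,
}


def count_parameters(cfg):
    d = cfg["emb_dim"]
    embeddings = cfg["vocab_size"] * d + cfg["context_length"] * d
    attention = 4 * (d * d + d)                  # W_Q, W_K, W_V, output projection (+ biases)
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand to 4x, then contract
    layer_norms = 2 * (2 * d)                    # two norms per block, scale + shift each
    block = attention + ffn + layer_norms
    final_norm = 2 * d
    # The output head is assumed to reuse the token embedding matrix (weight tying).
    return embeddings + cfg["n_layers"] * block + final_norm


total = count_parameters(cfg)
print(f"{total:,}")  # 124,439,808
```

Roughly 31% of the parameters sit in the token embedding table alone, which is why tying the output head to it changes the headline figure so much.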
Text Generation
Greedy decoding:
- Pass input token IDs through the model → logits of shape [batch, seq, vocab]
- Take logits for the last position
- Apply softmax → probability distribution over vocabulary
- Select the token with the highest probability (argmax)
- Append to input sequence; repeat
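The loop above can be sketched with a stand-in model. `toy_model` is hypothetical: it simply favors the token ID one greater than the current one, so the output is predictable. Cropping the input to `context_length` is an assumption added here so the input never exceeds the supported window.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(token_ids, vocab_size=5):
    """Hypothetical stand-in for a GPT forward pass; returns logits of shape [seq, vocab]."""
    logits = [[0.0] * vocab_size for _ in token_ids]
    for pos, tok in enumerate(token_ids):
        logits[pos][(tok + 1) % vocab_size] = 5.0
    return logits


def generate_greedy(model, token_ids, max_new_tokens, context_length):
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        logits = model(ids[-context_length:])  # crop to the supported context
        probs = softmax(logits[-1])            # last position only
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        ids.append(next_id)
    return ids


generated = generate_greedy(toy_model, [0, 1], max_new_tokens=4, context_length=3)
print(generated)  # [0, 1, 2, 3, 4, 0]
```

Softmax does not change which index is largest, so greedy decoding would pick the same token from raw logits. The probabilities matter once sampling is introduced.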
Without training, the model generates incoherent text because weights are random.
Chapter 5 — Pretraining on Unlabeled Data
Loss Function: Cross-Entropy
Cross-entropy loss measures how well the model’s predicted probability distribution matches the target tokens. It equals the negative average log probability of the correct next tokens — lower is better. PyTorch: F.cross_entropy(logits, targets).
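The definition can be written out by hand to show what the loss computes. This sketch takes one row of logits per position and one target ID per position. It is a plain-Python illustration of the formula, not a replacement for the PyTorch function.

```python
import math


def cross_entropy(logits, targets):
    """Negative average log probability of the target token at each position."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]  # equals -log(softmax(row)[target])
    return total / len(targets)


uniform = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])     # model has no preference
confident = cross_entropy([[0.0, 0.0, 10.0, 0.0]], [2])  # model favors the target
print(round(uniform, 4), round(confident, 4))
```

A model that spreads probability evenly over 4 tokens scores log(4) ≈ 1.386. By the same reasoning, an untrained model with a 50,257-token vocabulary starts near log(50257) ≈ 10.8.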
Training Loop
Standard deep learning loop: forward pass → compute loss → backward pass (backpropagation) → optimizer step (AdamW), which updates the weights. Evaluating the loss on a held-out validation split alongside the training loss reveals overfitting.
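The shape of the loop can be sketched on a deliberately tiny problem: fitting a single weight w so that w·x ≈ y. Everything here is an assumption made for illustration, including the toy data, the hand-derived gradient that stands in for backpropagation, the hyperparameters, and the hand-written AdamW update.

```python
import math


def adamw_step(w, grad, state, lr=0.1, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar weight."""
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * grad
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * grad * grad
    m_hat = state["m"] / (1 - betas[0] ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    w = w * (1 - lr * weight_decay)                    # decoupled weight decay
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grad_mse(w, data):
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data generated by y = 2x
val = [(4.0, 8.0)]                            # held-out validation example

w, state = 0.0, {"t": 0, "m": 0.0, "v": 0.0}
history = []
for step in range(200):
    grad = grad_mse(w, train)        # forward pass + gradient of the loss
    w = adamw_step(w, grad, state)   # optimizer step updates the weight
    history.append((mse(w, train), mse(w, val)))

print(history[0], history[-1])
```

Recording both losses at every step is what makes overfitting visible: training loss keeps falling while validation loss stalls or rises.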
Decoding Strategies
Beyond greedy decoding, the book covers:
- Temperature scaling — divide logits by temperature T before softmax; T < 1 sharpens distribution (more deterministic); T > 1 flattens it (more random/creative)
- Top-k sampling — restrict sampling to the k most probable tokens; prevents the model from selecting very low-probability tokens
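Both strategies can be combined in one sampling function. This is a sketch with assumed function names and toy logits. Top-k filtering is applied here by setting every logit below the k-th largest to −∞ before the softmax.

```python
import math
import random


def next_token_probs(logits, temperature=1.0, top_k=None):
    """Apply optional top-k filtering, then temperature scaling, then softmax."""
    if top_k is not None:
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [v if v >= threshold else -math.inf for v in logits]
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(probs, rng):
    """Draw one token ID according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [4.0, 3.0, 2.0, 1.0, 0.0]
sharp = next_token_probs(logits, temperature=0.5)  # T < 1: more peaked
flat = next_token_probs(logits, temperature=2.0)   # T > 1: flatter
top2 = next_token_probs(logits, top_k=2)           # only the two best tokens survive

rng = random.Random(0)
token = sample_token(top2, rng)
print(round(sharp[0], 3), round(flat[0], 3), top2)
```

With `top_k=2` the three least likely tokens get probability exactly 0, so they can never be sampled regardless of temperature.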
Loading Pretrained Weights
The book demonstrates loading GPT-2 weights from OpenAI (via HuggingFace), allowing a model implemented from scratch to produce coherent text immediately, confirming the implementation is correct.
Appendix A — Introduction to PyTorch
Covers: tensors, autograd, backpropagation, neural network layers, training loops, model saving/loading. Prerequisite for readers new to PyTorch.
Key Claims
| Claim | Implication |
|---|---|
| Next-token prediction is sufficient to produce emergent multi-task capabilities | LLM generality requires no task-specific supervision during pretraining |
| GPT-2 architecture is fundamentally unchanged in modern LLMs | Mastering the 2019 implementation gives insight into GPT-4, Llama 3, etc. |
| Hands-on implementation from scratch is the best way to understand LLMs | Reading about attention is not the same as understanding it; coding it builds that understanding |
| Loading pretrained weights into a scratch implementation verifies correctness | Implementation validation without training from scratch |
Entities Mentioned
- sebastian-raschka — author; ML researcher and educator
- manning-publications — publisher
- openai — created GPT, GPT-2, GPT-3, ChatGPT, InstructGPT; GPT-2 weights loaded in Ch 5
Related concepts: attention-mechanism | tokenization | word-embeddings | gpt-architecture | transformer-architecture | large-language-models | pretraining