Pretraining (LLMs)

Pretraining is the first and most computationally intensive stage of building an LLM. The model is trained on massive unlabeled text corpora using a self-supervised objective: predicting the next token in a sequence. No human-generated labels are required — the labels are the tokens themselves.


Next-Token Prediction

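The training objective: given a sequence of tokens [t_1, t_2, ..., t_n], predict t_{n+1}. During training this is applied at every position of a sequence at once, so each token is predicted from the tokens that precede it.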

This is a self-supervised task because the labels are derived automatically from the structure of the data: the next token serves as the label, so no annotation effort is needed. The sheer abundance of unlabeled text (much of the public web) is why LLMs can train on trillions of tokens.
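The way labels fall out of the data can be sketched in a few lines of Python. This is a minimal illustration, not a production data loader: the whitespace "tokenizer", the helper name make_training_pairs, and the context_length parameter are all assumptions made for the example.

```python
def make_training_pairs(token_ids, context_length):
    """Slide a window over token_ids; each target row is the input row shifted by one."""
    pairs = []
    for start in range(len(token_ids) - context_length):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


# Toy tokenizer (assumption): whitespace split, vocabulary built from the text itself.
text = "the cat sat on the mat"
words = text.split()
vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}
token_ids = [vocab[w] for w in words]

pairs = make_training_pairs(token_ids, context_length=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Every target row is simply the input row shifted one position to the left, which is why no human annotation is involved.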

Despite its simplicity, next-token prediction yields models with emergent multi-task capabilities such as translation, summarisation, question answering, and code generation, none of which the model was explicitly trained to perform.


Training Data

LLMs require billions to trillions of tokens. GPT-3 was trained on approximately 300 billion tokens from:

  • CommonCrawl (filtered web text)
  • WebText2 (OpenAI’s curated web scrape)
  • Books (Books1, Books2)
  • Wikipedia

Higher-quality sources are sampled more often, relative to their size, than raw web text; in GPT-3’s mix, for example, WebText2, Books1, and Wikipedia were upweighted. Synthetic data is increasingly used to augment scarce high-quality sources.
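One way to picture this weighting is as a sampling distribution over sources. The sketch below is only illustrative: the source names, corpus sizes, and sampling weights are hypothetical values chosen for the example, not figures from any real training run.

```python
import random

# Hypothetical corpus sizes (billions of tokens) and sampling weights, for illustration only.
corpus_size = {"raw_web": 400, "curated": 20}
sampling_weight = {"raw_web": 0.6, "curated": 0.4}


def epochs_per_source(sizes, weights, training_tokens):
    """How many passes over each source a run of `training_tokens` makes."""
    total_weight = sum(weights.values())
    return {
        name: training_tokens * weights[name] / total_weight / sizes[name]
        for name in sizes
    }


def sample_sources(weights, n, seed=0):
    """Draw the source of each of n training documents according to the weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


epochs = epochs_per_source(corpus_size, sampling_weight, training_tokens=300)
draws = sample_sources(sampling_weight, n=10_000)
print(epochs)
print({name: draws.count(name) for name in sampling_weight})
```

With these made-up numbers the small curated source is seen about six times during training, while less than half of the raw web source is ever read, even though most sampled documents still come from the web.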


Loss Function: Cross-Entropy

Cross-entropy loss measures the difference between the model’s predicted probability distribution and the true next token:

loss = -log(P(correct_next_token))

The loss is averaged across all token positions in the batch. Minimising it drives the probability assigned to the correct next token as close to 1 as possible.
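A small worked example may make the formula concrete. This sketch uses plain Python rather than a deep-learning library; the helper names softmax and cross_entropy and the toy four-token vocabulary are assumptions for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (max subtracted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, targets):
    """Mean of -log P(correct token) over all positions."""
    losses = []
    for logits, target in zip(batch_logits, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# One position, vocabulary of 4 tokens, correct token has index 2.
uniform_loss = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])     # model has no preference
confident_loss = cross_entropy([[0.0, 0.0, 10.0, 0.0]], [2])  # model strongly favours index 2
print(uniform_loss, confident_loss)
```

A model that spreads probability uniformly over V tokens scores log(V), which is a useful reference point for the loss of an untrained network; the loss approaches 0 as the correct token's probability approaches 1.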


Training Loop

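import torch.nn.functional as F

# Assumes model, optimizer (e.g. torch.optim.AdamW) and train_loader are already defined.
for input_token_ids, targets in train_loader:
    logits = model(input_token_ids)        # forward pass: (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())  # mean loss over all positions
    loss.backward()                        # backpropagation
    optimizer.step()                       # update weights (AdamW)
    optimizer.zero_grad()                  # clear gradients before the next batch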

Evaluation: A held-out validation split tracks whether the model is generalising or overfitting. Training loss and validation loss are plotted to monitor progress.
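The comparison of the two curves can also be automated. The following is a rough sketch of one possible heuristic, not a standard API: the function names, the patience parameter, and the loss values are all hypothetical.

```python
def best_checkpoint(val_losses):
    """Index of the evaluation step with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


def is_overfitting(train_losses, val_losses, patience=2):
    """True if validation loss has not improved over the last `patience`
    evaluations while training loss kept falling."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    val_stalled = all(v >= best_before for v in val_losses[-patience:])
    train_improving = train_losses[-1] < train_losses[-patience - 1]
    return val_stalled and train_improving


# Made-up loss curves: training loss keeps dropping, validation loss turns upward.
train_losses = [4.0, 3.0, 2.2, 1.5, 0.9]
val_losses = [4.1, 3.3, 3.0, 3.2, 3.5]

print(is_overfitting(train_losses, val_losses))  # True
print(best_checkpoint(val_losses))               # 2
```

A widening gap of this kind is typical when training on very small datasets, where the model starts memorising the training text.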


Foundation Model → Fine-Tuned Model

Pretraining produces a foundation model (base model): it can predict next tokens well but doesn’t follow instructions or behave as an assistant. Subsequent stages adapt it:

  1. Supervised fine-tuning (SFT): train on instruction–response pairs
  2. RLHF / RLVR: align with human preferences (RLHF) or optimise against automatically verifiable rewards (RLVR); see reinforcement-learning-from-human-feedback and rlvr

Fine-tuning is far cheaper than pretraining because it starts from the foundation model’s weights and needs far less data. The knowledge acquired during pretraining is largely preserved; fine-tuning mainly shapes how the model uses it.


Compute Cost

Pretraining a frontier model requires thousands of GPUs running for months, costing millions to hundreds of millions of dollars. Educational implementations (like raschka-2024-build-llm-from-scratch) train on tiny datasets (a few thousand tokens) to demonstrate the mechanics, then load pretrained weights (e.g., GPT-2 from OpenAI) to validate the implementation.
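A back-of-the-envelope estimate shows where such costs come from. The sketch below uses the common approximation that training takes about 6 floating-point operations per parameter per token. The 175-billion parameter count for GPT-3 is not stated above and is added here; the GPU count, per-GPU peak throughput, and utilisation are hypothetical values chosen only to make the arithmetic concrete.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: ~6 FLOPs per parameter per token
    (forward plus backward pass). A rule of thumb, not an exact count."""
    return 6 * n_params * n_tokens


def gpu_days(total_flops, n_gpus, peak_flops_per_gpu, utilization):
    """Wall-clock days on a cluster running at a given fraction of peak throughput."""
    seconds = total_flops / (n_gpus * peak_flops_per_gpu * utilization)
    return seconds / 86_400


# GPT-3 scale: 175e9 parameters, 300e9 training tokens.
flops = training_flops(175e9, 300e9)

# Hypothetical cluster: 1024 GPUs, 312e12 FLOP/s peak each, 40% utilisation.
days = gpu_days(flops, n_gpus=1024, peak_flops_per_gpu=312e12, utilization=0.4)
print(f"{flops:.2e} FLOPs, about {days:.1f} days")
```

Under these assumptions the run needs roughly 3 × 10^23 FLOPs and about four weeks on a thousand GPUs; larger models and larger token budgets scale both figures up multiplicatively.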


Sources: raschka-2024-build-llm-from-scratch | large-language-models