Pretraining, Fine-Tuning, and RLHF

A High School & College Primer on How Modern LLMs Are Built in Three Stages

You've heard that ChatGPT was "trained on the internet" — but what does that actually mean? And what's the difference between a raw model and the polished chatbot you talk to every day? If you've tried to find answers and hit a wall of jargon, this guide is for you.

**TLDR: Pretraining, Fine-Tuning, and RLHF** walks you through the three stages that turn a blank neural network into a working AI assistant. You'll learn how large language model training actually works: starting with next-token prediction on trillions of words, moving through supervised fine-tuning on human-written examples, and finishing with the reinforcement learning from human feedback (RLHF) process that teaches the model to be helpful rather than just fluent. Each stage is explained with concrete numbers, plain language, and honest coverage of what can go wrong: hallucination, sycophancy, and reward hacking all get named and explained.

This guide is written for high school students, college freshmen, and anyone who wants a real mental model of modern AI — not a marketing summary and not a graduate-school textbook. It covers the same concepts taught in AI and machine learning courses, compressed into a focused read you can finish in an afternoon.

No calculus required. No prior AI background assumed. Just clear explanations of ideas that actually matter.

Pick it up, read it once, and walk into your next AI class or conversation ready to participate.

What you'll learn
  • Explain what a large language model is and what 'next-token prediction' actually means
  • Describe what happens during pretraining, including data, compute, and loss
  • Distinguish supervised fine-tuning from pretraining and explain why instruction data matters
  • Understand how RLHF uses a reward model and PPO to align model outputs with human preferences
  • Recognize the limitations and failure modes of each stage (hallucination, reward hacking, sycophancy)
What's inside
  1. What an LLM Actually Is
    Sets up the core object: a neural network that predicts the next token, and why that simple task scales into something that looks like reasoning.
  2. Stage 1: Pretraining on the Internet
    Covers the first and most expensive stage — training on trillions of tokens of text to learn language, facts, and patterns through cross-entropy loss.
  3. Stage 2: Supervised Fine-Tuning
    Explains how a base model becomes an instruction-follower by training on curated prompt-response pairs written by humans.
  4. Stage 3: RLHF and the Reward Model
    Walks through reinforcement learning from human feedback — collecting preference rankings, training a reward model, and optimizing the LLM with PPO.
  5. What Goes Wrong and What Comes Next
    Surveys the known failure modes of each stage (hallucination, sycophancy, reward hacking) and previews newer methods like DPO, RLAIF, and constitutional AI.
Published by Solid State Press
TLDR STUDY GUIDES

Pretraining, Fine-Tuning, and RLHF

A High School & College Primer on How Modern LLMs Are Built in Three Stages
Solid State Press

Who This Book Is For

If you are taking an introductory AI or computer science course, preparing for a college interview that touches on machine learning, or simply trying to understand the technology behind ChatGPT, this book was written for you. It also works for high school students curious about neural network pretraining, explained at a level that does not require calculus.

This guide covers the complete three-stage pipeline for building modern AI assistants: pretraining, in which raw internet text becomes a base model; supervised fine-tuning, which turns that base model into an instruction-follower; and reinforcement learning from human feedback (RLHF), which shapes it into a helpful chatbot. Along the way you will pick up concrete vocabulary (tokens, reward models, policy gradients, alignment) in about 15 focused pages, no filler.

Read straight through once to build the mental map. Then work through the inline examples and attempt the problem set at the end to see how well it stuck.

Contents

  1. What an LLM Actually Is
  2. Stage 1: Pretraining on the Internet
  3. Stage 2: Supervised Fine-Tuning
  4. Stage 3: RLHF and the Reward Model
  5. What Goes Wrong and What Comes Next
Chapter 1

What an LLM Actually Is

At its core, a large language model (LLM) is a program that takes a sequence of text as input and outputs a probability distribution over what word (or word-piece) comes next. That is the whole job. Everything else — the apparent reasoning, the code generation, the ability to answer questions about history — emerges from doing that one job at enormous scale.

To understand how, you need three building blocks: tokens, the prediction task, and the architecture that performs it.

Tokens: the alphabet of an LLM

An LLM does not read text letter by letter or word by word. It reads tokens, which are chunks of text produced by a compression algorithm run on a large body of language. Common words become single tokens ("cat", "run", "the"). Rarer words get split ("unbelievable" might become "un", "believ", "able"). Punctuation and spaces often attach to adjacent tokens. A rough rule of thumb for English: one token is about four characters, so 100 tokens is roughly 75 words. The exact splitting is determined by a tokenizer trained before the model itself.

Why tokens instead of letters? Because predicting the next letter is too fine-grained — the model spends capacity on trivial patterns. Words alone would create a vocabulary too large to handle efficiently. Tokens are a practical middle ground.
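
To make the splitting behavior concrete, here is a minimal sketch in Python. The tiny hand-built vocabulary and the greedy longest-match rule are toy assumptions for illustration, not the algorithm real tokenizers use to learn their vocabulary, but the resulting splits look similar: a common word stays whole, while a rarer word breaks into familiar pieces.

```python
# Toy tokenizer sketch: a tiny hand-built vocabulary and a greedy
# longest-match splitter. Real tokenizers (byte-pair encoding and friends)
# learn their vocabulary from data instead of having it written by hand.

VOCAB = ["believ", "able", "the", "cat", "run", "ing", "un",
         "a", "b", "c", "e", "i", "l", "n", "o", "r", "s", "t", "u", "v", " "]
PIECES = sorted(VOCAB, key=len, reverse=True)   # try longer pieces first

def tokenize(text: str) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary piece that matches."""
    tokens, i = [], 0
    while i < len(text):
        piece = next((p for p in PIECES if text.startswith(p, i)), text[i])
        tokens.append(piece)
        i += len(piece)
    return tokens

print(tokenize("the cat"))        # ['the', ' ', 'cat']        common words stay whole
print(tokenize("unbelievable"))   # ['un', 'believ', 'able']   rarer word gets split
```

A real vocabulary has tens of thousands of pieces and is learned automatically, but the basic lookup-and-split behavior is the same.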

Next-token prediction

Given a sequence of tokens, the model assigns a probability to every token in its vocabulary as the candidate for what comes next. That probability distribution is the model's output. During training, the correct answer is simply the actual next token in the training text. The model is penalized — via a quantity called cross-entropy loss — whenever it assigns low probability to the token that actually appeared. Reducing that penalty over billions of examples forces the model to internalize the statistical regularities of language.
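
As a numerical sketch of that penalty, here is one training example worked by hand in Python. The four-word vocabulary and the raw scores are made up for illustration, not output from a real model; the softmax step is the standard way a language model turns raw scores into a probability distribution, and the cross-entropy loss is just the negative log of the probability given to the token that actually appeared.

```python
import math

# Made-up raw scores for the next token after a prompt like "the cat sat on the".
# A real model produces one score per token in a vocabulary of tens of thousands.
vocab  = ["mat", "dog", "moon", "run"]
scores = [3.2,    0.5,   0.1,   -1.0]

# Softmax: exponentiate each score and normalize so the values sum to 1.
exps  = [math.exp(s) for s in scores]
total = sum(exps)
probs = [e / total for e in exps]

for tok, p in zip(vocab, probs):
    print(f"P(next = {tok!r}) = {p:.3f}")

# Cross-entropy loss for one example: the negative log-probability of the
# token that actually came next in the training text. High probability on the
# right token means a small loss; low probability means a large loss.
actual_next = "mat"
loss = -math.log(probs[vocab.index(actual_next)])
print(f"cross-entropy loss = {loss:.3f}")
```

Training is the process of nudging the model's internal numbers so that this loss, averaged over billions of examples, keeps going down.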

Keep reading

You've read the first half of Chapter 1. The complete book covers five chapters in roughly fifteen pages — readable in one sitting.
