A from-scratch build book
Small Kernels
Learn the Zig programming language and how large language models actually work — by writing every piece yourself, from a single dot product to GPT-2, a modern Gemma, and the doorway to Whisper.
I originally vibe-coded this book to teach myself Zig and LLMs, and prepare for MLE interviews. Like everyone I find the Claudisms off-putting, so my original intention was to rewrite everything by hand before publishing.
Unfortunately, I was hired. And my new employer is more strict about publishing side-projects, so I am publishing this as-is before I start my new job. It is still very helpful, so I recommend you go through the material, eventually building inference engines for popular LLMs and ASR models in Zig.
If you find any errors or ideas for improvement, feel free to reach out.
If you find this book helpful, please donate to Zig!
Read me first: two skills, one journey
You're going to learn two things at once — a programming language called Zig, and how a large language model actually works — and you're going to learn them the only way that really sticks: by building, from nothing, every small piece. We start with a function that multiplies two lists of numbers. We end with your own code generating text from GPT-2, then from a modern Gemma model, with a clear path into speech recognition. No framework does the hard part for you. By the last page you will understand every number that flows through a transformer, because you will have typed the code that moves it.
Do you need to know how to code? No.
This book assumes no prior Zig and no real systems-programming background. If you've written a little code in any language — a loop, a function, an if — you have enough. Part 1 teaches Zig from the first line. If you've never programmed at all, you can still follow along; just go slower through Part 1 and run every example. The math is the same story: you need comfort with high-school algebra (multiplying, summing, a little notation) and a willingness to meet ideas like the dot product and the derivative when they arrive. We build the math up exactly as much as we need and not one symbol more.
Because an LLM is, underneath, a giant pile of numbers and a handful of loops over them — and Zig lets you see those loops with nothing hidden. There's no garbage collector deciding when to pause, no hidden allocations, no operator that secretly calls a library. When you write for (a, b) |x, y| sum += x * y; you are looking at the actual arithmetic the CPU will do. That transparency is a curse for quick scripting and a gift for learning how a model really runs. As a bonus you'll come out knowing a genuinely useful low-level language.
The shape of the journey
The book runs as a single staircase. Each step is small, each step runs, and each step is used by the next. We call the steps kernels — a word with two meanings we'll lean on. To a systems programmer a "kernel" is a tight loop that does one numerical job (a matrix multiply, a softmax). To everyone else it means the core, the seed of a thing. Both are right here: we grow an LLM one small numerical kernel at a time, and each kernel is the seed of the next idea.
How to actually use these pages
- Type the code, don't paste it. The entire value is in your fingers and your debugger. Treat every code block as something to re-implement, not a snippet to copy. The Copy button exists for when you're comparing your version against mine, not for skipping the typing.
- Run everything. After every kernel there's a way to see it work — a test, a printed number, a generated word. Never read past a block you haven't run. A model that "should" work and a model you've watched work are different kinds of knowledge.
- Keep an oracle. From Part 3 on, we check our numbers against a known-correct reference — a few lines of Python with NumPy or PyTorch. When your output disagrees with the oracle, you are wrong until proven otherwise. Appendix D is the recipe book for building these.
- Mark sections complete with the button in each header. Your progress and theme are saved in this file's local storage, so you can close the tab and pick up next week.
- It's okay to be slow. This is a staircase you can climb over months of evenings. Every part ends in something that runs, so you're never far from a satisfying checkpoint.
All code targets Zig 0.16, the line current as this book was written. Zig is pre-1.0 and changes between releases — the standard library and the build.zig format in particular get reworked. If a snippet doesn't compile, the very first suspect is a version mismatch: run zig version, and if you're on something newer, run zig init in an empty folder and compare its generated files against ours. Appendix A notes the spots most likely to drift.
What you'll have built by the end
Ten parts. Each row is a milestone you can run and feel.
comptime, the build system. You can write, build, and test real Zig.@Vector), spread it across threads, and quantize the weights to int8 — then ship it as a real command-line tool.This isn't three unrelated projects. GPT-2 is the simplest real transformer — a decoder that reads text and predicts the next token. Gemma is the same decoder with every part modernized; you'll swap pieces one at a time and watch a 2018 model become a 2025 one. Whisper is that same transformer with a second half bolted on — an encoder that reads audio and a cross-attention bridge — so the decoder predicts text while listening. Learn the decoder deeply and the other two are mostly rewiring. That's why the curriculum is ordered this way: each model adds exactly one new idea to the last.
A program you wrote in Zig — small-llm -m gpt2.safetensors -p "Once upon a time" — that loads real weights and prints fluent continuation text, where you can point at any stage and explain what every number is doing. From there, a clear on-ramp to a modern Gemma model and to speech. Understanding, not magic. That's the deliverable.
Zig from zero
Six chapters that take you from no Zig at all to writing, building, and testing real programs. We learn only the language — the math starts in Part 2 — but every example leans toward the numerical code to come. By the end you'll have the project skeleton the whole book grows inside.
Setup & your first program
~½ dayTwo things to install: the Zig compiler and an editor that understands Zig. Then we write, run, and dissect a program, so that by the end of this short chapter the tooling is out of the way and you can focus on ideas.
1 · Install Zig 0.16
Download the build for your machine from the official downloads page — pick the 0.16 line. Zig ships as a single self-contained folder; there's nothing to "install" in the system sense. Unpack it somewhere stable and put that folder on your PATH.
On Windows, unzip it and add the folder to your PATH through System Settings, or just run it by full path while you're starting out. Avoid Homebrew/apt for now — packaged Zig often lags the fast-moving releases, and version drift is the single most common reason a beginner's code won't compile.
That one folder contains the compiler, the entire standard library (as readable Zig source you're encouraged to open), and a C/C++ compiler too — Zig can build C projects. There is no separate runtime to install and nothing global to configure. To "uninstall" Zig you delete the folder. This simplicity is very on-brand for the language.
2 · An editor that helps
Use any editor, but install ZLS (the Zig Language Server) for autocomplete, jump-to-definition, and inline errors — it makes learning dramatically faster. VS Code with the official "Zig" extension is the easiest start; Neovim, Helix, and others work well too. Point the extension at your zig binary and let it fetch the matching ZLS.
3 · Hello, kernels
Create a file hello.zig and type this exactly:
Run it directly — no separate compile step needed while prototyping:
4 · Dissecting four lines
Every token here is worth understanding, because you'll write these lines a thousand times.
const std = @import("std");—@importis a builtin (all builtins start with@); it pulls in a module and binds it to the constantstd, the standard library.constmeans "this name never changes."pub fn main() void— declares a function namedmain, the program's entry point.pubmakes it visible outside this file (the runtime needs to see it).voidis the return type: it returns nothing.std.debug.print(fmt, args)— prints a formatted string. The first argument is a format string; the second is the arguments to fill it in..{}— this is an empty anonymous struct literal. Zig passes print's arguments as a struct (here, no arguments, so it's empty). You'll see.{ ... }constantly; for now read it as "a little bundle of values."
Inside the format string, {}-style placeholders are filled from the args bundle in order. The common ones: {s} for a string/byte-slice, {d} for an integer or float in decimal, {d:.4} for a float with 4 decimals, {any} for "format this however you can" (great for arrays while debugging), and {c} for a single byte as a character. The \n is a newline. If the number of placeholders doesn't match the number of args, Zig tells you at compile time.
Let's prove the placeholders work. Numbers and a loop — the raw material of everything ahead:
Run it: you should see the count, the sum 2.300, and the array printed out. Don't worry about the exact types yet — []const u8, f32, [_]f32 — Chapters 2 and 3 unpack them. Right now the win is: you can print anything, which means you can see anything, which is how we'll debug every kernel in this book.
std.debug.print writes to the error stream, not standard output, and it's the tool we use for all the "let me see this number" work throughout the book. Writing real program output to stdout in current Zig goes through a buffered writer, which we'll set up properly in Part 6 when we actually emit generated text. Until then, std.debug.print is all you need and it never gets in your way.
Modify numbers.zig to also print the largest value in weights. You'll need a var that you update inside the loop and the builtin @max(a, b). Then change the array's values and confirm your code adapts. Small, but it's the loop-over-an-array pattern that the dot product (Chapter 7) is built from.
zig version prints a 0.16 release, zig run hello.zig greets you, and you can print strings, integers, floats, and arrays. The tools are out of the way. Now the language.
Values, types & control flow
~1 dayZig is statically typed and refreshingly explicit: the type of every value is known at compile time, integer sizes are spelled out, and conversions between number types are something you ask for, never something that happens behind your back. That last rule trips up newcomers for an afternoon and then becomes a feature you'll miss in other languages. We'll meet it head-on.
Numbers, named by their size
There is no plain int. You say exactly what you mean: u8 is an unsigned 8-bit integer (0–255), i32 a signed 32-bit integer, u64 a big unsigned one. usize is an unsigned integer the width of a memory address — it's the type of every length and index, so you'll use it constantly. For real numbers we use f32 (32-bit float) almost everywhere in this book, because that's what neural network math runs in.
const binds a name to a value that never changes; var is a variable you can reassign. Zig nudges you toward const — if you declare a var and never mutate it, the compiler errors and tells you to make it const. This sounds annoying and is actually a quiet superpower: a var in the code is a signal that "this changes," so loops and accumulators stand out at a glance.
The conversion rule you must internalize
Zig will not silently mix number types. You cannot add an i32 to a usize, and you cannot divide an integer by another and get a float, without saying so. The two builtins you'll reach for most:
@floatFromInt(x)— turn an integer into a float.@intFromFloat(x)— turn a float into an integer (truncating).
These are "result-located": Zig figures out the target type from where the value is going, which is why you often wrap them in @as(T, ...) to name that type explicitly. The classic case is computing an average — dividing a sum by a count:
This is the error you'll hit most as a beginner, and now you know the fix: something is an integer that needs to be a float (or vice-versa), and you must convert it explicitly with @floatFromInt / @intFromFloat. It feels like friction at first. But in numerical code, an accidental integer division (where 3/2 becomes 1, not 1.5) is a silent, vicious bug — and Zig makes it impossible to write by accident.
Control flow
if, while, and for work as you'd expect, with one Zig twist: they can be expressions that produce a value. And the for loop is built for exactly the array-walking we'll do endlessly.
for (items, 0..) |v, i| patternZig's for can walk several things in lockstep. for (a, b) |x, y| iterates two slices of equal length together (we'll use this for the dot product); for (a, 0..) |x, i| pairs each element with its index. The captures between |...| are the loop's local names. There's no C-style for(i=0;i<n;i++) — you use a range 0..n or, for index-and-value, the form above.
switch
switch must be exhaustive — you handle every case or add an else. It's also an expression. We'll use it later to dispatch on data types and token kinds.
Functions
You've seen main; here's a function with parameters and a return type. Parameters are immutable inside the function — another nudge toward clarity. This little function, the mean of a slice, is your first real numerical routine.
Notice &xs in the call: xs is a fixed-size array, and we pass a slice of it (a view) with the &. That distinction — arrays versus slices — is important enough that it gets the whole next chapter. You just wrote the same normalization (subtract the mean, work with the spread) that LayerNorm will use in Chapter 10. The kernels really do start this early.
Write fn dot(a: []const f32, b: []const f32) f32 that returns the sum of a[i] * b[i]. Use the two-slice loop for (a, b) |x, y|. Test it: dot(&.{1,2,3}, &.{4,5,6}) should be 32. Congratulations in advance — that function is the single most important kernel in this entire book, and you'll meet it again, properly, in Chapter 7.
You can declare typed values, convert between integers and floats deliberately, drive if/while/for/switch, and write functions over slices of numbers. You've computed a mean, a variance, and (in the exercise) a dot product. That's most of what an LLM does — just repeated a few billion times.
Slices, arrays & pointers
~1–2 daysThis is the most important chapter in Part 1, because every tensor in this book is a slice of floats. A model's weights, the activations flowing through it, the logits at the end — all of them are, in memory, just long runs of f32 values, and a slice is how Zig hands you a run of values to work on. Get arrays, slices, and pointers clear now and the rest of the book is reading and writing slices.
Arrays: fixed-size, known at compile time
An array type is [N]T — N elements of type T, where N is fixed and known when the program is compiled. [_]T{...} lets the compiler count the elements for you.
An array is a value. If you assign it or pass it to a function by value, the whole thing is copied. That's fine for 3 numbers and ruinous for a weight matrix with millions — which is exactly why slices exist.
Slices: a view into memory
A slice, written []T, is two things bundled: a pointer to the first element and a length. It does not own the data — it's a window onto memory that lives somewhere else. Passing a slice copies only the pointer-and-length (16 bytes), never the elements. This is how you hand a million-element weight matrix to a function for free.
A weight tensor is allocated once and lives in one place. Everything that "uses" it — a matrix multiply, an attention head — receives a slice, a cheap view, and reads through it. Nothing gets copied as data flows through the network unless we explicitly ask. When you internalize "slice = borrowed view, not a copy," Zig's memory model clicks, and so does the reason neural-net code can be fast without magic.
const-ness travels with the slice
[]const f32 is a read-only view: you may read elements but not write them. []f32 is writable. Functions advertise their intent through this: a function taking []const f32 promises not to modify your data; one taking []f32 might. You'll see []const f32 for inputs (weights, the row you're reading) and []f32 for outputs (the buffer you're writing into) all through the book.
Writing through a slice in a loop
To modify elements in place, capture each by pointer with |*x| and write through it with x.*. This in-place pattern is how every elementwise kernel (scale, add, activation) is written.
Pointers, briefly
A slice's "pointer" part is a real Zig type you'll occasionally use directly. *T is a pointer to one T; &x takes the address of x; p.* reads or writes through it. You mostly won't need raw pointers — slices cover 95% of cases — but the |*x| capture above is a pointer, so it's worth seeing plainly.
There is no special string type. A text string is []const u8 — a read-only slice of bytes. A literal like "hello" is stored as a constant array of bytes (with a hidden trailing zero for C compatibility) and coerces to []const u8 automatically. This matters later: a tokenizer's vocabulary is just a list of byte-slices, and decoding text is literally concatenating bytes. The thing that feels exotic in other languages is, in Zig, the same slice you already understand.
Matrices: we flatten them
Zig has multi-dimensional arrays, but for performance and simplicity we will not use them for tensors. Instead we store a matrix as one flat []f32 in row-major order — row 0, then row 1, and so on — and we do the index arithmetic ourselves. Element (r, c) of an R×C matrix lives at flat index r * C + c.
Row-major flat storage is what the CPU loves: the elements of a row sit next to each other in memory, so reading a row streams straight through the cache. It's also exactly how PyTorch, NumPy, and the model files on disk lay out tensors, which means loading real weights later becomes a plain copy. The price is that you compute indices like r * cols + c — a small, mechanical habit you'll build in Chapter 9 and never think about again.
The slice expression m[1 * cols ..][0..cols] deserves a glance: m[start..] slices "from start to the end," then [0..cols] takes the first cols of that — a common idiom for "the cols-length window starting at start." You'll see it whenever we grab one row of a matrix.
Write fn getRow(m: []f32, cols: usize, r: usize) []f32 that returns row r of a row-major matrix as a slice. Then write fn rowSum(m: []const f32, cols: usize, r: usize) f32 using it. This row-grabbing is the inner move of matrix multiply; doing it once by hand now will make Chapter 9 feel like review.
Index past the end of a slice and a Debug build will stop with a clear "index out of bounds" panic and a line number. This safety net is on by default and is one of the best reasons to do all your learning and development in Debug mode. It vanishes in ReleaseFast (for speed), which is why we only switch to that for benchmarking, and only after the code is known-correct.
You can tell an array (a value) from a slice (a borrowed view), read and write through slices, pass big data around for free, and address a matrix stored as a flat row-major buffer. Every tensor for the rest of the book is now within reach — we just need a way to allocate these buffers, which is the next chapter and the defining idea of Zig.
Allocators, memory & errors
~2 daysHere is the idea that most defines Zig, and the one that turns out to be perfect for neural networks: there is no hidden heap. Memory is never allocated behind your back. Any function that needs to allocate takes an allocator as an argument, and you decide what kind. This is a little more typing and a lot more control — and for a forward-only model it lets us do something beautiful with scratch memory that other languages make hard.
Getting an allocator
An allocator is a value of type std.mem.Allocator that you pass around. The standard library offers several; for development the general-purpose allocator is ideal because it detects leaks and double-frees and tells you about them.
Three new things appear here, each worth its own look: !void on main, the try, and the two defers.
In the 0.16 line this development allocator is spelled std.heap.GeneralPurposeAllocator(.{}); some builds also expose it as std.heap.DebugAllocator(.{}) — same thing, the leak-checking allocator you want while learning. If one name doesn't resolve, try the other. For throwaway examples you can also grab std.heap.page_allocator directly (no setup, no leak check). We'll use the GPA as our "real" allocator and an arena (below) for hot scratch.
Errors are values, and try propagates them
Zig has no exceptions. A function that can fail returns an error union, written !T — "either a T or an error." Allocation returns ![]f32 because it can fail with error.OutOfMemory. You deal with the error in one of two ways:
try expr— "unwrap the value, or if it's an error, return that error from my function." This is whymainis now!void: it might pass an error up to the runtime, which prints it.expr catch |err| { ... }— handle it right here.catchcan also supply a default:const v = mightFail() catch 0;.
Numerical code is full of preconditions — matching lengths, valid shapes, a model file that isn't corrupt. Error unions make those preconditions explicit and impossible to ignore: you literally can't use a !f32 as an f32 without confronting the error. Combined with try, the happy path stays clean (no nested if (ok) ladders) while every failure still has to be handled somewhere. Your future self, debugging a weight loader at midnight, will be grateful.
defer: cleanup that can't be forgotten
defer schedules a statement to run when the current scope exits — no matter how it exits, including via an error. Pairing every alloc with a defer free right next to it is the habit that keeps memory honest. errdefer is its cousin: it runs only if the scope exits via an error, useful for undoing partial work.
Bulk memory: @memset and @memcpy
Two builtins do the obvious bulk operations fast: @memset(slice, value) fills every element, and @memcpy(dst, src) copies one slice into another of equal length. We use @memset(buf, 0) to zero a fresh buffer and @memcpy to copy rows around.
The arena: the trick that makes forward passes simple
Running a model forward creates a blizzard of temporary buffers — every layer produces intermediate activations we need only until the next layer consumes them. Freeing each one individually is tedious and error-prone. An arena allocator solves this elegantly: it hands out memory from a growing pool and frees the entire pool at once. You never call free on individual allocations; you just reset or deinit the arena when the work is done.
This split organizes the entire inference engine. Long-lived data — the model weights, the KV cache — lives in the general-purpose allocator and persists for the whole run. Scratch data — a layer's intermediate activations — comes from an arena that we reset after each layer or each token, reclaiming all of it in a single cheap operation. No per-buffer bookkeeping in the hot loop, no leaks, no fragmentation. We'll formalize this in Part 6, but the mental model starts now: weights are kept, activations are scratch.
Optionals: a value, or nothing
Sometimes a value is genuinely absent — a layer with no bias, a search that found nothing. Zig models this with optionals: ?T is "a T or null." You can't use it as a T until you've checked, which makes null-pointer mistakes a compile-time impossibility.
If you forget a free, the GeneralPurposeAllocator prints a leak report at deinit, complete with the allocation's stack trace. Treat every leak report as a real bug to fix now, not later. The discipline of "alloc here, defer free on the next line" makes leaks rare, and the arena makes whole categories of them impossible. This safety is a gift — lean on it while learning, before you ever touch ReleaseFast, where the checks are gone.
Write fn alloced_dot(a: std.mem.Allocator, x: []const f32, y: []const f32) !f32 that allocates a temporary buffer of products x[i]*y[i], sums it, frees it (with defer), and returns the sum. It's a deliberately over-engineered dot product — the point is to practice the alloc/try/defer/free rhythm until it's automatic. Then rewrite it to use an arena and notice how the free disappears.
You can allocate and free buffers, propagate and handle errors with try/catch, guarantee cleanup with defer, fill and copy memory in bulk, wield an arena for batch-freed scratch, and represent "maybe a value" with optionals. That is the entire memory toolkit an inference engine needs. Two more chapters of Zig and we start building kernels.
Structs, methods & comptime
~1–2 daysNow we package data. A struct groups related fields under one name, and functions defined inside it become methods. We'll also meet comptime — Zig's headline feature — which runs ordinary Zig code at compile time and is how the language does generics without a separate template language. By the end you'll have written a small Matrix type that is the direct ancestor of the Tensor we build in Chapter 9.
A struct with methods
Fields can have default values. Methods are just functions whose first parameter is the struct itself (by convention named self); you call them with dot syntax. Here's a 2-D matrix backed by the flat row-major slice you learned in Chapter 3:
.{ ... } you keep seeing.{ .field = value } is a struct literal whose type is inferred from context — here Zig knows the function returns Matrix, so it fills in the type. The same syntax with no field names, .{ a, b }, makes an anonymous tuple — which is exactly what you pass to std.debug.print as its argument bundle. One notation, two everyday uses: named fields for structs, positional for tuples.
Enums and tagged unions
An enum is a fixed set of named values — perfect for "which kind of thing is this." We'll use enums for activation types and data formats. A tagged union goes further: a value that is one of several variants, each carrying its own data. You won't need unions much, but enums show up constantly.
comptime: code that runs while compiling
Most languages have a wall between "compile time" and "run time." Zig blurs it. A value or parameter marked comptime is known and computed while the program is being compiled. This powers three things you'll use: compile-time constants, ensuring something is fixed (like a vector width), and — the big one — generics.
In Zig, a generic type is simply a function that takes a type at compile time and returns a type. That's the entire mechanism. std.ArrayList, the growable list we'll use for tokens, is exactly this:
Two payoffs. First, generics with zero runtime cost: Pair(f32) is as efficient as a hand-written float pair, because the compiler generates a specialized version — there's no boxing, no virtual dispatch. Second, specialization: later we can write one matrix-multiply routine generic over a numeric type or a quantization format, and the compiler produces a tight, dedicated loop for each, with sizes baked in as constants. It's the performance of copy-pasted code with the maintainability of one source. We lean on it lightly now and harder in Part 7.
Growable lists with ArrayList
Arrays and slices have fixed length. When you don't know the count ahead of time — the tokens in a sentence, the bytes of generated text — you want a list that grows. std.ArrayList(T) is that, and in the 0.16 line it's "unmanaged": you pass the allocator to each operation, which keeps the allocation strategy explicit and visible.
Create with .empty, then pass the allocator to each method: list.append(a, x), list.deinit(a), list.toOwnedSlice(a). The live elements are in list.items (a slice). This per-call allocator is a 0.15/0.16-era change; older tutorials show ArrayList.init(allocator) with allocator-free methods. If you see that style, you're reading older code — translate it by threading the allocator through, exactly as shown here.
Give the Matrix struct a method pub fn fill(self: Matrix, v: f32) void that sets every element to v, and pub fn sumRow(self: Matrix, r: usize) f32. Then make a generic fn Vec(comptime n: usize) type returning a struct with a fixed array [n]f32 and a dot method — your first taste of sizes-as-compile-time-values, which is how we'll specialize kernels later.
You can define structs with methods, model choices with enums, build generic types with comptime, and grow collections with ArrayList. You've written a Matrix with at and row — keep it; Chapter 9's Tensor is this idea, generalized. One chapter of Zig remains: turning loose files into a real, testable project.
The build system & tests
~½ daySo far we've used zig run file.zig on single files. A real project has many files and wants one command to build it and one to test it. Zig's build system is itself written in Zig — build.zig is a normal program that describes how to build your project. And testing is built into the language: any file can contain test blocks. This chapter sets up the skeleton we grow for the rest of the book, and the testing habit that keeps every kernel honest.
Scaffold the project
Make a folder and let Zig generate a starting point matched to your compiler version:
We'll grow src/ into a handful of focused files — tensor.zig, ops.zig, matmul.zig, and so on — each introduced when we need it. For now, here is a clean build.zig for an executable plus tests, in the 0.16 style. Treat it as a starting point and reconcile it against whatever zig init produced for your exact version.
The build API is one of the fastest-moving parts of Zig; the root_module/createModule shape shown here is current for 0.16 but differs from 0.13 and earlier. If this file doesn't compile, don't fight it — run zig init in a scratch folder and copy its structure, then graft on the run and test steps. The build script is the single place where version drift bites hardest, and "diff against a fresh zig init" is the reliable cure.
Splitting code across files
Each .zig file is a module. You pull one into another with @import("path.zig"), and anything marked pub is visible across the boundary. Here's main.zig using a kernel defined in ops.zig:
Build and run:
Tests live next to the code
A test "name" { ... } block is runnable, checkable documentation that sits in the same file as the code it checks. Inside, std.testing gives you assertions. For floating-point kernels — which is all of them — the essential one is expectApproxEqAbs, because exact equality is the wrong question for floats.
Floating-point arithmetic is not exact: 0.1 + 0.2 is not quite 0.3, and summing numbers in a different order gives slightly different results. So a kernel test never asks "is it exactly right?" — it asks "is it within a small tolerance?" expectApproxEqAbs(expected, actual, tol) with tol around 1e-6 is right for small computations; we'll loosen it to 1e-3 when comparing against an external oracle that uses different math internally. This single assertion is the backbone of every validation step in the book.
Every kernel we build gets a test that pins its behavior. This isn't ceremony — it's how you refactor fearlessly later. When Part 7 replaces a simple loop with a vectorized, multithreaded monster, the test that says "this still equals 32" is what tells you the rewrite is correct. Build the rope before you start the climb, and add to it at every ledge.
Add pub fn axpy(out: []f32, a: f32, x: []const f32) void to ops.zig — it computes out[i] += a * x[i] (the "a-x-plus-y" of classic linear algebra). Write two tests: one numeric, one checking that an all-zero a leaves out unchanged. Run zig build test and watch them pass. You now have the full loop: write a kernel, pin it with a test, build green.
You have a real project: zig build run runs it, zig build test tests it, code is split across modules, and your first kernel (dot) is written, homed in ops.zig, and covered by passing tests. You came in not knowing Zig; you leave able to structure and verify real programs in it. Now we use that to build the mathematics of a neural network — starting, fittingly, with the dot product you keep bumping into.
Kernels
The numerical bedrock. Five chapters that build, test, and explain every arithmetic primitive a transformer needs: the dot product, the neuron, matrix multiply, the activation functions, softmax, and the normalizers. Nothing here is specific to language models yet — these are the universal kernels — but by the end you'll run a small neural network's forward pass entirely in your own Zig.
Vectors & the dot product
~½ dayIf you learn one operation in this book, learn this one. The dot product is the atom of neural networks: a linear layer is dot products, attention scores are dot products, the final word-prediction is dot products. It is also delightfully simple — multiply two lists of numbers elementwise and add up the results — which means once you can compute and interpret it, an enormous amount of the machine stops being mysterious.
The definition
For two vectors a and b of the same length, the dot product is a·b = a₀b₀ + a₁b₁ + … + aₙ₋₁bₙ₋₁. One number out, however long the vectors. You already wrote it:
The std.debug.assert documents and enforces the one precondition — equal lengths — and, like bounds checks, it's active in Debug and gone in release builds.
The dot product measures alignment. Geometrically, a·b = |a| |b| cos θ, where θ is the angle between the vectors. So it's large and positive when the vectors point the same way, near zero when they're perpendicular (unrelated), and negative when they point oppositely. Every time a transformer asks "how relevant is this thing to that thing?", it is, underneath, taking a dot product and reading it as a similarity score. Hold onto that sentence — it's the whole intuition behind attention.
Similarity, made concrete
Normalize the dot product by the vectors' lengths and you get cosine similarity, a pure measure of direction in [-1, 1]. This is exactly how you'd ask "are these two word embeddings similar?" — the kind of thing the model's embedding table (Chapter 19) encodes. Let's build it and feel it:
king and queen point almost the same direction (high similarity); king and taco diverge (low or negative). Real embeddings have hundreds or thousands of dimensions instead of three, but the operation and its meaning are identical. You just did, by hand, the thing that makes "word vectors" famous.
The dot product wears two hats throughout the book, and noticing which one is in play keeps you oriented. As a weighted sum: when a is an input and b is a neuron's learned weights, a·b is "how strongly this neuron responds" — the engine of linear layers (Chapter 8–9). As a similarity: when a and b are two vectors of the same kind, a·b scores how alike they are — the engine of attention (Chapter 20). Same arithmetic, two jobs.
Add a test that cosineSim(v, v) is 1.0 for any non-zero v (a vector is perfectly aligned with itself), and that two perpendicular vectors like {1,0} and {0,1} give 0.0. These invariants are cheap to check and catch a surprising number of indexing mistakes.
Implement the classic "king − man + woman ≈ queen" analogy at tiny scale: make up a few 3-D vectors, compute result = king − man + woman (you'll need a vector subtract and add — reuse addInto from Chapter 3), then find which of your other vectors has the highest cosine similarity to result. It won't be profound with hand-picked numbers, but the code is exactly what runs on real embeddings.
You have dot, norm, and cosineSim, all tested, and — more importantly — you can read a dot product as either a weighted sum or a similarity. That dual intuition is worth more than any single line of code in this book. Next: wrap a dot product in a bias and an activation, and you have a neuron.
The neuron
~½ dayA neuron is almost embarrassingly simple: take a dot product of the inputs with some learned weights, add a bias, and pass the result through an activation function. That's it. A whole neural network is just a great many of these, arranged in layers. In this chapter we build one neuron, then a full layer of them, and meet the reason activations exist at all.
One neuron
Ignore the activation for a second: dot(x, w) + b is the equation of a hyperplane. The weights w are a direction in input space, and the neuron measures how far along that direction the input lies, shifted by the bias b. The activation then bends this: ReLU, for instance, zeroes out everything on one side of the plane. So a neuron carves the input space with a plane and responds to one side of it. Stack enough of these and you can carve out astonishingly complex regions — which is, hand-wavingly, why neural networks can represent complex functions.
A layer is many neurons — i.e., a matrix
A dense (or "linear", or "fully-connected") layer is just n_out neurons, each looking at the same input x with its own row of weights. Stack those rows and you have a weight matrix W of shape [n_out, n_in]: row i is neuron i's weights. Computing the layer is computing one dot product per output — which is exactly a matrix-times-vector product.
We store linear weights as [n_out, n_in] row-major: each row holds all the weights for one output neuron, laid out contiguously. This is the PyTorch nn.Linear convention, which means when we load real GPT-2 and Gemma weights later, the bytes line up and "computing the layer" stays "one dot product per row." It also makes the inner loop cache-friendly — a neuron's weights stream straight through memory. (GPT-2 has one twist on this we'll handle in Chapter 26; note the convention now and that twist will make sense.)
Why activations: without them, depth is a lie
Suppose you stack two linear layers with no activation between them: y = W₂(W₁x). By the rules of matrix multiplication that's just y = (W₂W₁)x — a single linear layer with a combined weight matrix. Ten stacked linear layers collapse into one. Depth buys you nothing unless you put a nonlinear function between the layers. The activation is what breaks the collapse and lets each layer build on the last. That's the entire reason ReLU, GELU, and friends exist.
Notice linearVec itself is purely linear — the activation is applied separately, after. That separation matches how real models are structured (a "Linear" module and an "activation" are distinct steps) and keeps each kernel doing one job, which makes them easy to test and reuse.
The comment above computes the expected outputs by hand: {-0.5, 4.0}. Turn that into a test with expectApproxEqAbs for each element. Hand-computing one small case and asserting it is the most reliable way to know a kernel is right — and it makes the layout convention (rows = neurons) concrete in your fingers.
Write fn reluInto(xs: []f32) void that applies ReLU in place (reuse the |*x| pattern from Chapter 3), then run a layer followed by reluInto and confirm neuron 0's -0.5 becomes 0. You've now built the two halves of every dense layer — the linear part and the nonlinearity — as separate, testable kernels.
You can compute a single neuron and a full dense layer over a vector, you store weights in the [n_out, n_in] convention that real models use, and you understand why a nonlinearity must sit between layers. The only thing standing between you and arbitrary networks is doing this for many inputs at once — the general matrix multiply, next.
Matrices & matmul
~1–2 daysSo far a layer processes one input vector. But a transformer processes a whole sequence at once — many token vectors stacked into a matrix. Applying a linear layer to all of them is a matrix–matrix multiply, the single most important and most expensive operation in the entire model. In this chapter we give our data a proper home — the Tensor type — and write the matmul that everything else stands on.
The Tensor type
A tensor is nothing fancy: a flat f32 buffer plus a shape. We allow up to four dimensions (enough for everything we'll do), store them outermost-first, and keep the data row-major and contiguous. This is the same design used by serious implementations — simple on purpose.
data: undefined is okay hereundefined means "this memory is intentionally uninitialized — don't read it yet." We set data two lines later with the real allocation, so the placeholder is never observed. In Debug builds Zig even fills undefined memory with the byte 0xAA so that if you do accidentally read it, the garbage is recognizable rather than a sneaky zero. Use undefined deliberately for "about to be filled," never as a lazy default.
Matrix multiply, as a stack of dot products
Here's the operation. We have an input x of shape [M, K] — M token vectors, each of width K — and a weight matrix W of shape [N, K] in our "row = neuron" convention. The output y is [M, N]: for every input row and every neuron, one dot product. This is the batched version of Chapter 8's linearVec.
The rule for any matmul is "inner dimensions must match, outer dimensions survive." Here x is [M, K] and we treat W as if transposed to [K, N], so the shared K cancels and the result is [M, N]. Because we physically store W as [N, K] (rows = neurons), we never actually transpose anything — we just iterate weight rows. Keeping M (tokens), N (output features), and K (input features) straight in your head is most of what it takes to wire a transformer correctly.
How much work is this, really?
Each output element costs K multiply-adds, and there are M × N outputs, so a single linear layer is about 2 · M · N · K floating-point operations. For GPT-2's feed-forward layer on a 100-token prompt that's already hundreds of millions of operations — per layer, times twelve layers, times every token you generate. This one kernel will dominate the model's entire runtime, which is why Part 7 (and the companion Whisper book) pour all their optimization energy here. For now, correct-and-simple beats fast; we'll earn the speed later.
From here on, the fastest way to trust a kernel is to compare it against NumPy. Three lines — x @ W.T + b — give you ground truth for linear. Print your Zig output and the NumPy output for the same inputs and eyeball them, or dump NumPy's result to a file and load it in a test. Appendix D shows the full workflow; adopting it now, on a kernel simple enough to also check by hand, means you'll trust it when the kernels get too big to check by hand.
Write fn matmulNaive(y, a, b, M, N, K) that computes a plain matrix product A[M,K] · B[K,N] (B stored normally, not transposed) with three nested loops. Then convince yourself that our linear is the same thing with B = Wᵀ folded in. Understanding both forms — and why we prefer the transposed-weight one for cache locality — pays off when you read other people's model code.
You have a Tensor type and a tested linear kernel — the workhorse of every model in this book. You can push a batch of token vectors through a dense layer and you know, roughly, how expensive it is. Next we add the functions that go between linear layers: activations, softmax, and the normalizers.
Activations, softmax & norms
~2 daysBetween the linear layers sit the functions that give a network its shape: nonlinear activations, the softmax that turns scores into probabilities, and the normalizers that keep numbers in a sane range so deep stacks stay stable. These are small kernels with outsized importance — and softmax hides a numerical trap that has bitten everyone at least once.
Activations
Three you'll meet. ReLU (max with zero) is the simple classic. GELU is a smooth version that GPT-2 uses — and there are two formulas for it that differ slightly, which matters when matching a reference. SiLU (also called Swish) is what modern models' gated MLPs use.
There's an "exact" GELU defined with the error function erf, and the tanh approximation above. They differ by about 1e-3 — invisible to the final text, but enough to fail a tight numerical test against a reference that used the other one. GPT-2's and Gemma's configs both ask for the tanh form, so that's our default. The lesson generalizes: when matching an oracle, match its exact formula, not a mathematically-equivalent-looking one.
Softmax — and the overflow that ruins it
Softmax turns a vector of arbitrary scores ("logits") into a probability distribution: every output is positive and they sum to 1. The definition is softmax(x)ᵢ = exp(xᵢ) / Σ exp(xⱼ). Written naively, that exp is a landmine: if any logit is, say, 100, then exp(100) overflows f32 to infinity and the whole thing becomes NaN. The fix is a classic trick: subtract the maximum logit first. It changes nothing mathematically (the same constant cancels top and bottom) but keeps every exp argument ≤ 0, so the largest term is exp(0)=1 and nothing overflows.
exp(xᵢ - m) / Σ exp(xⱼ - m) = (exp(xᵢ)/exp(m)) / (Σ exp(xⱼ)/exp(m)) — the exp(m) cancels top and bottom, leaving the original softmax. So shifting by any constant gives the same result; shifting by the max is the choice that makes the biggest exponent zero, guaranteeing no overflow and minimal rounding error. This exact kernel runs inside every attention head, so its stability is not optional.
Normalizers: LayerNorm and RMSNorm
Deep networks need their activations kept on a stable scale, or the numbers drift and training/inference degrade. Two normalizers do this, each applied per row (per token vector). LayerNorm (GPT-2) centers and scales a row to mean 0, variance 1, then applies a learned scale γ and shift β. RMSNorm (Gemma, Llama) is a cheaper cousin that skips the centering and divides by the root-mean-square, with just a learned scale.
Both normalizers answer "rescale this vector to a consistent magnitude so the next layer sees well-behaved numbers." LayerNorm subtracts the mean (centering) and divides by the standard deviation; RMSNorm skips the mean entirely and divides by the root-mean-square. RMSNorm turns out to work just as well in practice while doing less arithmetic and storing fewer parameters, which is why the field drifted from LayerNorm (GPT-2, 2019) to RMSNorm (Llama and Gemma, 2023+). Swapping one for the other is one of the concrete steps in Part 7 — and you'll have both kernels ready.
LayerNorm and RMSNorm are exactly where subtle bugs hide — a wrong epsilon, dividing by n-1 instead of n, forgetting the learned scale. Compare a few rows against torch.nn.LayerNorm and torch.nn.RMSNorm (or a two-line NumPy version) at tolerance 1e-5. Getting the normalizer subtly wrong produces a model that's "almost right" — fluent but slightly off — which is the most maddening bug class there is. Pin it now.
Write a temperature variant of softmax: divide every logit by a constant t > 0 before the softmax. Confirm that t → 0 makes the distribution spiky (all mass on the max) and large t makes it flat (uniform). You've just built the temperature knob that controls how "creative" generated text is — we'll wire it into sampling in Chapter 19.
Your kernel library is complete: dot, linear/matmul, ReLU/GELU/SiLU, a stable softmax, LayerNorm, and RMSNorm — every one tested. There is no arithmetic in a transformer that you can't now perform. Time to assemble these pieces into an actual neural network and run a forward pass end to end.
A tiny MLP
~1 dayTime to assemble. A multi-layer perceptron — the simplest "deep" network — is just two linear layers with an activation between them. We'll build one as a struct, fill it with random weights, and run a forward pass end to end. The output will be meaningless (random weights compute random functions), but the machine will run: data will flow in, get transformed layer by layer, and come out the other side. That working pipeline is the Part 2 prize.
The MLP, as a struct
It owns four buffers: two weight matrices and two bias vectors, in the [n_out, n_in] convention. We initialize them with small random numbers — the standard way to start an untrained network.
Notice forward allocates the hidden buffer and the output from whatever allocator you hand it. Pass it an arena (Chapter 4) and all those temporaries vanish with one reset when you're done with the result — no per-buffer frees in sight. This is the exact pattern the real inference loop uses: weights live in the long-lived allocator, every activation comes from scratch. You're already writing inference-engine-shaped code.
Run it
Run it and you'll see three probabilities that sum to 1. They're near-uniform and meaningless — the network has learned nothing — but every kernel you built just ran in sequence: two matrix multiplies, a GELU, and a softmax, with data threaded cleanly through. That is a neural network forward pass. A GPT is this same shape, only the layers are fancier and there are more of them.
What you built is inference: given weights, compute outputs. It's the entire second half of this book (Parts 5–8 are elaborate forward passes). The missing half is where the weights come from — random noise is useless; real weights encode knowledge. There are two ways to get good weights: train them (Part 3 — you'll do it for a small net) or load them from a model someone already trained (Part 6 onward — how you'll run GPT-2 and Gemma). Both start from the forward pass you just wrote.
Generalize forward to process a batch of M input vectors at once: take x as [M, in] and produce [M, out] by passing M as the first argument to linear instead of 1. (You'll need to apply GELU across the whole hidden buffer, which geluInto already does.) Batching is how you'd score many examples — or many tokens — in one call, and it's how the transformer processes a sequence.
A neural network's forward pass runs entirely in your Zig, built from kernels you wrote and tested: dot, matmul, activations, softmax, norms, assembled into an MLP. You've proven you can run a network. Part 3 answers the question that's been hanging over the random weights — where do good weights come from? — by teaching the network to learn.
How machines learn
A short, self-contained detour to answer the question Part 2 left open: where do good weights come from? We'll fit a line by gradient descent, meet the cross-entropy loss that every language model is trained with, build a tiny automatic-differentiation engine, and use it to train a real (if small) network. You don't strictly need any of this to run GPT-2 — but after it, the pretrained weights you load later will never feel like magic numbers again.
Gradient descent
~1 dayLearning is just this loop: make a prediction, measure how wrong it is with a single number called the loss, figure out which way to nudge each weight to make that number smaller, and take a small step in that direction. Repeat thousands of times and the weights drift to values that make good predictions. We'll do it on the simplest possible model — fitting a straight line — where you can watch every number and even check the math by hand.
The setup: fit a line
We have data points that roughly follow y = 2x + 1, and a model ŷ = w·x + b with two unknowns. We want the w and b that fit best. "Best" means smallest mean squared error: the average of (ŷ − y)² across the data.
The loss is L = (1/n) Σ (w·xᵢ + b − yᵢ)². Its gradient is the pair of partial derivatives telling us how L changes as we wiggle w or b. Calculus (the chain rule on a square) gives: ∂L/∂w = (2/n) Σ (ŷᵢ − yᵢ)·xᵢ and ∂L/∂b = (2/n) Σ (ŷᵢ − yᵢ). Read them intuitively: the error (ŷ − y) drives both; for w it's weighted by the input x (inputs that were large share more blame), for b it's just the average error. The gradient points uphill, so to descend we step the opposite way.
The training loop
Run it and watch the loss fall toward zero while w climbs to ≈2 and b to ≈1. Nobody told the program the answer; it found it, one small downhill step at a time. That is the entire principle behind training a 100-billion-parameter model — the same loop, with billions of weights and a fancier prediction in place of w·x + b.
Too small and learning crawls; too large and the steps overshoot the valley and the loss explodes to infinity or NaN. Try lr = 0.5 in the code above and watch it blow up; try 0.001 and watch it barely move. Picking the learning rate is the most consequential knob in all of training, and "the loss went to NaN" almost always means it was too high. We won't tune it much (we mostly load pretrained weights), but you should feel its effect once, firsthand.
You can verify a hand-derived gradient without trusting your calculus: nudge a weight by a tiny h, recompute the loss, and (L(w+h) − L(w−h)) / (2h) should match your analytic ∂L/∂w. This "finite-difference gradient check" is the single most useful debugging tool in all of machine learning, and it's how we'll validate the autograd engine in Chapter 14. Add it here on the line-fitting loss and confirm the two agree to a few decimals.
Add a bit of noise to the targets (e.g. ys[i] += (rand−0.5)) and confirm the model still recovers w≈2, b≈1 — learning is robust to noise. Then add a second feature so the model is ŷ = w₁x₁ + w₂x₂ + b and derive/implement its two weight gradients. You've just generalized from a line to a plane — and a linear layer is exactly this, with thousands of weights.
You've watched a model learn: loss down, weights converging to the truth, driven only by gradients and small steps. You can derive a simple gradient and check it numerically. Next we swap the regression loss for the one language models actually use — cross-entropy — and learn to predict categories instead of numbers.
Cross-entropy & classification
~1 dayPredicting a number (regression) and predicting a category (classification) need different losses. A language model is a classifier: at each step it picks the next token out of tens of thousands of possibilities. The right loss for that is cross-entropy, and it has a gradient so clean it feels like a gift. This is the most directly LLM-relevant chapter in Part 3 — cross-entropy is precisely what GPT-2 was trained to minimize.
From logits to a loss
The model emits a vector of raw scores — logits — one per class. Softmax turns them into probabilities. The cross-entropy loss is simply the negative log-probability the model assigned to the correct class: L = −log(p_correct). If the model was confident and right, p_correct is near 1 and the loss near 0. If it was confidently wrong, p_correct is near 0 and −log of that is huge. The loss punishes confident mistakes brutally and rewards confident correctness — exactly the pressure you want.
p − yHere's the gift. Differentiate cross-entropy-after-softmax with respect to the logits and almost everything cancels: ∂L/∂logitᵢ = pᵢ − yᵢ, where p is the softmax output and y is the one-hot target (1 at the correct class, 0 elsewhere). In words: the gradient on each logit is "the probability you gave it, minus whether it was actually correct." Too much probability on a wrong class → positive gradient → push it down. Too little on the right class → negative gradient → push it up. No other loss/activation pairing is this tidy, which is a big reason softmax+cross-entropy is universal.
Train a classifier with it
A linear classifier predicts logits = W·x + b with one weight row per class. With the gradient above, the per-class weight update is ∂L/∂(row c) = (pᶜ − yᶜ)·x — the same outer-product shape as the line-fitting gradient, just vectorized. Here's the loop on a tiny two-feature, two-class dataset:
Replace "2 classes" with "50,257 classes" (GPT-2's vocabulary) and "the input point" with "the model's summary of the text so far," and you have exactly how a language model is trained. The next token is the correct class; the logits come from the transformer; the loss is this same cross-entropy; the gradient is this same p − y. Everything in Parts 5–7 is machinery for producing better logits — the loss at the very end is the humble function you just wrote. Training GPT-2 was running this loop over a third of a trillion tokens of internet text.
Language-model quality is often reported as perplexity, which is just exp(cross-entropy). A perplexity of 20 means the model is, on average, as unsure as if it were choosing uniformly among 20 words. Lower is better. It's the same loss you computed, exponentiated into a more intuitive "effective number of choices." Print @exp(total/4) in the loop above to watch perplexity collapse toward 1 as the classifier becomes certain.
Add a third class and a few more points, and confirm the loop still drives the loss down. Then feed it points that are not linearly separable (e.g. an XOR layout) and watch the linear classifier fail — it can't draw a straight boundary. That failure is the motivation for hidden layers and the autograd engine that can train them, which is the next two chapters.
You can compute cross-entropy stably, you know its gradient is the elegant p − y, and you've trained a classifier with it — the very loss and gradient that train every LLM. But hand-deriving gradients stops scaling the moment networks get deep. The fix is to make the computer derive them for us: automatic differentiation, next.
Autograd from scratch
~2–3 daysDeriving gradients by hand worked for a line and a linear classifier. For a deep network with thousands of weights it's hopeless. The breakthrough that made modern deep learning practical is automatic differentiation: you write only the forward computation, and the machine figures out every gradient by mechanically applying the chain rule backward through the operations. We'll build a tiny but real autograd engine — the same idea as PyTorch's, in a couple hundred lines of Zig.
The idea: record a graph, replay it backward
Every arithmetic operation becomes a node in a graph that remembers its inputs and the local rule for its own derivative. Running the forward computation builds this graph. Then backpropagation walks the graph in reverse: starting from the loss with a seed gradient of 1, each node hands its inputs their share of the gradient using the chain rule. Because a node is always created after its inputs, reverse creation order is a valid order to process them.
The graph owns its nodes
We keep all nodes in an arena and a creation-ordered list. Building an expression appends nodes; backward zeroes every grad, seeds the output with 1, and replays in reverse.
PyTorch, JAX, TensorFlow — under the marketing, each is a more elaborate version of exactly this: a way to record the operations of a forward pass and replay them backward to get gradients. Add more operation types (each with its local derivative), make the nodes hold tensors instead of single floats, run it on a GPU, and you have a real framework. The conceptual core is the ~80 lines above. Once you've built it, "autograd" stops being a black box and becomes a thing you understand completely.
Prove it works: chain rule, then a gradient check
Build a small expression, backpropagate, and read the gradients straight off the nodes:
The forward values are easy to trust; the backward pass is where bugs hide. Verify it the way every framework's test suite does — against finite differences. For each input, nudge its data by a tiny h, rebuild and recompute the output, and confirm (out₊ − out₋)/(2h) matches the .grad the engine produced, to ~1e-3. If they agree across a few random expressions, your chain-rule wiring is correct. This single test is worth more than any amount of staring at the code.
Gradients accumulate with += — that's deliberate, so a value used in several places sums its contributions correctly. But it means you must reset every node's grad to 0 before each new backward, which our backward does at the top. Forgetting this is the most common autograd bug (it's why PyTorch users call optimizer.zero_grad() every step). Leave grads un-zeroed and your updates blend stale gradients from previous steps, and training quietly degrades.
Add a pow or an exp node: store any extra data it needs, compute the forward value, and add its local-derivative case to backwardLocal (for exp, the derivative is the output itself; for xⁿ, it's n·xⁿ⁻¹). Gradient-check it. With exp and log added you could even route the Chapter 13 cross-entropy through the engine — automatic gradients for the LLM loss itself.
You built a working automatic-differentiation engine — record forward, replay backward — and gradient-checked it. The thing that powers every deep-learning framework is no longer mysterious; it's code you wrote. One chapter left in this detour: point the engine at a real network and watch it learn something a linear model can't.
Train a tiny network
~1–2 daysFinal stop on the learning detour. We'll train a real network — small, but with a hidden layer — on the XOR problem: output 1 when exactly one input is 1, else 0. XOR is famous because no straight line separates its classes, so a linear model (Chapter 13) cannot solve it. A network with one hidden layer can, and we'll watch it discover the solution using nothing but the autograd engine you built.
Weights live outside the graph
The key structure: the weights are persistent plain f32 arrays that survive the whole run, while the computation graph is rebuilt every step and reset. Each step we wrap the current weights as leaf Values, build the forward pass and loss, backpropagate, then copy each leaf's gradient back out to update the persistent weight. That separation — persistent parameters, disposable graph — is exactly how PyTorch's Parameter and dynamic graph relate.
Run it. The loss starts near 1 and falls toward zero over a few thousand epochs, and the final predictions snap to roughly 0, 1, 1, 0. The network found a nonlinear boundary that a line never could — and it did so entirely through gradients your engine computed automatically. There is no XOR-specific code anywhere; you specified only the forward computation and the loss.
Step back and notice you've built every conceptual piece of how models are trained: a forward pass, a loss, automatic gradients, and an update rule. Scale this up — millions of weights, a transformer in place of the 2-2-1 net, cross-entropy over a vocabulary in place of XOR's squared error, and a smarter update rule than plain gradient descent (Adam) — and you have the recipe that produced GPT-2 and Gemma. The recipe is the same. Only the scale and the engineering differ.
You trained a neural network from scratch in Zig: it learned XOR, a problem provably beyond linear models, using your own autograd. The detour is complete and it paid off — when you load GPT-2's pretrained weights in Part 6, you'll know exactly what they are: the end state of this same loop, run at an unimaginable scale. The weights will be given, but they will never again be magic.
A note on the road ahead: from here we return to the inference path and stay there. We will not train the big models — that needs hardware and time we're not assuming you have, and the pretrained weights are freely available. Everything from Part 4 onward is about building the forward pass of real language models and running them. You now understand both halves; the rest of the book builds the impressive one.
Language from scratch
Now we point the machinery at language. Four chapters that build the ideas every text model shares — what a language model even is, the humblest one (the bigram), the byte-pair tokenizer GPT-2 actually uses, and the embedding-and-sampling toolkit that the transformer plugs into. By the end you'll be generating text from a model you trained, and holding the exact tokenizer GPT-2 reads.
The next-token game
~½ dayStrip away the hype and a language model plays one game: given some text so far, guess what comes next. Formally, it models the probability of the next token given the previous ones, P(next | context). That's it. Generation is just playing the game on a loop — predict the next token, append it, predict again. Everything else in this book is about making that single prediction better. We start by deciding what a "token" is.
Tokens: chunks of text the model sees
A model doesn't see letters or words; it sees token IDs — integers. A tokenizer defines the mapping between text and IDs. The simplest possible choice, and the one we'll use to learn, is character-level: each distinct character is a token. Real models use a cleverer scheme (byte-pair encoding, Chapter 18), but characters make every idea visible with zero machinery.
Character vocab: tiny (dozens of tokens), but the model must learn long-range spelling and the sequences are long. Word vocab: short sequences, but the vocabulary is huge and can't handle unseen words. Real models pick a sweet spot in between — subword pieces — giving a vocabulary of ~50,000 (GPT-2) to ~256,000 (Gemma) tokens. The vocab size is literally the number of output classes the model predicts over, so it sets the width of the final layer and a chunk of the model's parameters. We'll feel this tradeoff concretely in Chapter 18.
The data is the text, shifted by one
Here's the quietly profound part: training data for "predict the next token" is free. Take any text, tokenize it, and every position is a training example — the input is the tokens up to here, the target is the very next token. The whole of "hello" gives you h→e, he→l, hel→l, hell→o. No human labeling needed; the text labels itself. That's why language models can train on essentially the entire internet — every sentence is its own answer key.
Write a function that, given encoded ids, prints every (context, target) pair for a fixed context length of 1 — i.e. every adjacent pair (ids[i], ids[i+1]). Then count, for the corpus "hello world", how many times each (current char → next char) pair occurs. You're one step from a working language model: those counts are a bigram model, which is the next chapter.
You can turn text into token IDs and back, and you understand the next-token game and why its training data is self-labeling. You've reframed "language model" from something mystical into something concrete: a next-token probability, played on a loop. Now let's build the simplest model that plays it.
The bigram model
~1 dayThe simplest language model that actually generates text is the bigram: predict the next token from the current one alone, ignoring everything before it. It's far too forgetful to be good, but it's a complete, working language model you can build in an afternoon — and it teaches sampling, the generation loop, and (in its neural form) the bridge from Part 3's training to language. We'll build it two ways: by counting, and by learning.
Version 1: just count
Walk the text and tally how often each token follows each other token into a V×V table. Normalize each row to probabilities and you have P(next | current). No training, no gradients — pure bookkeeping.
Train it on a paragraph of English and sample: you'll get pronounceable nonsense — "thequsomed ther" — because the model knows which letters tend to follow which, but nothing about words or meaning. That's the bigram's ceiling, and seeing it makes vivid why we need models that look back further than one token.
Notice we sample from the distribution rather than always taking the most likely next token. If a bigram model always took the argmax it would loop forever on the single most common transition ("e→ ", " →t", "t→h"...). Sampling — rolling a weighted die over the probabilities — is what gives generation its variety. This same choice, sample vs. take-the-max, is the heart of the decoding strategies (temperature, top-k, top-p) we build in Chapter 19. The bigram is where you first feel why randomness is a feature, not a bug.
Version 2: learn the same thing with gradients
Here's the bridge to everything ahead. Instead of counting, give the model a V×V table of learnable numbers — logits — where row cur holds the scores for each possible next token. Predict with softmax, score with cross-entropy, and improve with the p − y gradient from Chapter 13. This "neural bigram" learns essentially the same distribution as counting, but by gradient descent — and that row of logits indexed by the current token is an embedding table, the very first component of a real transformer.
After training, take softmax(W[cur]) as the next-token distribution and sample exactly as before. Compare its output to the count-based model's: nearly identical, because for a bigram both methods are estimating the same conditional probabilities. The difference is that the gradient version generalizes — swap the lookup row for "a transformer's output given all previous tokens" and the very same loss and update train GPT-2.
In the neural bigram, "row cur of W" is selected by the current token id — a pure lookup, no matrix multiply. That's exactly what a token-embedding table is: a [vocab, dim] matrix where token i grabs row i as its vector. The bigram's twist is that its "embedding" is already the logits. Real models embed into a rich hidden space, transform it through many layers, and only then project to logits — but the first move, "token id indexes a row," is identical. You've just built the embedding lookup without realizing it.
Extend to a trigram: predict from the previous two tokens by indexing a V×V×V table (or, for the neural version, a table indexed by a pair). Watch the samples get noticeably more word-like — and watch the table size explode as Vⁿ. That explosion is precisely why we don't just use bigger n-grams; instead, the transformer learns to summarize arbitrary-length context into a fixed-size vector. The n-gram's failure is the transformer's reason to exist.
You built a real language model two ways, generated text from it, and saw the bigram's one-token memory as the limitation that motivates everything next. You also met the embedding table in disguise and reused the exact cross-entropy training from Part 3 on language. Before we give the model real memory with attention, we need the tokenizer real models actually use — byte-pair encoding.
Byte-pair encoding (the GPT-2 tokenizer)
~3–5 daysCharacter tokens make sequences too long; word tokens make the vocabulary impossibly large and choke on unseen words. Byte-pair encoding (BPE) threads the needle: it learns a vocabulary of frequent subword pieces, so common words become single tokens while rare ones split into parts — and because it works at the byte level, it can represent literally any input. This is the exact tokenizer GPT-2 uses, and getting it right is what lets us feed real prompts to a real model in Part 6.
How BPE builds its vocabulary
BPE starts with a base alphabet (for GPT-2, the 256 possible bytes) and repeatedly finds the most frequent adjacent pair of pieces in a big corpus and merges it into a new piece. Do this 50,000 times and you have a vocabulary where " the" is one token but a typo is a handful of bytes. The training produces an ordered list of merges; the order is the priority (rank) used at encode time. We don't retrain BPE — GPT-2 ships its learned merges — we just apply them.
A GPT-2 tokenizer is two data files. vocab.json maps each token string to its integer id (50,257 of them). merges.txt lists the learned merges in priority order — line n is the n-th most important merge, e.g. "t h" early on. Encoding text means: break it into bytes, then repeatedly apply the highest-priority merge that still applies until no more do. Decoding is the easy direction — just look up each id's bytes and concatenate. The whole tokenizer is those two tables plus the merge loop.
The byte-level trick
GPT-2's BPE operates on text after mapping every raw byte to a "safe" printable Unicode character — so that spaces, newlines, and control bytes have visible stand-ins and never break the vocabulary's string handling. It's a reversible relabeling of the 256 byte values. You apply it before encoding and reverse it after decoding.
The practical upshot, and a gift from the people who convert these models: most model files store each token's already-decoded raw bytes, so you can skip the unicode dance on the decode side entirely and just concatenate bytes. We'll lean on that when loading GPT-2 in Part 6.
Encoding: greedily apply the best merge
This is the heart of the tokenizer. Start with the text as a list of single-byte pieces, then loop: find the adjacent pair with the best (lowest) merge rank, fuse it, and repeat until no adjacent pair is in the merge table. The remaining pieces map to token ids.
Decoding: concatenate bytes, then read UTF-8
Going from ids back to text is delightfully simple — append each token's raw bytes into one buffer and interpret the whole thing as UTF-8 at the end.
A single character — an emoji, an accented letter, a CJK glyph — is often split across several BPE tokens, so one token's bytes can be an incomplete UTF-8 sequence. If you try to print each token as text the moment it's generated, anything non-ASCII turns to mojibake. Accumulate all the bytes first, then decode. (When streaming output to a user, buffer until you have a complete UTF-8 boundary.) This bites everyone once; now it won't bite you.
Before BPE, GPT-2 splits text with a specific regular expression that keeps leading spaces attached to words and isolates punctuation and digits. Encode without it and your token ids won't match GPT-2's for the same text — close, but off, which corrupts results subtly. The exact pattern is in Appendix C. Implementing full regex is beyond our scope, so the pragmatic move (Part 6) is to either port the pattern or load a tokenizer file that bakes it in. For learning the mechanism, the merge loop above is the part that matters.
Two checks make a tokenizer trustworthy. First, round-trip: decode(encode(s)) == s for any ASCII string — if this fails, your byte handling is wrong. Second, parity: your ids for a sentence must match a reference tokenizer's ids exactly. Dump a few from Python — tiktoken or transformers — and compare (Appendix D). Token parity is non-negotiable: even one wrong id feeds the model a different prompt than you think, and it will confidently produce wrong text.
You understand byte-pair encoding end to end — the merge-building, the byte-level trick, the greedy encode, the concatenate decode — and you know the two gotchas (UTF-8 boundaries, the pre-tokenization regex) that separate "almost right" from "matches GPT-2 exactly." You can now turn real prompts into the exact token ids GPT-2 expects. One more toolkit chapter — embeddings and sampling — and we build the transformer.
Embeddings & sampling
~1 dayTwo bookends of every transformer. At the front, embeddings turn token ids into the vectors the model actually computes on. At the back, sampling turns the model's output logits into the next token. Both are model-agnostic — you'll reuse this exact code for GPT-2, Gemma, and Whisper — so we build them once, here, and never rewrite them.
The input: token embedding + position
A token id is just an index; the model needs a vector. The token embedding is a [vocab, d_model] table — token t grabs row t as its starting vector (the lookup you met in the neural bigram). But attention alone is order-blind, so we must also inject where each token sits. GPT-2 does this with a learned positional embedding table, [max_pos, d_model], added on top. (Modern models encode position differently — RoPE, Chapter 30 — but the "embed, then add position" shape is universal.)
The embedding table isn't arbitrary — training shapes it so that tokens used in similar ways end up with similar vectors (remember cosine similarity, Chapter 7). Directions in this space come to encode real structure; the famous "king − man + woman ≈ queen" lives here. When you load GPT-2's weights, the embedding table is the distilled statistical "meaning" of every token, learned from billions of words. The model's first act on your prompt is to look up these learned vectors.
The output: from logits to a token
The model's final layer produces logits — one score per vocabulary token. A decoding strategy turns that into a choice. There's a spectrum from boring-and-safe to wild-and-creative, controlled by a few knobs you'll expose on the command line later.
Greedy — always take the highest-scoring token. Deterministic, repetitive, and prone to loops, but useful for debugging because it's reproducible.
Temperature — divide the logits by T before softmax. T < 1 sharpens (more confident, closer to greedy); T > 1 flattens (more random); T = 1 is the model's own distribution. This is the master creativity dial.
Top-k — keep only the k highest-scoring tokens, zero the rest, then sample. Prevents the model from ever picking something absurd from deep in the tail.
Top-p (nucleus) — keep the smallest set of tokens whose probabilities sum to at least p (say 0.9), then sample from those. Unlike top-k's fixed count, the set size adapts: narrow when the model is confident, wide when it's unsure. Top-p is the most popular default in practice.
Zig's std.sort.block takes a context value, the slice, and a comparison function. Here the context is the logits slice and the items are indices, so we sort indices by the logit they point at — leaving logits itself untouched until we mask the tail. The inline struct { fn gt(...) }.gt is Zig's idiom for a one-off function value. (For top-p, sort the same way, walk the sorted probabilities accumulating until you pass p, and mask everything after.)
The same model, same weights, feels completely different under different decoding. Greedy GPT-2 loops and drones; temperature-0.8 top-p-0.9 GPT-2 is lively and coherent. None of this changes the model — only how you read its probabilities. This is why two apps using the identical model can have very different "personalities," and why we expose these as runtime flags (Chapter 40) rather than baking them in. Generation quality is half model, half decoding.
Implement topP(logits, p): sort indices by probability descending, accumulate until the running sum reaches p, and -inf the rest. Then build a small Sampler struct holding { temperature, top_k, top_p, seed } and a next(logits) method that applies them in order. That struct is the exact decoder you'll point at GPT-2's logits in Part 6.
You have both ends of a language model: embeddings to turn tokens into vectors, and a full sampling toolkit (greedy, temperature, top-k, top-p) to turn logits back into tokens. Combined with the tokenizer from Chapter 18, you can get text in and text out. The only thing missing is the powerful part in the middle — the transformer — which is all of Part 5.
The Transformer
The payoff of all the kernels. Five chapters that build self-attention — the mechanism that lets a token gather information from every other token — then stack it into multi-head attention, the transformer block, and a complete tiny GPT. You'll assemble it, run it, and meet the generation loop that turns a forward pass into flowing text.
Self-attention
~1 weekThis is the idea that changed everything. The bigram forgot all but the last token; attention lets every token look at every earlier token and pull in exactly the information it needs. It's how "it" finds its referent, how a verb finds its subject, how context becomes meaning. And remarkably, it's built almost entirely from the dot product you've known since Chapter 7. Take your time here — this one mechanism, understood deeply, is the whole game.
Query, key, value
Each token produces three vectors by multiplying its embedding through three learned weight matrices. The query says "here's what I'm looking for." The key says "here's what I contain." The value says "here's what I'll contribute if you attend to me." A token compares its query against every token's key (dot product = relevance), turns those relevances into weights with softmax, and takes the weighted sum of the corresponding values. The result is a new vector for that token, blended from the whole context according to relevance.
The famous line is Attention(Q,K,V) = softmax(QKᵀ / √d_k) V. Read left to right: QKᵀ is every query dotted with every key — a matrix of relevance scores. Divide by √d_k (the head dimension) to keep the scores from growing too large and saturating the softmax. softmax over each row turns scores into weights that sum to 1. Multiplying by V takes, for each token, the weighted average of all value vectors. Every piece — matmul, scale, softmax, matmul — is a kernel you already wrote. Attention is those kernels in a specific arrangement.
The causal mask
For a language model generating left to right, a token must not see the future — at position i it may attend only to positions 0…i. We enforce this "causal" mask by simply not computing scores for future keys (or setting them to −∞ before softmax, which zeroes their weight). This is the single difference between an encoder's attention (sees everything) and a decoder's (sees only the past), and it's why GPT can generate one token at a time.
Single-head self-attention, in code
We project the input x: [n, d] to Q, K, V with our linear, then for each token run the score-softmax-blend over the allowed keys. Written plainly first; we optimize later.
The bigram's "memory" was one token. Attention's memory is the entire context, and it's content-addressed — a token retrieves information by relevance, not by position. When the model processes "the trophy didn't fit in the suitcase because it was too big," the query from "it" can light up the key from "trophy" and pull in its meaning. No fixed table of rules could do that; attention learns to route information dynamically. That dynamic, learned routing is why transformers crushed everything that came before.
Attention has many places to get an index or a transpose backwards, and a subtle bug produces fluent-but-wrong output. Build the reference in four NumPy lines — S = (Q@K.T)/sqrt(d); mask the upper triangle to -inf; P = softmax(S); out = P@V — feed both the same small random x and weights, and compare to 1e-4. Verify the causal property explicitly: changing a future token must not alter an earlier token's output. If it does, your mask is wrong.
Two classic attention bugs. Skip the 1/√d scale and, for larger d, the dot products grow big, softmax saturates to one-hot, and gradients/quality collapse. Mask incorrectly — attending to position i+1 at step i — and the model "cheats" by peeking at the answer, which looks fine in a forward pass on full text but breaks generation. Our loop sidesteps the second by only iterating j ≤ i; keep that invariant sacred.
You built self-attention from the dot product up, with a causal mask, and you can explain query/key/value as "what I want / what I have / what I give." This is the conceptual summit of the entire book — everything from here is replication and scaling. Next we run several attentions in parallel (multi-head) to let the model attend to different things at once.
Multi-head attention
~3–5 daysOne attention computes one kind of relationship at a time. But language has many simultaneous structures — grammatical subject, recent topic, matching brackets, long-range references. Multi-head attention runs several attentions in parallel, each in its own slice of the vector, so the model can attend to several things at once and combine the results. It's a small change to the code and a big change in capability.
Split the vector into heads
Rather than do one attention over the full width d, we split each token's Q, K, V into n_head chunks of size head_dim = d / n_head, run an independent attention within each chunk, and concatenate the per-head results back into a width-d vector before the output projection. Crucially, the scale becomes 1/√head_dim — the per-head dimension, not the full d.
The only structural change from Chapter 20 is the outer loop over heads and the + off on every slice — each head reads and writes its own head_dim-wide band of the vectors, completely independently. The output projection Wo is what lets the heads' separate findings interact.
When researchers probe trained models, heads specialize in interpretable ways: some track the previous token, some match a verb to its subject, some link a pronoun to its antecedent, some close brackets and quotes. No one programs this — it emerges from training because different relationships are useful for prediction. Multi-head attention gives the model the capacity for this division of labor; d/n_head is the per-specialty budget. GPT-2 small uses 12 heads of 64 dimensions each; Gemma uses similar head sizes. Sixty-four is a remarkably common head dimension across the whole field.
A clean test: run multiHeadAttention with n_head = 2, then run your single-head function twice on the two halves of the projected Q/K/V and concatenate. The results must match to floating-point tolerance — multi-head is exactly independent attentions on disjoint slices. If they differ, you've got an offset wrong. Also re-confirm the scale uses head_dim, not d: a common copy-paste bug from the single-head version where the two happened to be equal.
Head h occupies columns [h·head_dim, (h+1)·head_dim) of each row. When you later load real weights, the Q/K/V projection matrices are laid out so this slicing lines up — but if a model stores heads in a different interleaving (some do), your slices read the wrong columns and every head mixes garbage. When porting a new model, confirm the head layout before trusting attention. It's the kind of thing that produces plausible-looking nonsense.
You have full multi-head causal self-attention — the engine of every transformer — built from your own kernels and tested against the single-head version. You understand why multiple heads exist and how they divide the work. Now we wrap attention with normalization, a feed-forward network, and residual connections to form the repeatable transformer block.
The Transformer block
~3–5 daysA transformer is a stack of identical blocks, and a block is just two sub-layers — attention and a small feed-forward network — each wrapped in a normalization and a residual connection. That wrapping is what lets us stack dozens of blocks without the signal degrading. Once you have one block, the whole model is a loop that runs it N times.
The residual stream
Picture a running vector for each token — the residual stream — that flows straight through the whole model. Each sub-layer doesn't replace it; it reads a normalized copy, computes a contribution, and adds that back. So information travels untouched down the residual highway unless a sub-layer chooses to modify it. This "add, don't overwrite" design (the residual connection) is the reason very deep networks train and run stably: there's always a clean path for the signal and the gradient.
Pre-norm vs post-norm
Where the normalization goes matters. GPT-2 (and every model since) uses pre-norm: normalize before each sub-layer, add the raw result back to the residual. The original 2017 transformer used post-norm (normalize after adding) and was notoriously hard to train deep. Pre-norm keeps the residual stream un-normalized and clean, which is why it won. Our block is pre-norm.
The feed-forward network
After attention mixes information across tokens, the MLP lets each token process what it gathered, independently. It's two linear layers with an activation between, and it expands to a wider hidden size — typically 4×d — before projecting back. This is where much of the model's raw parameter count and "knowledge" lives.
A useful two-line summary of a transformer block: attention moves information between tokens (it's the only part where tokens see each other), and the MLP transforms each token's information in place (it never looks sideways). Alternating these — gather, then process, gather, then process — over many layers is how a transformer builds up understanding. Every block has the same shape; depth comes from repetition. When you grasp one block, you grasp the architecture.
Each forward allocates several temporaries (h, att, mid, back) — but the only thing that must survive is the residual stream x, which the caller owns. So pass an arena and reset it after each block: the long-lived x lives in your main allocator, the block's scratch is reclaimed wholesale every layer. Resetting inside a block would be a disaster (you'd free att while still adding it); reset only at the block boundary. This is the long-lived-vs-scratch split from Chapter 4, now load-bearing.
With all weights set so the sub-layers output zero (e.g. zero the output projections), a block must leave x exactly unchanged — that proves the residual paths are wired correctly. Then check shapes survive: in [n, d] out [n, d], with the hidden expansion hitting 4d in the middle. Shape bugs in the MLP (mixing up d and 4d) are common and instantly caught by a dimension assertion.
You have a complete transformer block: pre-norm, multi-head attention, residual, pre-norm, feed-forward, residual. This is the unit that, stacked and wrapped with embeddings and an output head, is GPT. Assembling that full model is the next chapter.
A tiny GPT
~1 weekEverything assembles now. A GPT is: embed the tokens and add positions, run the residual stream through a stack of transformer blocks, apply a final normalization, and project to vocabulary-sized logits. That's the whole architecture — the same shape whether it's our toy or GPT-2 itself. We'll build it as a struct with a config, wire the forward pass, and run it on random weights to confirm the plumbing.
Config and structure
Step 4 reuses tok_emb as the output projection — no separate "LM head" matrix. This is weight tying, and GPT-2 does exactly this. The intuition: the embedding maps "token → vector," and the output maps "vector → token scores"; making them transposes of each other ties the two directions of the same relationship, saving a big [vocab, d] matrix and often improving quality. Because we store tok_emb as [vocab, d] row-major, linear(logits, x, tok_emb, …) computes x · tok_embᵀ precisely — the tying falls out for free.
Run it on random weights
Allocate the tables and blocks, fill with small random numbers, and push a few tokens through. The output is meaningless (untrained), but every shape lines up and the full pipeline executes — embeddings, attention, MLP, norm, logits.
buildRandomGPT (yours to write — allocate each table and each block's weights from the long-lived allocator, fill with scaled random values) mirrors how loadGPT will work in Part 6: same structs, same shapes, but reading bytes from a file instead of a random generator. That symmetry is deliberate. The forward never knows or cares whether the weights came from std.Random or from gpt2.safetensors — which is exactly why the tiny GPT you run today becomes the real GPT-2 you run in two chapters with zero changes to the math.
Assert the output is [n, vocab] and that running the same input twice gives identical logits (no uninitialized memory leaking in). Then a structural check: feeding n tokens should give you n rows of logits, one prediction per position — that's what makes training efficient (every position is a supervised example at once) even though generation only consumes the last row.
A complete GPT forward pass runs in your Zig: tokens in, logits out, through real multi-head attention and feed-forward blocks you built. The architecture is done. What remains is to make it say something — the autoregressive generation loop — and to understand how its weights were trained, which is the next chapter and the close of Part 5.
Training & generating
~1 weekTwo things finish Part 5: making the model generate (the autoregressive loop, which is pure inference and fully runnable now), and understanding how its weights are trained (the same loss and gradients from Part 3, scaled up). Generation is the skill we carry into the rest of the book; training we'll demonstrate honestly at the scale our scalar autograd allows, then hand the heavy lifting to the pretrained weights we load in Part 6.
The generation loop
Generation is the next-token game on repeat: feed the current tokens, take the logits for the last position, sample one token, append it, and go again — until you hit a stop token or a length limit. This is the inference loop you'll reuse for GPT-2, Gemma, and (with a twist) Whisper.
Notice we re-run the entire sequence through the model every single step, just to get the last token's logits. Generating m tokens costs work proportional to m² — wasteful, because positions 0…n-1 produce the same keys and values every time. The fix is the KV cache (Chapter 28): remember each layer's past keys and values, and each step only computes the new token. We write the simple quadratic version first because it's obviously correct and makes a perfect reference to validate the cached version against.
Measuring the model: loss and perplexity
Even without training, you can evaluate a model on text — which is forward-only and runs right now. Run the sequence, and at each position score how surprised the model was by the actual next token with cross-entropy. The average is the loss; its exponential is perplexity.
A random model scores a loss near ln(vocab) (it's guessing uniformly); a trained model scores far lower. Watching this number is how you'd know training is working — and how you'll confirm, in Part 6, that you loaded GPT-2's weights correctly (its loss on real text should be low).
How the transformer is trained — and why we don't, here
Training this GPT is exactly the loop from Part 3: forward to get logits, cross-entropy against the next tokens, backpropagate to every weight, step. The only new ingredient is an autograd that works on tensors (so a single matmul node handles its whole gradient) rather than the scalar engine of Chapter 14. Extending the engine that far — backward rules for matmul, softmax, LayerNorm, and attention — is a wonderful project, but it's a book of its own, and training even a small transformer to coherence wants a GPU and patience.
Our goal is an inferencer — to run real models and understand every number. Training GPT-2 from scratch cost tens of thousands of dollars of compute; reproducing that isn't the point and isn't necessary. The weights are published precisely so people can study and run them. You've already trained networks from scratch (the classifier, XOR, the neural bigram), so you know exactly what training is. From here we do the impressive, tractable thing: take those hard-won weights and bring them to life. If you want to train a transformer yourself, Karpathy's nanoGPT and makemore are the canonical next step, and your kernels map directly onto them.
Wire the tokenizer (Chapter 18), embeddings, the tiny GPT, and this generation loop into one program: text prompt in, generated text out, even if the output is gibberish from random weights. Getting the pipeline connected end to end — encode → forward → sample → decode — is the milestone; in Part 6 you'll drop real weights into the exact same pipeline and it will suddenly speak English.
You can generate text autoregressively from a transformer you built, evaluate its loss and perplexity, and you understand precisely how its weights would be trained. The architecture and the inference loop are complete and correct (if slow). Everything is in place for the moment the book has been building toward: loading the real, pretrained GPT-2 and watching your own code produce fluent English.
GPT-2, for real
The moment it all pays off. Four chapters to read the actual pretrained GPT-2 weights, match its architecture down to the last detail, run the forward pass, and generate fluent English from your own Zig — then make generation fast with a KV cache. Same kernels, same model shape you already built; now with weights that cost a fortune to train and are yours to run.
Loading real weights (safetensors)
~3–5 daysPretrained weights live in files, and the modern format is safetensors — a dead-simple, safe layout that's just a small JSON header followed by raw tensor bytes. No code executes on load (unlike pickled PyTorch files), and the data is a flat blob you can read straight into your Tensors. Learn this one format and you can load GPT-2, Gemma, Llama, and almost everything else on Hugging Face.
Get the files
GPT-2 (124M) is on Hugging Face as openai-community/gpt2. You want model.safetensors (the weights), plus vocab.json and merges.txt for the tokenizer from Chapter 18. Download them into models/gpt2/.
The safetensors layout
The whole format, front to back: an 8-byte little-endian integer giving the header length; then that many bytes of JSON mapping each tensor name to its dtype, shape, and byte range; then the tensor data, with each tensor's data_offsets measured from the start of the data section.
A loader in Zig
Read the file, split off the header, parse it with std.json, and build a table from tensor name to a view of its bytes. We convert each tensor to f32 on access (GPT-2 is already f32; Gemma will be bf16, so we handle both).
That one-line bf16ToF32 is worth savoring. The bfloat16 format is literally the high 16 bits of an IEEE f32 — same exponent, fewer mantissa bits — so converting up is just a 16-bit left shift back into place. No special-casing of exponents, no table. f16 (IEEE half) is different and needs a real conversion, which Zig's native f16 type handles via @floatCast. Knowing which is which saves you from hand-rolling bit twiddling that Zig already does correctly.
After opening the file, print a few tensor names and shapes and confirm they match the architecture (e.g. wte.weight should be [50257, 768]). Spot-check a couple of values against what Python sees: from safetensors import safe_open; f.get_tensor("wte.weight")[0][:5]. If your first five numbers match Python's, your offsets and dtype handling are correct — and every weight after that will be too. A loader bug corrupts everything downstream, so prove it here.
readFileAlloc pulls the whole file into memory — fine for GPT-2's 500 MB on a normal machine, and simplest to reason about. For multi-gigabyte models (Gemma, large Whisper) you'd rather memory-map the file so the OS pages in only what you touch, and keep weights in their on-disk dtype to halve RAM. We'll note where that matters; for learning, read-it-all keeps the code clear, and the conversion-to-f32 on load means the rest of your engine never thinks about dtypes again.
You can open a safetensors file, parse its header, and pull any tensor out as f32, handling F32/F16/BF16. The pretrained weights are now reachable from Zig. Next we map them onto the exact GPT-2 architecture — which has a couple of specific quirks you must get right.
GPT-2, exactly
~3–5 daysYour tiny GPT and the real GPT-2 are the same architecture — but "the same" has to be exact, and GPT-2 carries a couple of historical quirks that will silently corrupt your output if you miss them. This chapter is the precise specification: the config, the tensor names, the one transpose that trips up everyone, and the small constants (which GELU, which epsilon) that must match.
The config (124M)
| Field | Value | Meaning |
|---|---|---|
| n_vocab | 50257 | 50000 BPE merges + 256 bytes + 1 <|endoftext|> |
| n_ctx | 1024 | max sequence length (positional table size) |
| d_model | 768 | residual stream width |
| n_head | 12 | attention heads (head_dim = 64) |
| n_layer | 12 | transformer blocks |
| MLP hidden | 3072 | 4 × d_model |
The larger GPT-2 sizes only change these numbers (medium 1024/16/24, large 1280/20/36, XL 1600/25/48). Read them from the model's config.json and your loader handles every size — the same "read shapes from the file" discipline that will later make Gemma and Whisper just-work.
The tensor names
| Name | Shape (as stored) | Role |
|---|---|---|
wte.weight | [50257, 768] | token embeddings (also the tied output head) |
wpe.weight | [1024, 768] | learned positional embeddings |
h.{i}.ln_1.{weight,bias} | [768] | pre-attention LayerNorm |
h.{i}.attn.c_attn.{weight,bias} | [768, 2304] | combined Q,K,V projection |
h.{i}.attn.c_proj.{weight,bias} | [768, 768] | attention output projection |
h.{i}.ln_2.{weight,bias} | [768] | pre-MLP LayerNorm |
h.{i}.mlp.c_fc.{weight,bias} | [768, 3072] | MLP up-projection |
h.{i}.mlp.c_proj.{weight,bias} | [3072, 768] | MLP down-projection |
ln_f.{weight,bias} | [768] | final LayerNorm |
The highlighted rows are Conv1D layers from OpenAI's original TensorFlow code, and their weight is stored as [in_features, out_features] — the transpose of PyTorch's nn.Linear (and of our [n_out, n_in] convention). So c_attn.weight is [768, 2304], not [2304, 768]. If you feed these straight into your linear, every projection is scrambled and the model emits confident garbage. The fix is one transpose per Conv1D weight at load time. This single detail is the most common reason a from-scratch GPT-2 "doesn't work."
Splitting the combined QKV
GPT-2 packs Q, K, and V into one matrix, c_attn. After transposing it to [2304, 768] (our convention), the first 768 rows are the Q projection, the next 768 are K, the last 768 are V — and the bias splits the same way. Slice them apart and feed them to the multi-head attention you already wrote.
Three more details, each a silent-bug if wrong. GPT-2 uses the tanh-approximation GELU (your gelu from Chapter 10 is already correct). Its LayerNorm epsilon is 1e-5. And the output head is the token embedding, tied — there's no separate lm_head tensor in the file. Attention uses biases on Q, K, V, and the output projection (unlike many modern models, which drop them). Match these and your numbers will track the reference; miss one and you get the maddening "almost right" failure.
Don't load all twelve layers and hope. Load the embeddings, run one token, and compare the post-embedding vector to PyTorch's. Then one block, compare the output. Then two. The first layer that diverges from the reference is where your bug is — almost always a missing transpose, the wrong GELU, or a Q/K/V split offset. Appendix D shows how to dump these intermediates from transformers. Bisecting against the oracle turns "the model is broken" into "block 0's attention output is off," which is a fixable sentence.
You have the exact GPT-2 specification: config, tensor names, the Conv1D transpose, the QKV split, and the constants (tanh GELU, eps 1e-5, tied head, attention biases). You know precisely how the bytes in model.safetensors map onto the structs you built in Part 5. Time to wire it together and generate.
Forward pass & first words
~3–5 daysThis is the chapter the whole book has been climbing toward. We connect the loader to the model: read every tensor, transpose the Conv1D weights, split the QKV, fill the structs — then run the generation loop from Chapter 24 and watch your own Zig produce fluent English. No frameworks underneath. Every multiply was a kernel you wrote.
Wiring weights into the model
The load function walks the layers, pulling each tensor by name and placing it where the architecture expects. The Conv1D weights get transposed; c_attn gets split. It's mechanical — the care is all in matching names and applying the transpose, which Chapter 26 spelled out.
Generate
Tokenize a prompt, run the loop, decode the result. With real weights, the magic happens.
That text came out of your code. Tokenizer, embeddings, twelve transformer blocks of multi-head attention and feed-forward, final norm, tied output head, sampling — all kernels you built and tested, running real GPT-2 weights. Pause and appreciate it. This is the deliverable the book promised in Chapter 0.
We've printed with std.debug.print (stderr) throughout — perfect for development. To stream generated text to stdout (so you can pipe it), use the buffered writer: var buf: [4096]u8 = undefined; var w = std.fs.File.stdout().writer(&buf); const out = &w.interface; then try out.print("{s}", .{piece}); and try out.flush(); at the end. The exact writer API shifted in the 0.15/0.16 "Writergate" rework, so if it doesn't resolve, check your version's std.fs.File docs — the concept (a writer over a stack buffer, flush at the end) is stable.
A Debug build runs every safety check and will make GPT-2 feel glacial — generate in -Doptimize=ReleaseFast for usable speed (correctness first in Debug, then switch). Two more gotchas: GPT-2's positional table is only 1024 long, so don't let the sequence exceed n_ctx; and remember the loop is still quadratic (Chapter 24) — a long generation crawls until we add the KV cache next chapter. If output is garbage rather than slow, it's almost certainly a load bug (transpose/split) — bisect with the oracle.
Your Zig generates English from real GPT-2. The inferencer exists. You can change the prompt, the temperature, the number of tokens, and watch a model you understand completely respond. Everything from Chapter 7's dot product to here was in service of this moment. Now we make it fast enough to enjoy — the KV cache.
The KV cache
~1 weekThe generation loop works but wastes enormous effort: every step re-runs the entire sequence to compute one new token. Yet the keys and values of past tokens never change — attention at step p only needs the new query against all keys and values, most of which it already computed. Caching them turns generation from quadratic to linear and is the single most important inference optimization. It's also the structure that makes long chats feasible.
The insight
In causal self-attention, token p's output depends on its query and on the keys/values of tokens 0…p. When we move to token p+1, the keys and values for 0…p are identical — same inputs, same weights. So we store them once, per layer, and at each step compute only the new token's K and V, append them, and attend the new query over the whole cache. We never recompute a past key or value again.
The cache and the cached step
Each layer gets a K buffer and a V buffer sized to the maximum context, plus a length. The attention step is the multi-head loop from Chapter 21, but the new query attends over cache.len keys instead of recomputing them.
Notice the causal mask vanished: it's implicit now, because the cache only ever contains tokens 0…p. There are no future keys to mask — they don't exist yet. The single-token formulation makes causality automatic.
Prefill, then decode
Generation now has two phases. Prefill: run the prompt through once to populate the caches (you can do this token by token with the same step). Decode: from then on, each step feeds exactly one token — the last one generated — through all layers using attnStep, producing logits for the next. Each step is now constant work (plus a cheap walk over the growing cache), so generating the 500th token costs about the same as the 5th.
You have a perfect reference: the quadratic generator from Chapter 24. With the same prompt, seed, and greedy decoding (temperature 0), the KV-cache version must produce the identical token sequence. If it diverges, the usual culprits are an off-by-one in c.len (appending before vs after attending), reading cache[len] before writing it, or a wrong head offset into the cache. Diff the first logits vector between the two paths at tolerance 1e-4 to localize it instantly.
The KV cache is long-lived — it must persist across every generation step — so it lives in your main allocator, sized to n_ctx up front, not in the per-step arena. Only the step's temporaries (q, ctx, scores) come from the arena you reset each token. Mixing these up (putting the cache in the arena) silently erases history on the next reset and the model forgets everything but the current token. Keep the Chapter 4 split crisp: weights and cache are kept; activations are scratch.
The cache size is fixed at n_ctx (1024 for GPT-2), and that's exactly the "context window" you hear about. Every cached token costs memory (two vectors per layer) and every new token attends over the whole cache, so longer context means more memory and slightly more work per step. This is the real, physical reason models cap their context — and why the field works hard on bigger, cheaper caches. You've now built the data structure at the center of that entire conversation.
Your GPT-2 generates at linear cost with a KV cache, matching the slow path token-for-token. You have a genuine, efficient inference engine: load real weights, prefill a prompt, decode fluently. This is a complete, real system built entirely from your kernels. Part 7 takes the same engine and modernizes it into a model you'd actually reach for today — Gemma.
The modern stack
GPT-2 is 2019. The transformers you'd run today — Gemma, Llama, Qwen — keep the same skeleton but modernize nearly every component. Four chapters to make those swaps one at a time: rotary positions, RMSNorm, gated MLPs, grouped-query attention — and then assemble them into a working Gemma inferencer. You'll watch a 2019 model become a 2025 one, change by change.
What changed since GPT-2
~½ dayThe good news up front: the architecture you built is still correct. Modern decoder-only models are GPT-2 with better-engineered parts. None of the changes alter the big picture — embed, stack of (attention + MLP) blocks with residuals, norm, project to logits. They swap individual components for ones that train more stably, run faster, or generalize to longer context. Here's the map before we implement each piece.
| Component | GPT-2 (2019) | Modern / Gemma (2024–25) | Why it changed |
|---|---|---|---|
| Position | learned absolute table | RoPE (rotary) | relative, extends past training length, no table |
| Normalization | LayerNorm, pre-norm | RMSNorm, pre + post norm | cheaper, fewer params, more stable |
| MLP | Linear → GELU → Linear | gated: GeGLU / SwiGLU | a learned gate improves quality per param |
| Attention | multi-head (MHA) | grouped-query (GQA) | far smaller KV cache, faster decode |
| Biases | on all linears | usually none | biases barely help; dropping them is simpler/faster |
| Output head | tied to embedding | tied (Gemma) / separate (Llama) | tying saves a big matrix |
| Vocab | 50,257 | 256,000 (Gemma) | bigger vocab, shorter sequences, multilingual |
The beauty of building from kernels is that these are local swaps. RoPE replaces the positional-embedding add. RMSNorm replaces LayerNorm. The gated MLP replaces the two-linear MLP. GQA changes how many K/V heads you project. You can flip each one independently and keep the rest of your engine — the loop, the cache, the sampler — untouched. That's exactly how we'll proceed: one kernel per chapter, each tested in isolation, then assembled. Understanding the deltas is more valuable than memorizing one model, because every new model is just a different combination of these same parts.
Beyond the generic modern stack, Gemma adds a few specific touches you must match: it scales the token embeddings by √d_model right after lookup; its RMSNorm uses (1 + weight) rather than just weight; Gemma 2 applies soft-capping (a tanh squash) to attention scores and final logits, while Gemma 3 swaps that for QK-norm; and it interleaves local sliding-window and global attention layers. We'll implement the core (RoPE, RMSNorm, GeGLU, GQA, embedding scale, soft-cap) and treat sliding-window as an optimization you can add. Each fingerprint is a one-liner once you know it exists — and a baffling bug if you don't.
You have the full map of what separates a 2019 transformer from a 2025 one, and you know the four core swaps plus Gemma's specific quirks. None of it threatens the architecture you built — it refines it. Let's implement the swaps, starting with the cleverest: rotary position embeddings.
RoPE: rotary positions
~3–5 daysGPT-2 told the model about position by adding a learned vector. Modern models do something more elegant: they rotate each query and key vector by an angle proportional to its position. The dot product between a rotated query and a rotated key then depends only on their relative position — which is exactly what attention should care about. This is RoPE, rotary position embedding, and it has no learned parameters at all.
The idea: encode position as rotation
Split a head's vector into pairs of dimensions and treat each pair as a 2-D point. Rotate each pair by an angle that grows with the token's position (and differs per pair, so different pairs spin at different speeds). Because rotating a query by angle mθ and a key by nθ makes their dot product depend on (m−n)θ, the attention score between two tokens automatically encodes how far apart they are — relative position, for free, baked into the geometry.
For a head of dimension hd, pair dimension i with dimension i + hd/2 (the convention Hugging Face uses). Each pair gets a frequency θᵢ = base^(−2i/hd) — low pairs rotate slowly (encode coarse, long-range position), high pairs fast (fine position). At position p the angle is p·θᵢ, and the pair (a, b) becomes (a·cos − b·sin, b·cos + a·sin). The base (often 10,000; Gemma 3 uses 1,000,000 on its global layers) sets how slowly the slowest pair turns, which controls the maximum context the scheme handles gracefully.
Implementation
Apply RoPE to each head's slice of Q and K (never V) right before computing attention scores. It's an in-place rotation.
In attention you now call ropeRows on Q and K after projection and before the scores loop. Everything downstream — softmax, the value blend, the KV cache — is unchanged. (For the cached decode step, rotate the new token's Q and K by its absolute position before appending K to the cache; the cached keys were already rotated at their positions, so relative geometry is preserved.)
GPT-2's absolute table learns a distinct vector for "position 5" and "position 500," and it simply has no entry beyond position 1023 — feed it a longer sequence and it breaks. RoPE encodes only differences, so "five tokens back" means the same thing whether you're at position 5 or 5,000. That's why modern models can stretch to context lengths far beyond what they trained on (sometimes with a tweak to base), and why they dropped the learned table entirely. Same attention math, smarter position encoding.
There are two ways to pair dimensions: adjacent pairs (0,1),(2,3),… (the original paper) and split-half pairs (0, hd/2),(1, hd/2+1),… (Hugging Face, used above). They're mathematically equivalent only if the weights were trained with the same convention. Load Gemma/Llama weights and apply adjacent-pair RoPE and you'll get subtly wrong attention — fluent gibberish again. Since we load HF weights, use the split-half version shown here. When in doubt, check the reference implementation's rotate_half.
RoPE's defining property is checkable: take two vectors q, k, rotate q at position m and k at position n, and the dot product should depend only on m − n. Verify dot(rope(q,5), rope(k,3)) ≈ dot(rope(q,7), rope(k,5)) — same relative gap, same score. And cross-check one rotated vector against the HF implementation in NumPy. If the relative property holds and one vector matches, your RoPE is correct.
You implemented RoPE and understand why rotation encodes relative position better than a learned table. Your attention can now use modern positioning — the first and trickiest of the Gemma swaps. Three smaller ones remain: RMSNorm, the gated MLP, and grouped-query attention.
RMSNorm, GeGLU & GQA
~1 weekThree more swaps and you'll have every modern component. RMSNorm replaces LayerNorm (you already have the kernel — we just add Gemma's twist). The gated MLP replaces the plain one. And grouped-query attention shrinks the KV cache that dominates memory during generation. Each is a small, testable kernel.
RMSNorm, the Gemma way
You built RMSNorm in Chapter 10. Gemma adds one wrinkle: it scales by (1 + weight) rather than weight, so a zero-initialized weight means "identity," which trains nicely. The epsilon is 1e-6.
The gated MLP (GeGLU)
GPT-2's MLP was up-project, activate, down-project. The modern version adds a gate: a second up-projection that, after an activation, multiplies the first elementwise — letting the network learn to suppress or pass each hidden unit. Gemma uses GeGLU (gate activated by GELU); Llama uses SwiGLU (SiLU). Three weight matrices instead of two: gate, up, down.
The plain MLP applies the same nonlinearity to every hidden unit unconditionally. The gate makes the activation input-dependent: for each hidden unit, one branch decides "how much should this pass?" and multiplies the other branch by it. It's a learned, per-unit volume knob. Empirically, gated MLPs match a plain MLP's quality with fewer parameters (which is why Gemma's intermediate size is tuned around three matrices), and they've become standard across Llama, Gemma, Qwen, and the rest. The cost is one extra projection.
Grouped-query attention (GQA)
In full multi-head attention, every query head has its own key and value heads — and the KV cache stores all of them, which is the bulk of generation memory. GQA observes that you can share: use many query heads but only a few KV heads, with each KV head serving a group of query heads. Gemma 2 2B uses 8 query heads and 4 KV heads (groups of 2); some models go further (8 query heads sharing 1 KV head). The KV cache shrinks proportionally.
The only change from Chapter 28's attnStep is the index math: query head h reads KV head h / group, and the cache rows are n_kv·hd wide instead of d. The value-blend uses the same kv_head mapping. Everything else — softmax, RoPE, the loop — is identical.
During generation, the KV cache — not the weights — often dominates both memory and bandwidth, and bandwidth is what limits decode speed. Halving the KV heads halves the cache and roughly halves the memory traffic per token, for a barely-measurable quality cost. That's why essentially every model since 2023 uses GQA. You're implementing it as a few index changes; in production it's the difference between a model fitting on your machine or not.
Set n_kv = n_head (group size 1) and your GQA code must produce identical results to the multi-head attention from Chapter 21 — that's the degenerate case where every query head has its own KV head. It's a perfect regression test for the index math. Then set n_kv = 1 and confirm every query head reads the same single KV head. Get these two endpoints right and the in-between (Gemma's 4) is correct by construction.
You have all four modern kernels: RoPE (last chapter), Gemma's RMSNorm, the GeGLU gated MLP, and grouped-query attention — each tested against the component it replaces. The modern transformer is fully in your toolbox. Now we assemble them, with Gemma's specific quirks, into an inferencer that runs a real 2025 model.
A Gemma inferencer
~1–2 weeksAssemble everything into a real, current model. We target Gemma 2 2B — small enough to run on a laptop, well-documented, and a clean showcase of the modern stack plus a few Gemma-specific touches. You already have every kernel; this chapter is about the exact wiring and the fingerprints that make it Gemma rather than generic.
The config (Gemma 2 2B)
| Field | Value | Note |
|---|---|---|
| hidden_size (d) | 2304 | residual width |
| num_hidden_layers | 26 | blocks |
| num_attention_heads | 8 | query heads |
| num_key_value_heads | 4 | GQA: groups of 2 |
| head_dim | 256 | note: 8×256=2048 ≠ 2304 — heads are decoupled from d |
| intermediate_size | 9216 | GeGLU hidden |
| vocab_size | 256000 | large multilingual vocab |
| rope_theta | 10000 | RoPE base |
| rms_norm_eps | 1e-6 | for RMSNorm |
| query_pre_attn_scalar | 256 | attention scale = 1/√256 |
| attn_logit_softcapping | 50.0 | cap on attention scores |
| final_logit_softcapping | 30.0 | cap on output logits |
| sliding_window | 4096 | on alternating (local) layers |
Gemma weights are gated: accept the license on Hugging Face (google/gemma-2-2b), then download with an access token. The files are bf16 (~5 GB), which your loader already handles via bf16ToF32 — though converting all of it to f32 doubles RAM to ~10 GB, so this is where memory-mapping and keeping weights in bf16 start to matter (Chapter 25's note). For a first run on a machine with enough RAM, convert-to-f32 is fine and keeps the code identical to GPT-2's path.
The five Gemma fingerprints
Generic modern stack aside, these are the things that make output correct for Gemma specifically. Miss any one and you get fluent nonsense.
The Gemma block: norm sandwiches
Gemma 2 wraps each sub-layer in both a pre-norm and a post-norm — four RMSNorms per block, versus GPT-2's two LayerNorms. The structure:
Inside gqaAttention: project Q/K/V (no biases), apply RoPE to Q and K, compute scores with scale 1/√256, softcap(scores, 50) before the softmax, blend values with GQA's head sharing, output-project. The full-model forward then is: look up embeddings, scaleEmbedding, run 26 blocks, final RMSNorm, tied projection to logits, softcap(logits, 30), sample.
Tensor names (HF Gemma 2)
Good news on the transpose front: these are stored in the standard nn.Linear [out, in] layout — our convention — so no transposing, unlike GPT-2's Conv1D weights. And there are no biases to load. The loader is actually simpler than GPT-2's; the complexity moved into the block's extra norms and the fingerprints.
Gemma 2 alternates "local" layers (attention limited to the last 4096 tokens) with "global" layers (full attention). For any prompt shorter than 4096 tokens, local and global are identical — so for a first working inferencer you can ignore sliding window entirely and treat every layer as global. Add the window later as an optimization for very long contexts: it just caps how far back each local layer's attention loop reaches. Getting Gemma talking does not require it; correctness on normal prompts is unaffected.
Gemma has enough fingerprints that you should validate against the reference before trusting generation. Load Gemma 2 2B in Python transformers, run a short prompt, and dump the logits for the first position. Run your inferencer on the same tokens and compare — aim for agreement within a few percent (bf16→f32 and tanh approximations cause small differences). If logits are wildly off, check the five fingerprints in order: embedding scale, then RMSNorm (1+w), then the attention scale, then soft-capping, then the norm-sandwich placement. One of those is almost always the culprit.
If you want the very newest: Gemma 3 keeps this skeleton and changes a few things. It drops soft-capping in favor of QK-norm (an RMSNorm applied to queries and keys before attention), uses a 5-local-to-1-global layer pattern with a higher RoPE base (1,000,000) on the global layers, and extends context to 32K. Each is a small kernel swap on the engine you just built: add a norm in the attention path, change which layers are local, branch the RoPE base by layer type. The fact that "the newest model" is a handful of deltas on your code is the whole point of having built from kernels.
Your engine runs a modern 2025 model: RoPE, RMSNorm, GeGLU, GQA, plus Gemma's embedding scale, soft-capping, and norm sandwiches — generating text from real Gemma weights. You went from GPT-2 to Gemma by swapping kernels, not rewriting. The same engine now spans six years of architecture. One frontier remains in this book: turning the model's ear on — speech, with Whisper.
To speech: the road to Whisper
The same transformer, rewired to listen. Whisper adds an encoder that reads audio and a cross-attention bridge so the decoder can predict text while attending to sound. You've built every kernel it needs; these four chapters cover the new wiring, the one new domain piece (audio → mel), and how it all composes — then hand off to the companion Whisper in Zig book for the full production build, and close with performance and a CLI.
Encoder–decoder & cross-attention
~1 weekGPT and Gemma are decoder-only: one stack that reads text and predicts the next token. Whisper is encoder–decoder: an encoder digests the entire input (audio) into a rich representation, and a decoder generates text while attending to that representation. The decoder is the GPT you already built, plus one extra attention per block — cross-attention — that looks at the encoder's output. That's the whole new idea, and it reuses every kernel you have.
Two stacks, three attentions
The encoder is a transformer stack with bidirectional self-attention — no causal mask, because it sees the whole input at once (the audio isn't being generated left-to-right; it already exists). Its job is to turn input features into contextual vectors. The decoder generates text autoregressively and each block now has three sub-layers instead of two:
Self-attention (causal) is what you built in Part 5. The MLP is unchanged. The only new component is the middle one.
Cross-attention is attention with borrowed K/V
Here's the elegant part: cross-attention is the same attention math, but the queries come from the decoder's current tokens while the keys and values come from the encoder's output. The decoder asks "given what I'm saying, which parts of the audio are relevant?" and pulls in those encoder vectors. Because the encoder output is fixed during decoding, you compute its K and V once and reuse them for every generated token — a cross-attention KV cache that never grows.
You've now seen every kind. Causal self-attention (GPT decoder): Q, K, V all from the same sequence, masked to the past. Bidirectional self-attention (the encoder): same source, no mask. Cross-attention (the bridge): Q from one sequence, K/V from another, no mask. They're one kernel parameterized three ways — by where Q/K/V come from and whether the future is masked. Once you see that, "encoder–decoder" stops being a separate architecture and becomes a recombination of parts you own.
Two things to get right. Cross-attention has no causal mask — a text token may look at any point in the audio, future included, because the audio isn't being generated. And some encoder–decoder models (Whisper among them) omit the bias on certain projections (e.g. the key projection); copy the reference's choice exactly. As always, the failure mode for getting these wrong is fluent-but-wrong output, so validate against the reference rather than by ear.
You understand encoder–decoder structure and cross-attention — the only architectural ideas in Whisper you hadn't already built. The decoder is your GPT plus a borrowed-K/V attention; the encoder is a maskless transformer stack. All that's missing is getting audio into the encoder in the first place, which is the one genuinely new domain in this book.
Audio → mel
~1 weekThe encoder doesn't read raw audio — it reads a log-mel spectrogram, a 2-D image of how the sound's frequency content changes over time. Turning a waveform into that image is the one genuinely new domain in this book: a little signal processing. It's a fixed pipeline (no learned weights), and once you produce a mel that matches the reference, the encoder is just the transformer stack you already know.
The pipeline
Whisper expects 16 kHz mono audio. The steps: slide a short window across the samples, take the Fourier transform of each window to get its frequencies, map those onto a perceptual "mel" scale with a filterbank, take the log, and normalize. The result is a [n_mels, n_frames] array — 80 mel channels for most Whisper sizes, 128 for large-v3.
Human hearing is logarithmic: we distinguish low frequencies finely and high ones coarsely, and we perceive loudness on a log scale. The mel filterbank mimics the first — its triangular filters are packed tightly at low frequencies and spread out at high ones — so the model spends its representation budget where ears do. The final log mimics the second, compressing a huge dynamic range into something stable to compute on. The whole transform is engineered to hand the network the same information your cochlea hands your brain.
The shape of the code
Load samples, then for each frame window run an FFT, square the magnitudes into a power spectrum, apply the mel filterbank (a fixed [n_mels, n_freqs] matrix — itself just a matmul), and take the log. This is a sketch; the fiddly constants (window function, exact normalization) are where bit-for-bit matching lives.
The Fast Fourier Transform turns a window of samples into its frequency components in O(N log N) instead of the naive O(N²). Writing a correct radix-2 FFT in Zig is a satisfying afternoon — it's the same recursive divide-and-conquer everywhere — but it's a self-contained topic, and the companion Whisper in Zig book derives it in full alongside the exact window, mel-filter construction, and normalization Whisper uses. For learning the shape of the audio frontend, the pipeline above is the whole story: window, transform, mel-project, log.
The mel frontend has no learned weights and is fully deterministic, so it's the cleanest possible oracle check: run openai-whisper's log_mel_spectrogram on a WAV in Python, save the array, and compare to yours element-by-element. Aim for agreement to ~1e-3. If frame 0 matches, your windowing and FFT are right; if all frames match, the whole frontend is correct and any later bug is in the transformer, not the audio. Nail this before touching the encoder.
You understand the audio frontend end to end — window, FFT, mel filterbank, log — and how it produces the [n_mels, n_frames] image the encoder consumes. It's the one piece outside the transformer world, and it's a fixed pipeline you can validate exactly. Now we compose the encoder, the decoder, and cross-attention into Whisper.
Whisper, assembled
~1 weekNow we compose. Every piece exists: the mel frontend, the transformer blocks, all three flavors of attention, the KV cache, the tokenizer, the sampler. Whisper is the arrangement. This chapter shows how they fit, then hands you off — because a full, fast, every-size Whisper is exactly what the companion Whisper in Zig book builds, and you now have all the foundations it assumes.
The full pipeline
The encoder begins with a small "conv stem" — two 1-D convolutions that gently downsample the 3000 mel frames to 1500 and lift them into the model width — then runs maskless transformer blocks (Chapter 33) and a final norm, producing 1500 audio vectors. The decoder is your GPT with cross-attention added: each block does masked self-attention over the tokens so far, cross-attention into those 1500 audio vectors, then an MLP. You precompute the cross-attention K/V once from the encoder output (it never changes) and run the generation loop from Part 6.
The special-token prompt
One Whisper-specific detail that's pure orchestration: the decoder doesn't start empty. It's primed with control tokens that tell it what to do — begin-transcript, the language, the task, whether to emit timestamps:
Swap <|transcribe|> for <|translate|> and the same weights translate instead of transcribe — the "task" is just a prompt token. Getting this prefix exactly right is the difference between a perfect transcript and confident gibberish, which is a theme you'll recognize by now.
Look at what Whisper added over your Gemma engine: an audio frontend (Chapter 34), a conv stem (a couple of convolutions), maskless encoder attention, and cross-attention (Chapter 33). The decoder's self-attention, the MLP, the KV cache, the BPE tokenizer, the sampling loop — all already yours, unchanged. That's the reward for building from kernels: a speech-recognition model is mostly a re-wiring of a text model, and you can see exactly which few parts are genuinely about sound.
This is the natural hand-off point. The companion Whisper in Zig book builds the full production engine on precisely the foundations you now have: the complete tensor library, the exact mel frontend and FFT, the conv stem, encoder and decoder with KV caches, the byte-level BPE tokenizer, beam search and timestamps, every model size through large-v3-turbo, all the quantization formats (F16, Q4/Q5/Q8, k-quants), and an Apple-Silicon performance path (SIMD, multithreading, Accelerate). Where that book says "assume you know Zig and the tensor basics," it means this book. You're ready for it.
You can trace the complete path from a sound wave to a transcript and name every component, distinguishing the handful that are new to speech from the many you reused. The arc the book promised — from a dot product to speech recognition — is complete in your understanding. One part remains, and it's the one that turns a correct toy into a usable tool: Part 9 makes any of these models genuinely fast — profiling, SIMD, threads, quantization — and ships a real CLI.
Performance: making it fast
Your models are correct but probably leave most of your machine idle. This part makes them genuinely fast — measured every step — and it stands on its own: profile to find the one loop that matters, vectorize it with SIMD, spread it across cores, and shrink the weights with quantization. Then we ship the whole thing as a real command-line tool. The rule throughout: keep the simple, correct version as a reference and optimize only what the profiler points at.
Profiling — find the one loop that matters
~3–4 daysBefore changing a single line for speed, measure. Optimizing by guesswork wastes effort on code that doesn't matter and risks breaking code that does. The good news: a transformer's time is dominated by one operation, so profiling's job is less "discover the hotspot" and more "confirm it on your machine and get a baseline number to beat."
Benchmark the right binary
A Debug build runs bounds checks, overflow checks, and skips the optimizer — its timings are meaningless for tuning, often 5–20× slower than release. Always measure in ReleaseFast (or ReleaseSafe if you want to keep overflow checks). The very first "optimization" everyone discovers is simply this switch.
A scoped timer you can sprinkle anywhere
You don't need a profiling framework to find a hotspot that's 90% of the runtime — a stack timer that prints when it goes out of scope is enough to rank the stages.
Wrap the top-level stages first — embedding, each block, the final projection — run a generation, and read the split. You'll see the dense matrix multiplies (linear) dwarf everything; attention's softmax and the norms are rounding error by comparison. Drill one level down and time linear specifically: it's the tall pole, and it's where every remaining chapter aims.
A linear layer of shape M×N×K costs about 2·M·N·K floating-point operations. Sum that over the model for one token, divide by your measured time, and you get GFLOP/s. Compare it to your CPU's peak (tens of GFLOP/s per core for scalar f32, several times that with SIMD). If you're far below peak, you have headroom — the premise of the next two chapters. If you're near it, you're memory-bound, and quantization (Chapter 39) is the lever. Measuring this number tells you which fight you're in.
For an LLM the headline metric is decode speed: tokens generated per second (or its inverse, ms/token). Establish it now, in ReleaseFast, on a fixed prompt and length — that single number is what every change in this part is judged against. Real profilers go deeper when you need them: perf on Linux, Instruments' Time Profiler on macOS, both showing line-level hot spots. But the scoped timer plus a tokens/sec baseline is enough to drive the whole optimization.
You have a baseline: tokens/second in ReleaseFast and a per-stage breakdown that names linear as the hotspot. Every optimization from here is a measured before/after against that number. You're optimizing from evidence, not vibes — which is the whole discipline.
SIMD with @Vector
~1 weekModern CPUs can do the same arithmetic on several numbers at once with a single instruction — SIMD (Single Instruction, Multiple Data). One multiply-add instruction can handle 4, 8, or 16 floats in parallel. Zig exposes this portably through @Vector: you write lane-width code once and the compiler lowers it to your CPU's vector unit (NEON on Apple Silicon, AVX on x86). Vectorizing the dot product is the single highest-leverage change in the book, because it's the inner loop of the matmul that dominates everything.
The vectorized dot product
Four Zig builtins do the work: @splat broadcasts a scalar to all lanes, the [i..][0..VEC].* dereference loads a contiguous chunk as a vector, @mulAdd is the fused multiply-add (one rounding, faster and more accurate than separate * then +), and @reduce(.Add, …) sums the lanes at the end. Drop this in as ops.dot and the entire model speeds up several-fold, because every linear calls it.
Build with -Dcpu=native (or your exact chip, e.g. apple_m1) so the backend assumes the right vector width and instructions — @Vector is portable source, but the CPU target decides how aggressively it lowers. The scalar tail loop handles lengths that aren't a multiple of VEC; don't skip it or you'll silently drop the last few elements. Lane width is worth tuning: 8 is a safe default, but measure 4 and 16 — the sweet spot depends on the chip and the loop.
Vectorize the reductions too
Once dot is fast, the profiler may surface the elementwise and reduction kernels. The same pattern vectorizes them: a sum-of-squares for LayerNorm/RMSNorm, the sum in softmax, scaling a buffer. Each is a @Vector accumulate plus a @reduce.
Activations like GELU vectorize too (load a chunk, apply the math lane-wise), though transcendentals such as tanh may not have a vector form on every target — measure before bothering. The dot product is where nearly all the win lives; the rest is cleanup the profiler will or won't ask for.
Keep the scalar dot and its tests. After swapping in the SIMD version, the model's output must match the scalar path to about 1e-6 — the only difference is floating-point reassociation from summing in lanes, far below any token boundary. Assert maxAbsDiff(scalar, simd) < 1e-4 on random inputs; if it's larger, you have a real bug (a bad tail, a misaligned load), not rounding. This is exactly why you built tests in Chapter 6 — they let you rewrite the hottest code fearlessly.
One vectorized dot product and your tokens/second jumps several-fold, with identical output, using only Zig's portable @Vector — no architecture-specific intrinsics. You've hit the CPU's vector units through clean, checked code. Next: do this on every core at once.
Multithreading the matmul
~1 weekA single core now runs the dot product fast; you have several more sitting idle. The matmul is embarrassingly parallel: each output neuron is an independent dot product, so you can split the output across a thread pool with no locks and no shared mutable state. The one rule that keeps it safe: each thread writes a disjoint slice of the output and reads only shared, immutable inputs.
A thread pool over output rows
Initialize the pool once at startup (try pool.init(.{ .allocator = a, .n_jobs = n_threads })) and reuse it for every matmul. The work splits the N output columns into chunks; each job owns a disjoint range, so there's nothing to lock. waitAndWork lets the calling thread pitch in rather than idle.
First, your scratch arena is not thread-safe — never let two jobs allocate from it at once. The design above sidesteps this: the output y is allocated once on the calling thread and jobs only write into their slices; no job allocates. Second, modern chips (Apple Silicon, recent Intel) mix fast performance cores and slow efficiency cores — spawning a thread per logical core can shove work onto the slow ones and regress. Benchmark thread counts; the sweet spot is usually the number of performance cores, not the total.
Splitting columns across threads must not change a single output bit — each y[m*N+n] is computed by exactly one job, the same way, regardless of scheduling. Assert the threaded output equals the single-threaded output exactly, and run a generation a few times: identical tokens every run. If results flicker between runs, you have a data race — almost always overlapping output ranges or a shared arena write. Determinism here isn't a nicety; it's how you know the parallelism is sound.
SIMD × cores. On a multi-core machine your tokens/second is now a large multiple of the scalar baseline, with output unchanged and reproducible. The pure-Zig CPU path is genuinely fast. One lever remains, and it attacks a different bottleneck — memory.
Quantization — smaller, and faster
~1 weekAfter SIMD and threads, your model isn't limited by arithmetic — it's limited by memory bandwidth. Generating each token streams every weight from memory through the CPU, and there are hundreds of millions of them. Quantization attacks this directly: store weights as small integers instead of 32-bit floats, so there are fewer bytes to move. It shrinks the model 4× (or 8×) and, because decode is memory-bound, speeds it up too.
The one idea: blocks with a shared scale
Neighboring weights in a matrix have similar magnitudes, so storing 32 bits each is wasteful. Group them into fixed blocks and store, per block, a single high-precision scale plus a small integer per weight. Reconstruct with weight ≈ scale × quant. The classic 8-bit format, Q8_0, uses blocks of 32: one f16 scale and 32 signed bytes — 34 bytes for what was 128, and nearly lossless.
Two ways to use quantized weights
Dequantize-on-load is the easy path: expand every weight back to f32 right after reading the file, and your whole engine is unchanged. You get the smaller file and can run a quantized model immediately, but you lose the memory/bandwidth win because everything is f32 in RAM. It's the right first step — prove correctness, then optimize.
Packed integer matmul is where the speed lives: keep the weights packed and write a dot product that consumes them directly. The trick (from llama.cpp and ggml): quantize the activation row to 8-bit on the fly too, do integer multiply-adds block by block, and rescale by the two block scales at the end. Integer math is fast, and — crucially — you move far fewer bytes.
It's counterintuitive that lower precision runs quicker — surely less precise means more work? But decode speed is set by how fast weights reach the CPU, not by the arithmetic. A 4-bit weight is one-eighth the bytes of an f32, so the memory system delivers them eight times faster, and the integer math is cheap. This is the same reason GQA helped (smaller KV cache = less traffic). Once you internalize "the bottleneck is bytes moved, not flops computed," quantization stops being mysterious and becomes the obvious next move.
Below 8 bits, Q4_0 packs two 4-bit weights per byte (block of 32 = an f16 scale + 16 bytes) and centers them by subtracting 8: weight = scale × (q − 8). It's 4× smaller than f16 with a small, usually acceptable quality cost. More sophisticated "k-quant" formats use super-blocks with sub-scales for better accuracy per bit — we build Q4_K and Q6_K in-house in Chapter 43 — but Q8_0 and Q4_0 are the core ideas, and they're enough to make your GPT-2 and Gemma engines small and fast.
Quantization trades accuracy for size/speed, so measure the trade. Compute perplexity (Chapter 24) on a fixed passage at f32, then at Q8_0 and Q4_0. Q8_0 should be nearly identical; Q4_0 a touch higher. Freeze a threshold and make it a regression test, so a future change to a dequant formula fails loudly instead of quietly degrading output. And always validate a new format by diffing its dequantized weights against the f32 originals before trusting generation — a wrong block layout produces fluent, confident, subtly wrong text.
You can store and run models in 8-bit and 4-bit, shrinking them several-fold and speeding up memory-bound decode — and you understand why bandwidth, not arithmetic, was the ceiling. Your engine is now fast on a single core, across all cores, and light on memory. All that's left is to wrap it in a tool you'll actually use.
A real CLI
~3–5 daysMake it a tool. Everything so far runs from a hard-coded main; a real inferencer takes a model, a prompt, and the decoding knobs on the command line, and streams text as it generates. No library needed — walk std.process.argsAlloc.
Argument parsing
Stream tokens to stdout
Pipe-able output goes to stdout through a buffered writer; decode and print each token as it's produced so the user sees text appear live (buffer until a complete UTF-8 boundary, per Chapter 18).
Wire parseArgs to your loader, tokenizer, thread pool, and generation loop; stream tokens as they're produced; add a --help. Then point the same binary at GPT-2 and at Gemma with identical flags. You now have one command-line program that runs two different real language models — fast, quantizable, multi-threaded — built entirely from kernels you wrote and understand. Put it on your PATH and use it.
You have a fast, real command-line LLM inferencer that runs GPT-2 and Gemma — profiled, vectorized with SIMD, multi-threaded, and quantized — with the road to Whisper fully mapped. For most people that's a triumphant place to stop. If you want to close the remaining gap to a production engine like llama.cpp — a cache-blocked matmul, fused attention, the k-quant formats, mmap and a quantized KV cache — Part 10 is the deep end.
Toward production
Part 9 made your engine fast the honest way. This part closes the remaining gap to a real CPU inference engine of the llama.cpp/ggml class: a cache-blocked matmul micro-kernel, fused flash-style attention, the k-quant formats brought fully in-house, and the engine architecture — memory-mapped weights, a quantized KV cache, the prefill/decode split — that ties it together. These techniques are more involved and more architecture-sensitive than anything so far; we build the ideas and correct code, and flag honestly where the final increment needs per-chip tuning.
A cache-blocked matmul
~1–2 weeksYour SIMD linear is fast per dot product, but on big matrices it leaves performance on the table because of memory, not math. A naive matmul streams each weight from RAM once per output row it touches, thrashing the cache. Production GEMM (the BLAS at the bottom of every framework) fixes this with blocking: restructure the loops so data, once loaded into cache and registers, is reused as much as possible before being evicted. This is the single biggest single-core win after SIMD.
The memory hierarchy is the whole story
A multiply-add is nearly free; fetching the operands is what costs. Registers are instant, L1 a few cycles, L2 a dozen, RAM hundreds. A naive triple loop reads a weight from RAM, uses it for one multiply, and discards it — paying the RAM price per operation. Blocking arranges the computation so each weight loaded into cache feeds many multiplies before eviction, amortizing the fetch. Two levels: cache blocking (tile the matrices so a working panel fits in L2) and register blocking (a tiny micro-kernel that keeps a small output tile entirely in vector registers across the inner loop).
The register-blocked micro-kernel
The heart of any fast GEMM is a micro-kernel that computes a small MR×NR tile of the output, holding the accumulators in registers so the inner K loop touches no memory but the inputs. Here's a 4×8 tile (4 rows, one 8-wide vector of columns) built by outer-product accumulation:
The inline for unrolls the tile so the compiler keeps acc[0..MR] in distinct vector registers — across the entire K loop they never touch memory. Each iteration does one broadcast load of A, one vector load of B, and MR fused multiply-adds: a very high ratio of arithmetic to memory traffic, which is exactly the goal.
Packing and the blocking loops
For the micro-kernel to read sequentially, you first pack the relevant panel of B (and tile of A) into small contiguous, aligned scratch buffers. The outer loops walk the matrix in cache-sized blocks, pack each block once, and call the micro-kernel across it:
The block sizes MC, NC, KC are chosen so the packed panels fit the L2/L3 caches; the micro-kernel tile MR×NR is chosen for the register file. This is the structure of every high-performance GEMM (the "Goto/BLIS" design). Getting it merely reasonable already beats the simple SIMD loop on large matrices; getting it optimal is where chip-specific tuning lives.
A textbook blocked GEMM gets you most of the way; closing on hand-tuned BLAS needs things we'll name but not chase: tile sizes derived from your exact cache sizes, software prefetch of the next panel, choosing MR×NR to match the register count, and avoiding write-back stalls. These are measured, not derived — and they're why llama.cpp ships per-architecture kernels. Build the blocked version, beat your Part 9 number, and know that the remaining gap is tuning, not a missing idea.
Blocking and packing reorder the arithmetic, so expect tiny floating-point differences — but the output must match your Chapter 9 linear to ~1e-4. Test on non-tile-aligned shapes (dimensions not multiples of MR/NR/block sizes) — the edge handling is where blocked GEMMs break. Keep the simple version as the oracle; you'll lean on it hard while tuning.
You've built the structure of a real GEMM: a register-blocked micro-kernel with accumulators that stay resident, cache-blocked packing loops around it, validated against your simple matmul. This is the engine of BLAS, demystified. Next, do the same memory-aware thinking for attention.
Fused (flash) attention
~1 weekOur attention computes all the scores, softmaxes them, then blends the values — three passes that materialize the full score vector (and, in prefill, an n×n matrix). FlashAttention's insight is that you never need to store the scores at all: with an online softmax, you can stream over the keys in one pass, keeping a running maximum and running sum, and fuse score → softmax → value-accumulate into a single loop. Less memory, fewer passes, better cache behavior — and it's exactly correct.
Online softmax
The trick is incremental, numerically-stable normalization. Maintain a running max m, a running denominator l, and a running weighted-value accumulator. When a new score arrives that exceeds the current max, rescale everything accumulated so far by exp(m_old − m_new) and carry on. At the end, divide by l. No score is ever stored; the result is identical to the three-pass version.
Softmax needs every term divided by the sum of exp(scoreⱼ − max), but you don't know the max until you've seen all scores. The online trick defers nothing: it keeps a provisional max and, whenever it grows, multiplies the already-accumulated sum and value-vector by exp(old_max − new_max) — precisely the correction needed to re-base them to the new max. Algebraically the final result equals the batch softmax to the last bit (modulo float reassociation). It's the same max-subtraction stability you learned in Chapter 10, made incremental.
Why it matters more in prefill
For single-token decode, the score vector is only len long, so the memory win is modest — but the fusion (one pass, no separate softmax buffer, values streamed alongside) still helps cache behavior. The big payoff is prefill and long context, where naive attention is O(n²) in memory for the score matrix. Tiling queries into blocks and applying this online softmax per block — the full FlashAttention — keeps memory flat and makes long-context attention practical. That's why every serious engine uses it.
Run flashHead and your Chapter 28 attention step on the same query and cache; the outputs must match to ~1e-5. Stress it: a key with a huge score (tests the rescale path), all-equal scores (uniform attention), and len = 1. If a large outlier score breaks it, your corr/m update is wrong. This is a kernel where a subtle bug yields plausible-but-off output, so pin it against the simple version.
Your attention is fused and single-pass with an online softmax — the FlashAttention core — validated against the straightforward version. You've applied the same memory-first thinking that drove the blocked matmul to the other half of the transformer. Now make the weights themselves production-grade with k-quants.
k-quants in-house (Q4_K, Q6_K)
~1–2 weeksChapter 39's legacy formats give every block of 32 weights a single scale — blunt, because a block whose values cluster away from zero wastes half its range. k-quants fix this with two ideas: the unit is a super-block of 256 weights, and each super-block is carved into sub-blocks that each get their own scale and minimum, themselves quantized to 6 bits and packed. The result is noticeably better accuracy per bit, which is why modern quantized downloads are almost always k-quants. These are the ggml formats, ported faithfully.
The byte layouts
Unpacking the 6-bit scales (Q4_K)
Q4_K's 12 scale-bytes hold eight 6-bit scales and eight 6-bit mins, bit-packed so nothing is wasted. This helper — a direct port of ggml's get_scale_min_k4 — extracts sub-block j's scale and min. Copy its bit arithmetic exactly; there is no intuitive version.
Q6_K
Q6_K combines a 4-bit low part (ql) and a 2-bit high part (qh) into a 6-bit value, centered by subtracting 32, with 16 signed int8 scales read directly. Two passes of 128 weights; the interleaving below is ggml's exact order.
k-quant unpacking is pure bit-shuffling with no intuition to fall back on — the index order, the nibble/high-bit pairing, the sub-scale stride are all load-bearing. A single wrong index gives a model that's 90% right with occasional wrong words, which is far harder to debug than total garbage. Validate by dequantizing a real k-quant tensor and diffing against the same tensor dequantized by llama.cpp/gguf tooling, element-by-element — not by listening to the output. Treat any nonzero diff as a bug.
As with Q8_0/Q4_0, dequant-on-load gets k-quant files running immediately. The speed win comes from keeping them packed and doing the matmul directly: quantize the activation row to 8-bit, do integer dot products per sub-block, and fold in the super-scale and sub-scale at the end. The structure mirrors Chapter 39's dotQ8, with the extra sub-scale layer. This is the format-specialized inner loop where comptime (Chapter 5) earns its keep — write the kernel generic over the format and let the compiler stamp out a tight, branch-free routine per format, exactly as ggml does with C macros.
You can load and run the k-quant formats modern models actually ship in — Q4_K and Q6_K — fully in-house, with the exact ggml byte layouts and a validation path against the reference. Combined with Q8_0/Q4_0 from Part 9, your engine reads essentially the whole quantization zoo. Now assemble all of this into a real engine.
The inference engine
~1–2 weeksFast kernels are necessary but not sufficient; a production engine is also about memory and lifecycle. Three architectural moves separate a demo from an engine: memory-map the weights instead of reading them, split prefill from decode because they have opposite bottlenecks, and run a zero-allocation steady state. None is exotic; together they're the difference.
Memory-map the weights
readFileAlloc copies the whole model into RAM — fine for GPT-2, ruinous for a multi-gigabyte model, and wasteful because you then hold both the file and a dequantized copy. Instead mmap the file: the OS maps it into your address space and pages in only what you touch, with no copy. Keep the weights in their on-disk quantized form and dequantize per-tile inside the matmul. Two processes running the same model even share the pages.
Now your loader records, per tensor, just an offset and dtype into the mapped blob — no allocation, near-instant "load." The quantized bytes stay on the page; the matmul dequantizes a block at a time into a small reusable buffer (or, better, consumes them packed). This is how llama.cpp opens a 40 GB model in a blink.
Prefill vs decode: opposite bottlenecks
The two phases want different things. Prefill ingests the whole prompt at once: a tall matmul (large M), compute-bound, perfectly suited to the blocked, multi-threaded GEMM of Chapter 41 — parallelize over rows and columns and saturate the cores. Decode generates one token at a time: M = 1, memory-bound, where the win is reading fewer weight bytes (quantization) and a tight KV-cache walk, not raw FLOPs. An engine treats them as distinct code paths.
It's why a long prompt has a one-time "thinking" pause (prefill, compute-bound) and then streams tokens at a steady rate (decode, memory-bound). It's why quantization barely speeds prefill but clearly speeds decode. It's why batching many users' requests together helps throughput (it turns many M=1 decodes into one tall, compute-bound matmul). Internalizing "prefill is compute-bound, decode is memory-bound" lets you predict and optimize an LLM's behavior instead of being surprised by it.
A zero-allocation steady state and a quantized KV cache
In the decode loop, allocate nothing: preallocate every scratch buffer once (sized to the model and context) and reuse it each token. No allocator in the hot path means no fragmentation, no surprise latency. And quantize the KV cache itself — store cached keys and values as int8 with a per-row scale — to roughly halve the cache's memory and the bandwidth you spend reading it each step. At long context the KV cache, not the weights, dominates memory, so this is a real win.
The mmap call above is POSIX (macOS, Linux); Windows uses CreateFileMapping/MapViewOfFile, and Zig's std exposes platform pieces that shift between versions. Wrap it behind a small interface with a readFileAlloc fallback so the engine still runs everywhere, just using more RAM where mmap isn't wired up. And remember mmap'd memory is read-only and backed by the file — never write through it, and keep the file handle's lifetime at least as long as the mapping.
You have an engine, not just kernels: weights memory-mapped and kept quantized, a compute-bound prefill path and a memory-bound decode path, a zero-allocation hot loop, and a quantized KV cache. This is the architecture real inference engines are built on. One chapter left — an honest map of what still separates you from the very best.
The last mile
open-endedYou've built a fast, quantized, memory-mapped, multi-threaded CPU inference engine with fused attention and a real GEMM. That is genuinely most of what llama.cpp is. Honesty matters here, though: a few things still separate your engine from the state of the art, and knowing exactly what they are is the difference between "I built a toy" and "I understand the production frontier and could close it."
Architecture-specific micro-kernels
Zig's @Vector gets you perhaps 80% of peak; the last 20% is hand-tuned per instruction set. Modern CPUs have integer dot-product instructions made for exactly quantized inference — ARM's sdot/i8mm, x86's AVX-512 VNNI — that do a whole int8 dot in one op. Production engines ship a micro-kernel per architecture (sometimes in assembly) to use them. You can reach these from Zig with target-specific builtins or inline assembly; it's effort with diminishing, real returns.
The GPU
The single largest remaining speedup isn't on the CPU at all. Offloading the matmuls to a GPU via Metal (Apple), CUDA (NVIDIA), or Vulkan compute is often an order of magnitude faster for prefill and large models. The kernels you wrote map directly onto GPU compute shaders — same math, massively parallel hardware. This is a project on the scale of Part 9 and 10 combined, and it's where you'd go for serious throughput.
Smarter decoding and serving
Two frontiers beyond raw kernel speed. Speculative decoding: a small, fast "draft" model proposes several tokens and the big model verifies them in parallel in one forward pass — a 2–3× decode speedup with identical output, because decode is memory-bound and verification reuses the same weight read. Serving many users: continuous batching (merge requests into shared matmuls), paged attention (the vLLM idea — a virtual-memory-like KV cache that packs many sequences efficiently), and prompt caching (reuse prefill across requests with a shared prefix). These are systems problems layered on the engine you built.
And the rest of the zoo
Mixture-of-Experts routing for sparse models (only run a few of many MLPs per token), activation quantization with calibration, LoRA adapters merged at load, structured/streaming output, tool-calling harnesses. Every one of them sits on top of the forward pass and kernels you now own. None replaces the core; they extend it.
Step back. You started not knowing Zig and now have a CPU LLM engine with a blocked GEMM, fused flash attention, the full quantization stack including k-quants, memory-mapped weights, a quantized KV cache, and a prefill/decode architecture — running real GPT-2 and Gemma, with a clear path to Whisper. The gap to llama.cpp is now specific and nameable: per-ISA micro-kernels, a GPU backend, and serving infrastructure. That's not a gap of understanding — it's a gap of engineering effort on foundations you've already laid. Very few people who run these models can say that.
From a single dot product to a production-grade inference engine, built kernel by kernel in a language you learned from scratch — GPT-2 and Gemma generating fluently, Whisper mapped, the matmul blocked, attention fused, the weights k-quantized and memory-mapped. You can point at any number anywhere in the stack and say what it is and why. That understanding is the whole point, and it will transfer to every model, every framework, and every optimization you meet for the rest of your career. The appendices that follow are your permanent reference. Now go build it.
Appendices
Permanent lookup material: a Zig cheat sheet for when syntax slips, the minimal math in one place, the exact model shapes and file formats, how to build the validation oracles every chapter leans on, and a symptom-to-cause table for when something breaks.
Zig cheat sheet
The Zig you need for this book, on one page. For everything else, the language reference and std docs are excellent.
Declarations & types
| Code | Meaning |
|---|---|
const x = 3; | immutable binding (preferred) |
var x: f32 = 3; | mutable variable, typed |
u8 u32 usize i32 f32 f64 bool | sized number types; usize for lengths/indices |
@floatFromInt(n) / @intFromFloat(x) | explicit int↔float conversion (required!) |
@as(T, value) | name the target type |
@intCast(x) | convert between integer types |
Arrays, slices, pointers
| Code | Meaning |
|---|---|
[_]f32{ 1, 2, 3 } | array, length inferred |
[]f32 / []const f32 | slice (view): writable / read-only |
arr[a..b] / arr[a..] | sub-slice |
&arr | slice of a whole array |
slice.len | number of elements |
m[r*cols + c] | element (r,c) of a flat row-major matrix |
for (s) |*x| x.* = v; | mutate elements in place (capture by pointer) |
Control flow & functions
Errors, optionals, memory
Structs, enums, comptime, tests
If older tutorials don't compile, these are the usual changes: ArrayList is now unmanaged by default (allocator passed to each method; create with .empty); the build system uses root_module/createModule; the reader/writer interfaces were reworked in the 0.15/0.16 "Writergate"; and the debug allocator may be GeneralPurposeAllocator or DebugAllocator. When in doubt, run zig init and diff. Use std.debug.print(fmt, .{args}) for all development output.
The math you actually need
Every mathematical idea in this book, stated plainly in one place. None of it is more advanced than it looks here.
Vectors & the dot product
A vector is a list of numbers. The dot product of two equal-length vectors is the sum of their elementwise products: a·b = Σ aᵢbᵢ — one number out. It measures alignment: large positive when the vectors point the same way, zero when perpendicular, negative when opposed (a·b = |a||b|cos θ). It's the atom of neural networks: linear layers, attention scores, and output logits are all dot products. (Chapter 7.)
Matrices & matmul
A matrix is a grid of numbers; we store it as one flat array, row-major, element (r,c) at index r·cols + c. Matrix multiply C = A·B sets C[i,j] = (row i of A) · (column j of B); the inner dimensions must match and the outer ones survive ([M,K]·[K,N] = [M,N]). A linear layer applies y = x·Wᵀ + b; storing W as [out,in] makes each output a dot product of the input with one weight row. Cost is ~2·M·N·K operations — the model's dominant expense. (Chapter 9.)
Softmax
Softmax turns a vector of scores into a probability distribution: softmax(x)ᵢ = exp(xᵢ) / Σ exp(xⱼ) — all positive, summing to 1. Always compute it after subtracting the max element (exp(xᵢ − max)); this changes nothing mathematically but prevents exp overflow. Used inside attention (over keys) and at the output (over the vocabulary). (Chapter 10.)
Derivatives & gradient descent
A derivative is a slope: how much the output changes when you nudge an input. The gradient is the vector of partial derivatives over all parameters; it points uphill on the loss. Gradient descent steps the opposite way: w ← w − lr·∂L/∂w, with lr the learning rate. Backpropagation is the chain rule applied mechanically backward through the computation to get every parameter's gradient at once. (Chapters 12, 14.)
Cross-entropy
The loss for predicting a category. With model probabilities p (from softmax) and the correct class y, cross-entropy is L = −log(p_correct) — small when confident and right, huge when confident and wrong. Its gradient with respect to the logits is beautifully simple: p − y (the probabilities minus the one-hot target). This is the loss every language model is trained with; the "class" is the next token. Perplexity = exp(loss), an intuitive "effective number of choices." (Chapter 13.)
The handful of constants
| Where | Quantity | Typical value |
|---|---|---|
| Attention | score scale | 1/√head_dim (Gemma: 1/√query_pre_attn_scalar) |
| LayerNorm / RMSNorm | epsilon | 1e-5 (GPT-2) / 1e-6 (Gemma) |
| RoPE | base θ | 10000 (Gemma 3 global: 1e6) |
| MLP | hidden expansion | 4×d (GPT-2); GeGLU sizes vary |
| Head dimension | d_model / n_head | 64 (GPT-2); 256 (Gemma, decoupled) |
Model shapes & file formats
The numbers and layouts your loader reads. As always, prefer reading these from the model's own config.json over hard-coding them.
GPT-2 family
| Model | d_model | Heads | Layers | Ctx | Vocab | ~Params |
|---|---|---|---|---|---|---|
| gpt2 | 768 | 12 | 12 | 1024 | 50257 | 124 M |
| gpt2-medium | 1024 | 16 | 24 | 1024 | 50257 | 355 M |
| gpt2-large | 1280 | 20 | 36 | 1024 | 50257 | 774 M |
| gpt2-xl | 1600 | 25 | 48 | 1024 | 50257 | 1558 M |
All use learned positional embeddings, pre-LayerNorm (eps 1e-5), tanh-approx GELU, MLP hidden 4×d, tied output head, attention biases present, and Conv1D weights stored transposed ([in,out]). Head dim is 64 throughout.
Gemma family
| Model | d_model | Q / KV heads | head_dim | Layers | Vocab |
|---|---|---|---|---|---|
| Gemma 2 2B | 2304 | 8 / 4 | 256 | 26 | 256000 |
| Gemma 2 9B | 3584 | 16 / 8 | 256 | 42 | 256000 |
| Gemma 3 1B | 1152 | 4 / 1 | 256 | 26 | 262144 |
Common to Gemma: RMSNorm with (1+weight) (eps 1e-6), GeGLU (tanh GELU), RoPE, GQA, no biases, tied head, embedding scaled by √d_model, weights stored in standard [out,in] layout (no transpose). Gemma 2 adds soft-capping (attn 50, final 30) and a pre+post norm sandwich; Gemma 3 uses QK-norm instead of soft-capping and a 5-local:1-global attention pattern with RoPE base 1e6 on global layers. Confirm exact dims (especially intermediate_size and KV-head counts) against the model's config.json.
safetensors layout
The GPT-2 BPE pre-tokenization regex
Applied to split text into chunks before BPE merges. Match it exactly (with Unicode property support) for token parity with reference tokenizers.
It keeps a leading space attached to the following word (so " the" is one chunk), isolates runs of letters, runs of digits, and runs of punctuation, and handles trailing whitespace. GPT-4-era tokenizers use a more elaborate pattern, but this is GPT-2's.
Building oracles
Every "validate against the reference" callout leans on an oracle. Two kinds suffice. NumPy is the kernel oracle — three lines reproduce any matmul, softmax, or norm to check a single function. PyTorch / transformers is the model oracle — it lets you dump any intermediate tensor (a block's output, the logits) and diff it against your Zig at that exact point. When generation is wrong, the model oracle tells you which stage first diverged, which is the entire debugging game.
Set up a throwaway environment
Dump GPT-2 intermediates
For Gemma, swap in AutoModelForCausalLM.from_pretrained("google/gemma-2-2b") and the matching tokenizer. For tokenizer parity specifically, tiktoken is the fastest cross-check: import tiktoken; tiktoken.get_encoding("gpt2").encode("hello world").
Read a .npy file in Zig
A .npy file is a tiny ASCII header followed by the raw array. This parser handles the case you'll dump: little-endian float32, C-contiguous — enough to load any reference above and run maxAbsDiff against your output.
When a model's output is wrong, don't stare at all of it — bisect. Dump embeddings, the output after block 0, and the final logits from Python. In a Zig test, run your pipeline to each point and assert maxAbsDiff < tol (≈1e-3–1e-2, looser for bf16 models). The first stage that fails is where your bug lives. You've turned "the text is wrong" into "block 0's output is off in dimension 40," which is a fixable sentence. Every hard bug in this book yields to this one move.
Troubleshooting — symptom to cause
The bugs you'll actually hit, and where to look first. Most "the model is broken" reports are one of these.
| Symptom | Likely cause |
|---|---|
| Compile error: incompatible types 'usize' and 'f32' | An integer needs an explicit conversion. Wrap with @floatFromInt / @intFromFloat (Chapter 2). |
| GPT-2 emits confident garbage, but loads fine | Forgot to transpose the Conv1D weights (c_attn, c_proj, c_fc), or the Q/K/V split offsets are wrong. The single most common GPT-2 bug (Chapter 26). |
| Totally garbage from every model | Loader bug: wrong safetensors offsets or dtype handling. Spot-check the first few values of one tensor against Python (Chapter 25, Appendix D). |
| NaNs appear after a few tokens | Softmax without max-subtraction overflowing exp, or reading an uninitialized KV-cache slot. Subtract the row max; zero/initialize the cache (Chapters 10, 28). |
| Output ~90% right, occasional wrong words | A subtle constant: wrong GELU (erf vs tanh), wrong LayerNorm epsilon, or attention scale ≠ 1/√head_dim. Diff against the oracle, not by ear (Chapters 10, 26). |
| Generation loops or drones the same phrase | Greedy decoding. Add temperature and top-p/top-k sampling (Chapter 19). |
| Gemma garbage, GPT-2 fine | A missing Gemma fingerprint: embedding scale √d, RMSNorm (1+w), soft-capping, or the attention scale from query_pre_attn_scalar (Chapter 32). |
| Modern model: attention subtly off | Wrong RoPE convention — adjacent-pair vs split-half. Match the reference's rotate_half (Chapter 30). |
| KV-cache generation differs from the slow path | Off-by-one in cache.len (append before vs after attending), or reading cache[len] before writing it (Chapter 28). |
| Tokenizer ids don't match the reference | Missing the pre-tokenization regex or the byte-level mapping. Token parity is non-negotiable (Chapter 18, Appendix C). |
Leak report at deinit | A missing free, or an ArrayList deinit'd with the wrong allocator. Pair every alloc with a defer free; use an arena for scratch (Chapter 4). |
| Everything is correct but painfully slow | You're in a Debug build. Benchmark only in -Doptimize=ReleaseFast (Chapters 27, 36). |
build.zig won't compile after a Zig update | Version drift. Run zig init in a scratch folder and reconcile against its generated files (Chapters 1, 6, Appendix A). |
| Crash/garbage on long inputs | Sequence exceeded n_ctx (GPT-2's positional table is only 1024). Cap or window the input (Chapter 27). |
When something breaks, don't debug the whole pipeline — bisect it with Appendix D. Find the first tensor that diverges from the oracle and you've cut the search from "thirty-six chapters of code" to one function. Every hard bug in this book yields to that one move.
The language, the kernels, training, tokenization, the transformer, real GPT-2 and Gemma inference, the bridge to Whisper, performance, a CLI — and the oracles and tables to keep it all honest. Go build it, a kernel at a time. When you finish, you'll understand these models better than almost anyone who merely runs them. That was the whole point.