Inside LLMs

01 - Tokenization

Text gets broken into tokens

Before anything else, your text is split into subword chunks called tokens. The model never sees raw characters; it sees a sequence of token IDs, each identifying one token in a fixed vocabulary, often around 50,000 entries or more.
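To make the split concrete, here is a minimal sketch of a subword tokenizer. The vocabulary (`PIECES`, `VOCAB`) and the function names `tokenize` and `encode` are invented for illustration, and the greedy longest-match rule is a simplification: production tokenizers typically use learned merge rules such as byte-pair encoding.

```python
import string

# Hypothetical toy vocabulary: a few subword pieces plus single letters as a
# fallback. A real vocabulary is learned from data and is far larger.
PIECES = ["un", "break", "able", "token", "ization", " "]
VOCAB = {piece: i for i, piece in enumerate(PIECES + list(string.ascii_lowercase))}


def tokenize(text):
    """Split text into vocabulary pieces using greedy longest-match."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until one is in the vocabulary.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                tokens.append(piece)
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return tokens


def encode(text):
    """Map text to the list of token IDs the model would actually see."""
    return [VOCAB[token] for token in tokenize(text)]


print(tokenize("unbreakable tokenization"))
# ['un', 'break', 'able', ' ', 'token', 'ization']
print(encode("unbreakable tokenization"))
# [0, 1, 2, 5, 3, 4]
```

A word with no matching piece, such as "cat", falls back to single letters, which is why subword vocabularies can represent text they have never seen.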

02 - Embedding

Token IDs become dense vectors

Each token ID is looked up in a large table (the embedding matrix) to produce a vector of floats, typically with hundreds to thousands of dimensions. A vector is simply a list of numbers, and here each number is a decimal value such as 0.1, -0.5, or 0.8, called a float. Each dimension holds one of these values. After training, semantically similar words land near each other. This is where meaning first enters the model.
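A small sketch of the lookup, and of what "near each other" means. The table `EMBEDDINGS`, its hand-picked values, and the three words are all assumptions made for illustration; real embedding tables are learned during training. Cosine similarity is one common way to measure closeness, not the only one.

```python
import math

# Hypothetical 4-dimensional embedding table indexed by token ID.
# The values are hand-picked so that "cat" and "dog" point in similar directions.
EMBEDDINGS = [
    [0.9, 0.1, 0.0, 0.2],  # ID 0: "cat"
    [0.8, 0.2, 0.1, 0.1],  # ID 1: "dog"
    [0.0, 0.9, 0.8, 0.1],  # ID 2: "paris"
]


def embed(token_id):
    """The embedding step is a plain table lookup: one row per token ID."""
    return EMBEDDINGS[token_id]


def cosine_similarity(a, b):
    """Return a value near 1.0 when two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat, dog, paris = embed(0), embed(1), embed(2)
print(cosine_similarity(cat, dog))    # high: both are animals in this toy table
print(cosine_similarity(cat, paris))  # low: an animal versus a place
```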

Figure: embedding space with clusters for animals, places, verbs, and emotions.

03 - Architecture

Stacked transformer layers

The embedded vectors pass through N transformer layers, each refining the representation. The stages run in this order:

Input tokens

Raw text → token IDs

Embedding + positional encoding

IDs → vectors with position info

Transformer block x N

Self-attention + feed-forward + layer norm

N is roughly 32 to 128 layers in large models

Output projection

Full context → one score per vocabulary token

Softmax → sample

Scores → probabilities → next token
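The last two stages can be sketched in a few lines. The four-token vocabulary and the scores below are made up, and the seeded `random.Random` is only there to make the run repeatable; this is an illustration of the idea, not how any particular model is implemented.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    largest = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output of the projection step: one score per vocabulary token.
vocab = ["mat", "dog", "moon", "sofa"]
logits = [3.0, 0.5, -1.0, 2.0]

probs = softmax(logits)

# Sample the next token in proportion to its probability.
rng = random.Random(0)
next_token = rng.choices(vocab, weights=probs, k=1)[0]

print([round(p, 3) for p in probs])
print(next_token)
```

Because the choice is sampled rather than always taking the top score, the same prompt can produce different continuations on different runs.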

06 - Training

Learning from prediction error

During pretraining, the model learns entirely by trying to predict the next token, checking how wrong the prediction was, and adjusting. This loop repeats over trillions of tokens of text, and language understanding emerges from it.

Forward pass

The input runs through every layer from embedding to output. The model produces a probability for every token in the vocabulary as the predicted next token.

Loss calculation

The prediction is compared with the token that actually came next, measuring how little probability the model assigned to it. The bigger the miss, the higher the resulting score, called the loss (in practice usually cross-entropy, the negative log of that probability), and the larger the correction applied to the weights.
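As a rough illustration, here is the cross-entropy loss for a single prediction. The two probability lists are invented, and `cross_entropy` is a name chosen for this sketch.

```python
import math


def cross_entropy(probs, target_index):
    """Loss for one prediction: the negative log of the probability
    the model assigned to the token that actually came next."""
    return -math.log(probs[target_index])


target = 1  # index of the token that actually came next

confident = [0.05, 0.90, 0.05]  # model put 90% on the right token
unsure = [0.40, 0.30, 0.30]     # model put only 30% on the right token

print(cross_entropy(confident, target))  # about 0.105
print(cross_entropy(unsure, target))     # about 1.204
```

The loss is 0 only when the model assigns probability 1 to the correct token, and it grows without bound as that probability approaches 0.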

Backpropagation

Working backwards through every layer of the network, the training algorithm calculates how much each individual weight contributed to the error. This produces one gradient value per weight, billions in total, each indicating the direction and size of the adjustment needed.
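A minimal sketch of working backwards through a network, assuming a deliberately tiny "model" with two weights and a squared-error loss instead of cross-entropy, so the arithmetic stays readable. The names `forward` and `backward` are chosen for this example. The hand-derived gradients are checked against a numerical estimate.

```python
def forward(w1, w2, x, target):
    """Two stacked 'layers', each a single multiplication, then a loss."""
    h = w1 * x                # layer 1
    y = w2 * h                # layer 2
    loss = (y - target) ** 2  # squared error (an assumption for this sketch)
    return h, y, loss


def backward(w1, w2, x, target):
    """Apply the chain rule from the loss back toward the input."""
    h, y, _ = forward(w1, w2, x, target)
    d_y = 2 * (y - target)  # how the loss changes with the output
    d_w2 = d_y * h          # layer 2 weight gradient
    d_h = d_y * w2          # pass the error signal back to layer 1
    d_w1 = d_h * x          # layer 1 weight gradient
    return d_w1, d_w2


w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)
print(grad_w1, grad_w2)  # 15.0 -5.0

# Sanity check: nudge w1 slightly and see how the loss really changes.
eps = 1e-6
loss_up = forward(w1 + eps, w2, x, target)[2]
loss_down = forward(w1 - eps, w2, x, target)[2]
numeric_w1 = (loss_up - loss_down) / (2 * eps)
print(numeric_w1)  # close to 15.0
```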

Weight update

Each weight is nudged slightly in the direction that would have reduced the loss. An optimizer, commonly Adam, scales each step using running averages of recent gradients, and the cycle begins again on the next batch of text with the updated weights.
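To show what one update looks like, here is a sketch of the Adam rule applied to a single weight. The toy objective (w - 3)², the learning rate, and the step count are assumptions picked for the demonstration; real training applies the same rule to every weight at once, using gradients from backpropagation.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight.

    m and v are running averages of the gradient and the squared gradient;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # correct the bias toward zero in early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy objective: loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t)

print(w)  # ends close to 3, the value that minimizes the loss
```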

Figure: training loss over time.

07 - Q, K, V

The mechanics of attention

Every token produces three vectors via learned weight matrices. Their interaction determines how meaning flows across the sequence.

Produce Q, K, V

Q = x·Wq,  K = x·Wk,  V = x·Wv

Each token's embedding x is multiplied by three separate learned weight matrices, producing three vectors:

Q — Query: "What am I looking for?" Each token uses its Q to search for relevant information in other tokens.

K — Key: "What do I offer?" Each token broadcasts its K so others can decide how relevant it is to their query.

V — Value: "Here's what I'll contribute." If a token is deemed relevant, its V is what actually gets passed along.
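The three projections above are ordinary vector-times-matrix products. In this sketch the 2-dimensional input `x` and the matrices `Wq`, `Wk`, `Wv` are tiny hand-written stand-ins for learned weights, and `matvec` is a helper defined only for the example.

```python
def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in rows, d_out columns)."""
    return [
        sum(x[i] * W[i][j] for i in range(len(x)))
        for j in range(len(W[0]))
    ]


# Hypothetical token vector and weight matrices; real ones are learned.
x = [1.0, 2.0]
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.0], [0.0, 0.5]]
Wv = [[0.0, 1.0], [1.0, 0.0]]

q = matvec(x, Wq)  # query: what this token is looking for
k = matvec(x, Wk)  # key: what this token offers
v = matvec(x, Wv)  # value: what this token contributes if attended to

print(q)  # [1.0, 2.0]
print(k)  # [0.5, 1.0]
print(v)  # [2.0, 1.0]
```

The same input produces three different vectors because each matrix is learned for a different role.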

Compute attention scores

score(i,j) = Qᵢ · Kⱼ / √d

Every token's Q is compared against every token's K (its own included) using a dot product, which measures how well they match. The result is a raw score representing how much attention token i should pay to token j. Dividing by √d, where d is the dimension (size) of the key vectors, keeps the scores from growing so large that softmax saturates and training becomes unstable.
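A short sketch of the score computation for a two-token sequence. The `Q` and `K` rows are invented values with d = 4, and `attention_scores` is a name chosen for this example.

```python
import math


def attention_scores(Q, K):
    """Return scores[i][j] = (Q[i] . K[j]) / sqrt(d) for every pair of tokens."""
    d = len(K[0])  # dimension of each key vector
    return [
        [sum(q * k for q, k in zip(Qi, Kj)) / math.sqrt(d) for Kj in K]
        for Qi in Q
    ]


# Hypothetical queries and keys for two tokens.
Q = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 0.0]]
K = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]

scores = attention_scores(Q, K)
print(scores)  # [[1.0, 0.0], [0.0, 1.0]]
```

Row i holds token i's scores against every token in the sequence, so a sequence of n tokens yields an n by n grid.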

Normalize with softmax

αᵢⱼ = exp(score(i,j)) / Σₖ exp(score(i,k))

The raw scores are run through a function called softmax, which maps each row of scores to values between 0 and 1 that sum to 1. This turns "token j scored higher than token k" into concrete percentages, indicating how much of its attention token i should direct at each token.

Weighted sum of Values

outputᵢ = Σⱼ αᵢⱼ · Vⱼ

Each token's final output is a weighted blend of every token's V, its own included. Tokens that scored high in the attention step contribute more. In a sentence about a river, for example, "bank" borrows heavily from "river", so its output vector now carries the meaning of a riverbank rather than a financial institution. This is how context collapses into meaning.
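Putting the steps together, here is a sketch of single-head attention over a two-token sequence. All vectors are hand-picked so that the query for "bank" lines up with the key for "river"; in a real model these come from learned projections, and the function names are chosen for this example.

```python
import math


def softmax(scores):
    """Turn one row of scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """For each token i: score against every key, softmax, then blend the values."""
    d = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        blended = [
            sum(w * v[c] for w, v in zip(weights, V))
            for c in range(len(V[0]))
        ]
        outputs.append(blended)
    return outputs


# Hypothetical two-token sequence: index 0 is "river", index 1 is "bank".
Q = [[0.0, 0.0],   # "river" has no strong preference in this toy setup
     [4.0, 0.0]]   # "bank" is looking for something that matches "river"
K = [[2.0, 0.0],   # key for "river"
     [0.0, 2.0]]   # key for "bank"
V = [[1.0, 0.0],   # value carried by "river"
     [0.0, 1.0]]   # value carried by "bank"

outputs = attention(Q, K, V)
print(outputs[0])  # even blend of both values
print(outputs[1])  # dominated by the value of "river"
```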

08 - RLHF

Making it helpful

A pretrained model is good at predicting text, not at following instructions. Reinforcement Learning from Human Feedback (RLHF) fine-tunes it to be helpful, honest, and safe.

Pretrained base model

Knows language deeply. Will complete any text, including harmful text.

Supervised fine-tuning

Train on human-written ideal responses to build an instruction-following baseline.

Reward model training

Humans rank model outputs. A separate reward model learns to predict human preference.

RL fine-tuning (PPO)

The policy model is updated to maximize the reward signal, with outputs humans prefer scoring higher.

Deployed assistant

A model that generates text that is useful, safe, and aligned with human intent.
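For the reward model training stage above, one common formulation is a pairwise loss: given two responses where humans preferred one, the reward model is penalized when it scores the preferred response lower. The sketch below assumes that formulation; the source does not say which loss is used, and the reward values are invented.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Small when the human-preferred response gets the higher reward,
    large when the ranking is the wrong way around.
    """
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))  # equals -log(sigmoid(diff))


# Hypothetical reward scores for a preferred and a rejected response.
print(preference_loss(2.0, 0.0))  # about 0.127: ranking agrees with humans
print(preference_loss(0.0, 2.0))  # about 2.127: ranking disagrees
```

Minimizing this loss over many ranked pairs pushes the reward model to score outputs the way human raters would, which is the signal the RL stage then maximizes.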
