Inside LLMs

From raw text to generated language - a visual walkthrough of every step inside a large language model.

01 - Tokenization

Text gets broken
into tokens

Before anything else, your text is split into subword chunks called tokens. The model never sees raw characters - it sees a sequence of token IDs, each an index into a fixed vocabulary. Vocabulary size varies by model, from roughly 50,000 entries (GPT-2 has 50,257) to a few hundred thousand.
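As a rough illustration of splitting text into subwords and IDs, here is a minimal sketch. It assumes a tiny hand-written vocabulary (TOY_VOCAB, invented for this example) and a greedy longest-match rule; real tokenizers learn their vocabulary from data, often with byte-pair encoding, and will split text differently.

```python
# Toy vocabulary: token string -> token ID. Entirely made up for illustration;
# real vocabularies are learned from data (for example with byte-pair encoding).
TOY_VOCAB = {
    "token": 0, "ization": 1, "un": 2, "break": 3, "able": 4,
    "t": 5, "o": 6, "k": 7, "e": 8, "n": 9, "s": 10,
}
UNK_ID = -1  # hypothetical ID for characters missing from the vocabulary


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of `text` into (subword, token ID) pairs."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                pieces.append((piece, vocab[piece]))
                i = j
                break
        else:
            # No vocabulary entry starts here: emit an unknown marker.
            pieces.append((text[i], UNK_ID))
            i += 1
    return pieces


print(tokenize("tokenization"))  # [('token', 0), ('ization', 1)]
print(tokenize("unbreakable"))   # [('un', 2), ('break', 3), ('able', 4)]
print(tokenize("tokens"))        # [('token', 0), ('s', 10)]
```

The model only ever receives the integer IDs; the strings are kept here so the split is easy to read.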

02 - Embedding

Token IDs become
dense vectors

Each token ID is looked up in a large table to produce a vector: a list of floating-point numbers (floats) such as 0.1, -0.5, 0.8. Each number is one dimension, and a model's vectors typically have thousands of dimensions. After training, semantically similar words land near each other in this space. This is where meaning first enters the model.

Example clusters: animals, places, verbs, emotions
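To make "similar words land near each other" concrete, here is a small sketch of an embedding lookup followed by a cosine-similarity comparison. The table, its 3-dimensional vectors, and the words attached to each ID are all invented for illustration; a trained model learns these values and uses far more dimensions.

```python
import math

# Hypothetical embedding table: token ID -> vector of floats.
# The values are invented; real tables are learned during training.
EMBEDDINGS = {
    0: [0.9, 0.1, 0.0],   # "cat"
    1: [0.8, 0.2, 0.1],   # "dog"
    2: [0.0, 0.1, 0.9],   # "Paris"
}


def embed(token_ids):
    """Look up one vector per token ID."""
    return [EMBEDDINGS[t] for t in token_ids]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat, dog, paris = embed([0, 1, 2])
print(cosine_similarity(cat, dog) > cosine_similarity(cat, paris))  # True
```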
03 - Architecture

Stacked transformer
layers

The embedded vectors pass through N transformer layers, each refining the representation.
Input tokens
Raw text → token IDs
Embedding + positional encoding
IDs → vectors with position info
Transformer block x N
Self-attention + feed-forward + layer norm
x 32 to 128 layers
Output projection
Final hidden state → one score per vocabulary token
Softmax → sample
Scores → probabilities → next token
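The stacking itself can be sketched in a few lines. This is a simplified outline, not any particular model's implementation: it assumes a pre-norm layout with residual connections, omits learned parameters in the layer norm, and uses two stand-in sublayers (mean_attention and relu_ffn, both hypothetical) in place of real self-attention and feed-forward networks. Real models also give every layer its own weights rather than reusing the same functions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def transformer_block(xs, attention, feed_forward):
    """One pre-norm block: each sublayer's output is added back to its input."""
    normed = [layer_norm(x) for x in xs]
    attended = attention(normed)                  # mixes information across positions
    xs = [[a + b for a, b in zip(x, d)] for x, d in zip(xs, attended)]
    normed = [layer_norm(x) for x in xs]
    updated = [feed_forward(x) for x in normed]   # applied to each position separately
    return [[a + b for a, b in zip(x, d)] for x, d in zip(xs, updated)]


def run_stack(xs, n_layers, attention, feed_forward):
    """Pass the sequence of vectors through n_layers blocks in order."""
    for _ in range(n_layers):
        xs = transformer_block(xs, attention, feed_forward)
    return xs


# Stand-in sublayers, purely illustrative.
def mean_attention(xs):
    avg = [sum(col) / len(xs) for col in zip(*xs)]
    return [avg for _ in xs]


def relu_ffn(x):
    return [max(0.0, v) for v in x]


hidden = run_stack([[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]], 4, mean_attention, relu_ffn)
print(len(hidden), len(hidden[0]))  # 2 3
```

The shape is unchanged from input to output: every block takes one vector per token and returns one vector per token, which is what lets the blocks be stacked N times.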
04 - Self-Attention

Tokens attend
to each other

Every token looks at the other tokens in the sequence and decides how much to "borrow" from each. In GPT-style models, a token can look only at itself and the tokens that come before it.
05 - Generation

One token
at a time

The model generates text autoregressively: it produces one token, appends it to the input, then runs the whole forward pass again. Every token is a fresh prediction.
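The loop described above can be sketched as follows. The "model" here is a hypothetical lookup table (TOY_NEXT_TOKEN_SCORES) that only looks at the last token, and the sketch always picks the highest-scoring token (greedy decoding) instead of sampling; a real model conditions on the whole sequence and usually samples from the probabilities.

```python
# Hypothetical stand-in for a model: maps the last token to scores for the
# next token. A real model would condition on the entire sequence.
TOY_NEXT_TOKEN_SCORES = {
    "the": {"cat": 2.0, "mat": 1.0},
    "cat": {"sat": 3.0, "the": 0.5},
    "sat": {"on": 2.5},
    "on": {"the": 2.0},
}


def next_token(tokens):
    """Return the highest-scoring next token (greedy decoding), or None to stop."""
    scores = TOY_NEXT_TOKEN_SCORES.get(tokens[-1])
    if not scores:
        return None
    return max(scores, key=scores.get)


def generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)   # one prediction from the current sequence
        if token is None:
            break
        tokens.append(token)         # the output becomes part of the next input
    return tokens


print(generate(["the"], 4))  # ['the', 'cat', 'sat', 'on', 'the']
```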

PROMPT
The transformer architecture was first introduced in
06 - Training

Learning from
prediction error

The model learns entirely by trying to predict the next token, checking how wrong the prediction was, and adjusting its weights. This loop repeats over an enormous number of predictions, covering trillions of tokens in total, and from it emerges language understanding.

01
Forward pass
The input runs through every layer from embedding to output. The model produces a probability for every token in the vocabulary as the predicted next token.
02
Loss calculation
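The model compares its prediction with the actual next token and measures how low a probability it assigned to that token. The bigger the miss, the higher the resulting score, called the loss (in practice the negative log of that probability, known as cross-entropy). A larger loss produces larger gradients, and so larger corrections to the weights.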
03
Backpropagation
Working backwards through every layer of the network, backpropagation calculates how much each individual weight contributed to the error. This produces one gradient value per weight (billions in total), indicating the direction and magnitude of the adjustment needed.
04
Weight update
Each weight is nudged slightly in the direction that reduces the loss. An optimizer such as Adam handles this by adapting the step size for each weight, and the cycle begins again on the next batch of training text, now with the updated weights.
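The four steps can be compressed into a toy example. This sketch makes two simplifying assumptions that differ from real training: the only "weights" are four scores (logits) for a four-token vocabulary, and the update is plain gradient descent with a fixed learning rate rather than Adam. The gradient formula used is the standard one for softmax combined with cross-entropy.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def training_step(logits, target, lr=0.5):
    """One forward pass, loss, gradient, and plain gradient-descent update."""
    probs = softmax(logits)                       # 01 forward pass
    loss = -math.log(probs[target])               # 02 cross-entropy loss
    # 03 gradient: d(loss)/d(logit_k) = prob_k - 1 if k is the target, else prob_k
    grads = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    new_logits = [z - lr * g for z, g in zip(logits, grads)]   # 04 weight update
    return new_logits, loss


logits = [0.0, 0.0, 0.0, 0.0]   # toy "weights": one score per token
losses = []
for _ in range(20):
    logits, loss = training_step(logits, target=2)
    losses.append(loss)

print(losses[0] > losses[-1])  # True
```

The first loss equals log(4), the loss of a uniform guess over four tokens, and it shrinks as the score for the correct token is pushed up.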
TRAINING LOSS OVER TIME
07 - Q, K, V

The mechanics
of attention

Every token produces three vectors via learned weight matrices. Their interaction determines how meaning flows across the sequence.

1
Produce Q, K, V
Q = x·Wq   K = x·Wk   V = x·Wv
Each token's embedding x is multiplied by three separate learned weight matrices, producing three vectors:

Q — Query: "What am I looking for?" Each token uses its Q to search for relevant information in other tokens.

K — Key: "What do I offer?" Each token broadcasts its K so others can decide how relevant it is to their query.

V — Value: "Here's what I'll contribute." If a token is deemed relevant, its V is what actually gets passed along.
2
Compute attention scores
score(i,j) = Qᵢ · Kⱼ / √d
Every token's Q is compared against every token's K using a dot product, which measures how well they match. The result is a raw score representing how much attention token i should pay to token j. Dividing by √d, where d is the dimension (size) of the key vectors, keeps the scores from growing so large that softmax saturates and training becomes unstable.
3
Normalize with softmax
αᵢⱼ = softmax(scoreᵢⱼ)
The raw scores are run through a function called softmax, which exponentiates each score and divides by the total, so that every weight lies between 0 and 1 and the weights for token i sum to 1. This turns "token j scored higher than token k" into a concrete share: how much of its attention token i should direct at each token.
4
Weighted sum of Values
outputᵢ = Σⱼ αᵢⱼ · Vⱼ
Each token's final output is a weighted blend of the V vectors of all the tokens it attends to, its own included. Tokens that scored high in the attention step contribute more. If "bank" borrows heavily from "river", its output vector now carries the meaning of a riverbank rather than a financial institution. This is how context collapses into meaning.
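Steps 1 to 4 fit in one short function. The sketch below is a single attention head with no causal mask, written with plain lists so it has no dependencies; the inputs and the identity weight matrices are made up for illustration, whereas a real model learns Wq, Wk, and Wv.

```python
import math


def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs, Wq, Wk, Wv):
    # Step 1: project every token into query, key, and value vectors.
    Q = [matvec(x, Wq) for x in xs]
    K = [matvec(x, Wk) for x in xs]
    V = [matvec(x, Wv) for x in xs]
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Step 2: scaled dot-product score against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        # Step 3: softmax turns the scores into weights that sum to 1.
        alpha = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        out = [sum(a * v[j] for a, v in zip(alpha, V)) for j in range(len(V[0]))]
        weights.append(alpha)
        outputs.append(out)
    return outputs, weights


# Made-up 2-dimensional inputs and identity weight matrices, for illustration only.
identity = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(xs, identity, identity, identity)
print([round(sum(row), 6) for row in weights])  # [1.0, 1.0, 1.0]
```

With these inputs, the first token gives equal weight to itself and the third token (both share its direction) and less to the second, which is the "borrow more from better matches" behaviour described above.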
08 - RLHF

Making it
helpful

A pretrained model is good at predicting text, not at following instructions. Reinforcement Learning from Human Feedback (RLHF) fine-tunes it to be helpful, honest, and safe.

Pretrained base model
Knows language deeply. Will complete any text, including harmful text.
Supervised fine-tuning
Train on human-written ideal responses to build an instruction-following baseline.
Reward model training
Humans rank model outputs. A separate reward model learns to predict human preference.
RL fine-tuning (PPO)
The policy model (the language model being tuned) is updated to maximize the reward model's score, so that outputs humans would prefer become more likely.
Deployed assistant
A model that generates text that is useful, safe, and aligned with human intent.
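One common way to train a reward model from rankings is a pairwise preference loss, sketched below. This is an assumption about the setup rather than something stated above: it supposes each human ranking is reduced to (preferred, rejected) pairs, and that the reward model outputs one number per response. The function name and the example scores are hypothetical.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): low when the preferred output scores higher."""
    diff = reward_chosen - reward_rejected
    # log(1 + exp(-diff)) is the same quantity as -log(sigmoid(diff)).
    return math.log(1.0 + math.exp(-diff))


print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))  # True
print(round(preference_loss(1.0, 1.0), 4))                    # 0.6931
```

Minimizing this loss pushes the reward model to score the human-preferred response above the rejected one; when it cannot tell them apart, the loss sits at log(2).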