@royvanrijn: For curious developers I built "The Anatomy of an LLM", an interactive explainer showing how text becomes tokens, vecto…
Summary
An interactive visual guide that explains how large language models work, from tokenization through attention, transformer blocks, and text generation, built by Roy van Rijn.
Cached at: 05/29/26, 02:09 PM
For curious developers 🧠
I built “The Anatomy of an LLM”, an interactive explainer showing how text becomes tokens, vectors, attention, transformer blocks, and finally generated text.
https://t.co/fgCeZuQwJf
The Anatomy of an LLM | Interactive Visual Guide to How Language Models Work
Source: https://www.royvanrijn.com/anatomy-of-an-llm/
Introduction
Large language models can feel like black boxes. You type a prompt, something smart comes back, and somewhere in the middle billions of parameters supposedly did “AI”.
This guide opens that box.
We will follow one chain from beginning to end. First, text is split into tokens. Those tokens become vectors. The vectors move through layers of attention and feed-forward networks. At the end, the model produces scores for possible next tokens, and a decoding strategy chooses what comes out.
The goal is not to memorize every formula. The goal is to understand what changes at each step, and why that step exists at all.
If you are looking for how LLMs work, how transformers work, or how attention, tokenization, KV cache, and quantization fit together, this page keeps those ideas connected in one visual path.
By the end, you should be able to trace the full path:
01 Text
02 Tokens
03 Vectors
04 Transformer blocks
05 Logits
06 Sampling
07 Output
And once you can trace that path, the black box becomes a lot smaller.
What you get
Concrete visuals, small numbers first, and interactive controls that make each transformation inspectable.
How to use it
Scroll top to bottom as a single narrative, or jump between chapters for a specific concept.
Who made this
Roy van Rijn, working at OpenValue
Table of contents
- 01 Tokenization
- 02 Vector Embeddings
- 03 Neuron Activation
- 04 Feed-Forward Neural Network
- 05 Logits and Sampling
- 06 Backpropagation
- 07 Optimizers
- 08 Attention: Q, K, and V
- 09 Multi-Head Attention
- 10 RoPE
- 11 Transformer Block
- 12 Training Phases
- 13 Post-Training
- 14 Context and KV Cache
- 15 Quantization
Chapter 01
Tokenization
Before a model can think about text, the text has to become numbers.
A language model does not read words and sentences the way we do. It reads a sequence of token IDs: integers produced by a tokenizer.
That makes tokenization the real entrance to the model. Everything after this point works with numbers, not raw characters.
A token can be a whole word, part of a word, punctuation, whitespace, or a piece of something strange like code, emoji, or a name. This is why tokenization often looks a bit weird when you first see it. The tokenizer is not trying to split text the way a human would. It is trying to represent text efficiently using a fixed vocabulary.
If every token were a full word, the vocabulary would explode. If every token were a single character or byte, every sentence would become very long. Modern tokenizers live between those extremes.
Slicing up the text
Before text can enter a language model, it has to be rewritten as numbers.
Tokenization is the step that does this. It splits text into small reusable pieces called tokens. A token can be a whole word, part of a word, punctuation, a number, or even a space plus the start of the next word.
Each token has an entry in the tokenizer’s vocabulary and is replaced by its corresponding integer ID. From that point on, the model is no longer working with characters directly. It sees an ordered list of token IDs.
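To make the lookup concrete, here is a minimal sketch of subword tokenization. It assumes a tiny hypothetical vocabulary (`TOY_VOCAB`) and a greedy longest-match rule. Real tokenizers such as o200k_base use a learned byte-pair-encoding vocabulary, so the IDs and boundaries below are illustrative only.

```python
# Hypothetical toy vocabulary: token string -> integer ID.
TOY_VOCAB = {
    "un": 0, "break": 1, "able": 2, " the": 3, " cat": 4,
    "a": 5, "b": 6, "e": 7, "k": 8, "l": 9, "n": 10, "r": 11, "u": 12,
    " ": 13, "c": 14, "t": 15, "h": 16,
}


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match: take the longest vocabulary entry at each position."""
    ids = []
    pos = 0
    max_len = max(len(token) for token in vocab)
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += length
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def detokenize(ids, vocab=TOY_VOCAB):
    """Map IDs back to their strings and join them."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return "".join(id_to_token[i] for i in ids)


ids = tokenize("unbreakable the cat")
print(ids)              # common pieces stay compact
print(tokenize("cab"))  # unusual text falls back to single characters
```

The single-character entries play the role that byte-level fallback plays in real tokenizers: anything can still be encoded, just with more tokens.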
Why not just use words?
Whole words are too rigid. New names, typos, code, inflections, compound words, and multilingual text would constantly produce words the model has never seen before.
Why not just use letters or bytes?
That solves the “unknown word” problem, but makes every input much longer. More pieces means more work for the model and less context fits in the same window. Subword tokens are the reasonable compromise: common text stays compact, while unusual text can still be built from smaller pieces.
Below you can experiment with OpenAI’s o200k_base tokenizer. Try switching sentences and watch where the boundaries land.
Later in this explainer, when the model predicts the next token, it predicts over this same vocabulary.
Technical note: the examples below are generated with tiktoken using the o200k_base encoding.
Example sentence
Raw sentence
If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.
102 characters
22 tokens
5 chars/token on average
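Tokenized result (· marks a leading space)

| Token | ID |
| --- | --- |
| If | #3335 |
| ·the | #290 |
| ·human | #5396 |
| ·brain | #12891 |
| ·were | #1504 |
| ·so | #813 |
| ·simple | #4705 |
| ·that | #484 |
| ·we | #581 |
| ·could | #2023 |
| ·understand | #4218 |
| ·it | #480 |
| , | #11 |
| ·we | #581 |
| ·would | #1481 |
| ·be | #413 |
| ·so | #813 |
| ·simple | #4705 |
| ·that | #484 |
| ·we | #581 |
| ·couldn’t | #21149 |
| . | #13 |
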
Important takeaway
Tokenization is not just preprocessing. It determines what the model can see in one context window, how expensive your text is, and which pieces the model is allowed to predict next.
One word is not one token
Different models use different tokenizers. The same sentence can become a different number of tokens depending on the model.
Chapter 02
Vector Embeddings
Token IDs are just labels. Embeddings turn those labels into something the network can work with.
After tokenization, every token is represented by an integer ID. But an ID by itself has no useful geometry. Token 15339 is not “close to” token 15340 in any meaningful way. The numbers are just labels, like row numbers in a table.
The embedding layer solves this by turning each token ID into a vector: a list of learned numbers. Technically, this is a lookup. The model has an embedding matrix, and each token ID selects one row from that matrix.
Conceptually, this is the moment where discrete symbols enter a continuous space. Once tokens become vectors, the model can compare them, combine them, rotate them, project them, and gradually reshape them.
The values inside these vectors are learned during training. Tokens that appear in similar contexts often end up with related vectors, but this is not a clean dictionary of meanings. It is more like a messy, high-dimensional coordinate system full of useful signals.
The initial embedding is mostly context-free. The token “bank” starts with the same embedding in “river bank” and “investment bank”. Later layers use surrounding tokens to rewrite that vector into something more specific.
From token ID to embedding vector
Embedding lookup
After tokenization, each token ID is used as an index into an embedding table. The selected row is a high-dimensional vector that becomes the model’s starting representation for that token.
For readability, this chapter uses a toy embedding width of 24 dimensions. Real model widths are usually much larger: common production widths include 768, 1024, 1536, 3072, and higher.
An embedded vector is just a list of floating point numbers: dog = [0.7292, -0.3786, 0.1065, 0.3674, 0.1902, -0.7881, ... ]
Example sentence, token in sentence: If -> token ID #3335 -> embedding row 3335
Embedding values (24 dimensions)
This explainer shows all 24 values from the toy vector.
[0.2173, 0.5424, 0.264, -0.9419, -0.5084, 0.0872, -0.6438, 0.164, -0.2094, 0.6078, 0.9056, -0.5944, 0.1676, -0.0086, -0.6874, -0.5004, -0.4561, -0.168, 0.443, -0.6566, -0.184, -0.4863, 0.679, -0.044]
The same token ID always maps to the same embedding vector.
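The lookup itself is just row selection. The sketch below assumes a hypothetical vocabulary size and fills the matrix with random numbers, since the trained values are not available here. Only the indexing step reflects what the text describes.

```python
import random

VOCAB_SIZE = 5000  # hypothetical; real vocabularies are far larger
WIDTH = 24         # the toy embedding width used in this chapter

rng = random.Random(0)
# One row per token ID. Training would learn these numbers; here they are random.
embedding_matrix = [
    [rng.uniform(-1.0, 1.0) for _ in range(WIDTH)] for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Replace each token ID with its row of the embedding matrix."""
    return [embedding_matrix[token_id] for token_id in token_ids]


# IDs taken from the tokenized example sentence: "If", "·the", and "If" again.
vectors = embed([3335, 290, 3335])
print(len(vectors), len(vectors[0]))
print(vectors[0] == vectors[2])  # the same ID selects the same row
```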
In real models, these embedding values are learned during training. Tokens that appear in similar contexts are gradually moved to useful regions of vector space, so the vectors end up encoding patterns the model can build on.
Tokens that often play similar roles get nudged in similar directions. For example, the tokens cat, dog, and rabbit often appear in sentence templates like “The ___ is sleeping”, “I fed the ___”, or “The ___ ran away”. Because they appear in similar contexts, their vectors may end up close together.
But cat and car usually appear in very different contexts, so their vectors tend to end up farther apart.
The embedding space is not hand-designed. Nobody tells the model “put animals over here” or “put verbs over there”. Those patterns emerge because moving the vectors that way helps the model predict text better.
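One common way to measure "close together" is cosine similarity. The three vectors below are made-up 3-dimensional stand-ins, chosen by hand to mimic the cat/dog/car pattern described above. They are not values from any real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, hand-picked for illustration.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```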
2D analogy intuition
The offsets between embedding vectors are often similar when the word pairs share a similar relationship, for example puppy -> dog and kitten -> cat.
Important takeaway
An embedding is the token’s starting representation, not its final meaning. The rest of the model will keep rewriting that vector as context flows through the network.
Toy scale
In this explainer we use small vectors because they fit on screen. Real models use much wider vectors: hundreds, thousands, or more dimensions per token.
Chapter 03
Neuron Activation
A weighted sum is not enough. The non-linearity is where the network gets expressive.
A neuron takes inputs, multiplies them by weights, adds them together, and produces a number. But if that were the whole story, deep learning would not be very deep.
Without activation functions, stacking layers would still behave like one large linear transformation. You could multiply matrices together and collapse the whole stack into a single matrix.
The activation function breaks that linearity. It decides how much of a signal passes through. Some values are amplified, some are softened, some are pushed toward zero.
This lets the network build curved, conditional, non-linear transformations instead of only scaling and rotating vectors. Real models do this in huge batches using matrix operations, with millions of activations happening at once.
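The collapse argument can be checked with a few numbers. This is a small sketch with arbitrary example matrices. ReLU is used as the non-linearity only because it is the simplest to write down.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu(matrix):
    """Element-wise max(0, value)."""
    return [[max(0.0, v) for v in row] for row in matrix]


x = [[1.0, -2.0]]                # one input row, arbitrary values
w1 = [[0.5, -1.0], [1.5, 2.0]]   # first "layer"
w2 = [[2.0, 0.0], [-1.0, 1.0]]   # second "layer"

two_linear_layers = matmul(matmul(x, w1), w2)
one_collapsed_layer = matmul(x, matmul(w1, w2))    # same result, one matrix
with_activation = matmul(relu(matmul(x, w1)), w2)  # no longer collapsible

print(two_linear_layers, one_collapsed_layer, with_activation)
```

Without the activation, the two stacked layers and the single pre-multiplied matrix give the same output. With the activation in between, they do not.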
Single-neuron transformation
A neuron takes inputs, applies weights, and then runs the result through an activation function. This non-linear step is what lets networks model richer patterns.
`z = w1*x1 + w2*x2 + w3*x3`
`output = activation(z)`
Neuron diagram
Inputs: x1 = 0.70, x2 = -0.25, x3 = 0.45
Weights: w1 = 1.10, w2 = -0.85, w3 = 0.55
Weighted sum: z = 1.23
Activation function: GELU. Smoothly gates values by magnitude instead of hard clipping. Common in transformer blocks; a bit heavier to compute than ReLU.
Neuron output: 1.0953
Activation curve
Marker position updates live as the weighted input z changes.
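The neuron in the diagram can be recomputed by hand. The sketch below uses the inputs and weights shown above and assumes the widely used tanh approximation of GELU. Whether the page uses this exact variant is an assumption, so the output is only expected to land close to the displayed value.

```python
import math


def gelu(z):
    """tanh approximation of GELU (assumed variant, see lead-in)."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))


inputs = [0.70, -0.25, 0.45]
weights = [1.10, -0.85, 0.55]

# Weighted sum: z = w1*x1 + w2*x2 + w3*x3
z = sum(w * x for w, x in zip(weights, inputs))
output = gelu(z)

print(f"z = {z:.2f}")
print(f"output = {output:.4f}")
```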
Important takeaway
The activation function is not decoration. It is what lets stacked layers become more than one big linear calculation.
Modern choices
Modern transformer models may use GELU, SiLU, or gated variants like SwiGLU. The exact choice changes both the forward signal and how gradients flow during training.
Chapter 04
Feed-Forward Neural Network
A real layer is not one neuron. It is many simple computations running in parallel.
A single neuron is a useful teaching tool, but models do not process one neuron at a time. A feed-forward network applies many learned transformations in parallel.
Instead of drawing every neuron and every connection, implementations usually express the same thing as matrix multiplication. The friendly diagram says inputs flow through neurons. The implementation says multiply a matrix, apply an activation, multiply another matrix.
Those are the same story at different scales.
In transformer blocks, the feed-forward part usually works position by position. Each token vector is expanded into a wider hidden representation, passed through a non-linearity, and projected back to the model width.
Attention moves information between positions. The feed-forward network transforms the information inside each position.
Dense layer math, visually
Instead of training a full network here, we focus on one forward pass. A dense layer simply means every node in one layer connects to every node in the next layer.
The same math from one neuron is now done in parallel using matrices:
`X(1x2) · W1(2x3) = Z1(1x3)`, then `A1 = activation(Z1)`, then `A1(1x3) · W2(3x2) = Z2(1x2)`.
Matrix multiplication is just many weighted sums at once. Each column of the weight matrix is one neuron, and each input feature contributes through its matching weight row.
Fully connected view
Network diagram: inputs x1, x2; hidden neurons h1, h2, h3; outputs o1, o2. Matrix labels: X, W1, A1, W2, A2. Hover the top labels to inspect matrices. Green border means firing, red means suppressed.
Inputs: x1 = 0.80, x2 = -0.30
Matrix inspector
Hover one of the top labels (X, W1, A1, W2, A2) to inspect that matrix and the multiplication step.
How multiplication maps to connections
Column j in W1 contains weights feeding hidden neuron j. Row i corresponds to input feature i. So each hidden pre-activation is: `z1_j = x1*w1_1j + x2*w1_2j`.
The second layer repeats that pattern with A1 as input: `z2_k = a1_1*w2_1k + a1_2*w2_2k + a1_3*w2_3k`. This is exactly the graph computation, just vectorized.
In matrix form, we avoid writing each neuron separately: `[x1 x2] · W1 = [z1_1 z1_2 z1_3]`, then activation applies element-wise to produce A1. That A1 row is then multiplied by W2 to produce both output neurons at once.
Example from the current sliders: `z1_1 = (+0.80)*(+0.70) + (-0.30)*(+0.10) = +0.53`. If activation suppresses this value (for example ReLU on negative values), that path contributes less or zero to the next layer.
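The same forward pass can be written out directly. In this sketch only the inputs and the first column of `W1` (0.70 and 0.10) come from the slider example. Every other weight is a made-up value, and ReLU is assumed as the activation.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu(matrix):
    return [[max(0.0, v) for v in row] for row in matrix]


X = [[0.80, -0.30]]           # 1 x 2 input row
W1 = [[0.70, -0.40, 0.20],    # 2 x 3: column j feeds hidden neuron j
      [0.10, 0.60, -0.50]]    # first column from the example, rest hypothetical
W2 = [[0.50, -0.20],          # 3 x 2: hypothetical output weights
      [0.30, 0.80],
      [-0.60, 0.40]]

Z1 = matmul(X, W1)   # hidden pre-activations, shape 1 x 3
A1 = relu(Z1)        # element-wise activation
Z2 = matmul(A1, W2)  # output pre-activations, shape 1 x 2

print("Z1 =", [round(v, 3) for v in Z1[0]])
print("A1 =", [round(v, 3) for v in A1[0]])
print("Z2 =", [round(v, 3) for v in Z2[0]])
```

The second hidden neuron gets a negative pre-activation here, so ReLU zeroes it and that path contributes nothing to the outputs.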
Important takeaway
The feed-forward network is where each token vector gets rewritten. It is not about moving information between tokens; it is about transforming the representation at each token position.
Matrix view
The matrix view is not a less intuitive version of the neuron diagram. It is the scalable version of the same computation.
Chapter 05
Logits and Sampling
The model does not directly output a word. It outputs scores for possible next tokens.
After the model has processed the input, it still has not chosen a word. What it has produced is a vector of raw scores: one score for every token in the vocabulary. These scores are called logits.
A logit is not a probability. It is just an unnormalized score. Higher usually means “the model thinks this token fits better here”, but the numbers do not yet add up to 100%.
To turn logits into probabilities, we apply softmax. Then comes decoding: the policy for choosing the next token from that distribution.
Greedy decoding always picks the most likely token. Temperature changes the shape of the distribution. Top-k limits the choice to the k most likely tokens. Top-p, also called nucleus sampling, chooses from the smallest group of tokens whose total probability passes a threshold.
The model produces the distribution. The decoder decides how adventurous we are when sampling from it.
From logits to generated output
A model converts the final hidden vector into one score per vocabulary token. Those raw scores are logits. Softmax turns them into probabilities, and sampling chooses the next token.
01 Hidden
02 Vocab projection
03 Logits
04 Softmax(T)
05 Probabilities
06 Sampled token
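The last steps of that pipeline fit in a few lines. This is a minimal sketch with four made-up logits. The function names (`top_k_filter`, `top_p_filter`, `sample`) are illustrative and not taken from any particular library.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _keep_and_renormalize(probs, keep):
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _keep_and_renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _keep_and_renormalize(probs, keep)


def sample(probs, rng):
    """Draw one token index according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
probs = softmax(logits)
greedy = max(range(len(probs)), key=lambda i: probs[i])
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
picked = sample(top_p_filter(probs, 0.8), random.Random(0))
print(greedy, [round(p, 3) for p in probs], picked)
```

Greedy decoding is the special case where the top-ranked index is always taken and no random draw happens.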
Important takeaway
The model usually does not contain one fixed answer. At each generation step, it produces a probability distribution over possible next tokens.
Token by token
A chatbot answer is built one token at a time. After each sampled token, the new token is added to the context and the process repeats.
Chapter 06
Backpropagation
To learn, the model needs to know which parameters helped cause the mistake.
Training starts with a simple question: how wrong was the model?
The model predicts a distribution over the next token. We know which token actually came next in the training text. The loss measures how far the prediction was from that target.
But measuring the loss is not enough. The model has billions of parameters. Which ones should change? And by how much?
Backpropagation answers that question. It sends the error signal backward through the computation graph and calculates gradients: how sensitive the loss is to each parameter.
The core idea is the chain rule. Every operation only needs to know how its output changes with respect to its input. By chaining those local derivatives together, training can calculate how a tiny change deep inside the model would affect the final loss.
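The chain rule can be checked numerically on a very small model. The sketch below assumes a one-layer "model" (logits = weights times an input vector) with a cross-entropy loss, then compares the chained local derivatives against a finite-difference estimate. All numbers are made up.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, x):
    """logit_i = sum_j weights[i][j] * x[j]"""
    return [sum(w * xj for w, xj in zip(row, x)) for row in weights]


def loss_fn(weights, x, target):
    """Cross-entropy: negative log-probability of the true next token."""
    return -math.log(softmax(forward(weights, x))[target])


def analytic_gradient(weights, x, target):
    probs = softmax(forward(weights, x))
    # Local derivative 1: d(loss)/d(logit_i) = p_i - y_i (y is one-hot).
    dlogits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Local derivative 2: d(logit_i)/d(weights[i][j]) = x_j, chained with step 1.
    return [[dl * xj for xj in x] for dl in dlogits]


def numeric_gradient(weights, x, target, eps=1e-6):
    """Central finite differences, one parameter at a time (slow but simple)."""
    grads = []
    for i, row in enumerate(weights):
        grad_row = []
        for j in range(len(row)):
            plus = [r[:] for r in weights]
            minus = [r[:] for r in weights]
            plus[i][j] += eps
            minus[i][j] -= eps
            diff = loss_fn(plus, x, target) - loss_fn(minus, x, target)
            grad_row.append(diff / (2 * eps))
        grads.append(grad_row)
    return grads


weights = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]  # 3 tokens, 2 features
x = [1.0, 0.5]
target = 1

analytic = analytic_gradient(weights, x, target)
numeric = numeric_gradient(weights, x, target)
max_gap = max(abs(a - n) for ra, rn in zip(analytic, numeric) for a, n in zip(ra, rn))
print(f"largest difference: {max_gap:.2e}")
```

Backpropagation gets every gradient from one backward sweep, whereas the finite-difference check needs two extra forward passes per parameter. That is the efficiency difference the text refers to.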
Error becomes learning signal
We will train on one tiny example and reveal each step in order: forward prediction, backward gradients, then the weight update.
01 Select target
02 Forward snapshot
03 Calculate backward
04 Apply update
Step 1 - Predict from input
Three, two, one...___
Learning rate: 0.35
Important takeaway
Backpropagation is not a second mysterious intelligence inside the model. It is an efficient way to calculate gradients through a large composed computation.
Three passes
Forward pass: make a prediction. Backward pass: calculate how to change the parameters. Optimizer step: actually change them.
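Those three passes map onto a short loop. This sketch assumes a three-token toy vocabulary and a two-number input vector standing in for the prompt. Both are hypothetical, and only the learning rate of 0.35 is borrowed from the demo above.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["zero", "go", "four"]           # hypothetical candidate next tokens
x = [1.0, 0.5]                           # hypothetical features for the prompt
target = 0                               # index of the token that actually came next
weights = [[0.0, 0.0] for _ in vocab]    # one weight row per vocabulary token
learning_rate = 0.35

losses = []
for step in range(20):
    # Forward pass: make a prediction and measure the loss.
    logits = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
    probs = softmax(logits)
    losses.append(-math.log(probs[target]))

    # Backward pass: gradient of the loss for every weight (chain rule).
    grads = [
        [(p - (1.0 if i == target else 0.0)) * xj for xj in x]
        for i, p in enumerate(probs)
    ]

    # Optimizer step: plain gradient descent.
    for i in range(len(weights)):
        for j in range(len(x)):
            weights[i][j] -= learning_rate * grads[i][j]

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```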
Chapter 07
Optimizers
Gradients point downhill. Optimizers decide how to walk.
A gradient tells us which direction should reduce the loss. But it does not fully answer how to update the model.
How big should the step be? Should we trust the current gradient completely? Should we remember previous gradients? What if different parameters have wildly different gradient scales?
That is the job of the optimizer.
SGD, or stochastic gradient descent, is the simplest common version. It looks at a small batch of training examples, calculates the gradient, and takes one step in the direction that should reduce the loss. It is direct and easy to understand, but each step can be noisy because it only sees a slice of the training data.
Momentum improves on this by remembering direction. If gradients keep pointing roughly the same way, momentum builds speed. If they zigzag, momentum smooths the path.
Adam tracks both a moving average of the gradients and a moving estimate of their scale. That lets it adapt update sizes per parameter.
The optimizer is not just a detail after backpropagation. It is part of the learning behavior.
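The three update rules can be compared side by side. The sketch below assumes a simple quadratic bowl as the loss surface and typical textbook hyperparameters. It is not the surface or the settings used in the interactive demo.

```python
import math


def loss(p):
    x, y = p
    return 0.5 * (x * x + 10.0 * y * y)  # hypothetical loss: steep in y, shallow in x


def grad(p):
    x, y = p
    return [x, 10.0 * y]


def run_sgd(start, lr, steps):
    p = list(start)
    for _ in range(steps):
        p = [pi - lr * gi for pi, gi in zip(p, grad(p))]
    return p


def run_momentum(start, lr, steps, beta=0.9):
    p, velocity = list(start), [0.0, 0.0]
    for _ in range(steps):
        # Velocity remembers previous gradients.
        velocity = [beta * vi + gi for vi, gi in zip(velocity, grad(p))]
        p = [pi - lr * vi for pi, vi in zip(p, velocity)]
    return p


def run_adam(start, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    p, m, v = list(start), [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        g = grad(p)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]       # mean
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]  # scale
        m_hat = [mi / (1 - beta1 ** t) for mi in m]  # bias correction
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        p = [pi - lr * mh / (math.sqrt(vh) + eps) for pi, mh, vh in zip(p, m_hat, v_hat)]
    return p


start = [2.0, 1.0]
lr, steps = 0.1, 18
final_losses = {
    "sgd": loss(run_sgd(start, lr, steps)),
    "momentum": loss(run_momentum(start, lr, steps)),
    "adam": loss(run_adam(start, lr, steps)),
}
print(loss(start), final_losses)
```

All three functions call the same `grad`. The only difference is the state each one carries between steps: none for SGD, a velocity for Momentum, and a mean plus a scale estimate for Adam.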
Different update rules, same gradients
Backprop gives gradients. Optimizers decide how to turn those gradients into actual parameter updates.
01 Same gradients
02 Different update rules
03 Different trajectories
Optimizer trajectories on one toy loss surface

| Optimizer | Loss start | Loss end | Delta |
| --- | --- | --- | --- |
| SGD | 3.3000 | 0.0118 | -3.2882 |
| Momentum | 3.3000 | 0.2362 | -3.0638 |
| Adam | 3.3000 | 0.1374 | -3.1626 |

Learning rate: 0.110, steps: 18
All optimizers see the same gradients. Their update rules differ, so their paths differ.
Important takeaway
Gradients tell the model where improvement may be. The optimizer decides how aggressively and in what style the model moves there.
Same gradients, different path
SGD, Momentum, and Adam can start from the same point and see the same gradients, yet follow different paths because each optimizer keeps different internal state.
Chapter 08
Attention: Q, K, and V
Attention lets tokens pull useful information from other tokens.
Embeddings alone are too context-free. Take a word like “mole”. It might mean a small animal, a mark on skin, a spy, or a unit in chemistry. The starting embedding is the same token representation, but the meaning depends on the surrounding words.
The model needs a way for tokens to talk to each other. That is what attention does.
For each token, the model creates three learned views: query, key, and value. The query represents what this token is looking for. The key represents what this token can be matched on. The value represents the information this token can contribute.
The model compares queries to keys to produce attention scores. Those scores are turned into weights, and the weights are used to mix the value vectors. So Q and K decide where information flows. V is the information that flows.
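In code, that routing is a dot product, a softmax, and a weighted sum. The sketch below assumes the Q, K, and V vectors are already given, with made-up values for three tokens. In a real model they come from learned projections of the token vectors.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention, one query row at a time (no masking)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Q and K decide where information flows.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # V is the information that flows: a weighted mix of value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy vectors for three tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

Because the toy value vectors are one-hot, each output row simply equals its attention weights. That makes the mixing easy to see.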
How tokens exchange information
Right now we only have tokens. But sentences encode extra meaning through relationships between nearby words and references.
Select one token to inspect which key tokens it matches with (arrows), then how those weights mix into one updated value representation.
Context Scenario
A blue car crashed into a concrete wall, it was speeding.
Sentence Tokens
Pick any token to compute attention links and value mixing.
Important takeaway
Attention is information routing. Query and key determine relevance; value carries the content that gets mixed in.
Self-attention
In self-attention, tokens attend to other tokens in the same sequence. In a decoder-only LLM, causal masking prevents a token from attending to future tokens during generation.
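Causal masking can be sketched as a softmax that only looks at the current and earlier positions. This is equivalent to setting the scores for future positions to negative infinity before the softmax. The score matrix below is made up.

```python
import math


def causal_softmax(scores):
    """Row i may attend to columns 0..i only; later columns get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        hidden_future = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + hidden_future)
    return weights


# Hypothetical raw query-key scores for a 3-token sequence.
scores = [
    [0.2, 1.5, 0.3],
    [0.4, 0.1, 2.0],
    [1.0, 1.0, 1.0],
]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```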
Chapter 09
Multi-Head Attention
One attention pattern is useful. Many attention patterns in parallel are much more powerful.
A sentence contains many kinds of relationships at once. An adjective may modify a noun. A pronoun may refer to something earlier. A closing bracket may match an opening bracket. A verb may depend on the subject.
One attention head can learn one way of routing information. But one routing pattern is not enough. Multi-head attention runs several attention heads in parallel. Each head has its own learned projections, so each head can learn a different kind of relationship.
After the heads produce their outputs, those outputs are combined and projected back into the model dimension. This does not mean every head has a clean human-readable job. Attention weights are useful clues, not perfect explanations.
Modern models often use grouped-query attention. Groups of query heads share key/value heads, reducing memory use during inference, especially in the KV cache, while keeping much of the benefit of many query heads.
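The memory saving is simple arithmetic. The sketch below assumes 32 query heads, 8 KV heads, and a head width of 128, matching the Llama-style dimensions used later in this guide. The helper names are illustrative.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Which shared K/V head a given query head reads from."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_cache_values_per_token(num_kv_heads, head_width):
    """Numbers cached per token, per layer: one K and one V vector per KV head."""
    return 2 * num_kv_heads * head_width


full_mha = kv_cache_values_per_token(num_kv_heads=32, head_width=128)
grouped = kv_cache_values_per_token(num_kv_heads=8, head_width=128)

print([kv_head_for(h, 32, 8) for h in range(8)])  # first 8 query heads
print(full_mha, grouped, full_mha // grouped)
```

With these numbers, four query heads share each KV head, so the cache per token is a quarter of what full multi-head attention would need.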
Raw scores -> softmax weights -> value mixing
We also introduce multi-head attention here. In modern Transformer models each block doesn’t just have a single attention head, but multiple. Different heads can learn different routing patterns, then their outputs are combined.
Each token creates three learned views of itself:
- Q: the question this token asks.
- K: what this token advertises about itself.
- V: the information this token contributes.

For one selected query token, we compare its Q vector with every K vector.
Only after softmax do these scores become attention weights. Those weights decide how much of each V vector is mixed into this token’s next representation.
Tensor Shapes
We start with token vectors, project them into Q, K, and V, compute query-key compatibility scores, then convert those scores into attention weights and mix values.
`Q = XWq` -> `K = XWk` -> `V = XWv` -> `scores = QK^T / sqrt(d_k)` -> `weights = softmax(scores)` -> `output = weights · V`
This example uses unmasked self-attention, so every token can attend to every token. A GPT-style causal decoder would mask future tokens.
Which token is asking a question?
Selected token: blue
Its query asks: “Which other tokens help me understand blue?”
Emphasizes modifiers routing to the noun they describe (for example blue -> car).
Q View
blue embedding [+0.200, +0.900, +0.400]
↓ multiply by Wq
Q_blue = [+0.310, +0.720, +0.650]
K View
Each token embedding times Wk gives its advertised key vector.
K_The, K_blue, K_car, K_hit, K_the, K_wall
V View
Each token embedding times Wv gives value content to mix if attended.
V_The, V_blue, V_car, V_hit, V_the, V_wall
Raw Query-Key Scores (Not Attention Yet)
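
| Q \ K | The | blue | car | hit | the | wall |
| --- | --- | --- | --- | --- | --- | --- |
| The | +0.212 | +0.191 | +0.276 | +0.366 | +0.190 | +0.297 |
| blue | +0.307 | +0.384 | +2.270 | +0.703 | +0.284 | +0.261 |
| car | +0.425 | +1.293 | +1.122 | +0.988 | +0.393 | +1.124 |
| hit | +0.400 | +0.503 | +0.739 | +0.846 | +0.363 | +0.798 |
| the | +0.193 | +0.184 | +0.265 | +0.339 | +0.173 | +0.283 |
| wall | +0.415 | +0.730 | +1.047 | +0.978 | +0.382 | +1.473 |
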
Step 1 · Selected Query Dot Keys
blue query · The key = +0.307
blue query · blue key = +0.384
blue query · car key = +2.270
blue query · hit key = +0.703
blue query · the key = +0.284
blue query · wall key = +0.261
Step 2 · Softmax To Attention Weights
softmax([+0.307, +0.384, +2.270, +0.703, +0.284, +0.261])
Row sum: 7.9 + 8.6 + 56.4 + 11.8 + 7.7 + 7.6 = 100.0%
Step 3 · Weighted Value Mix
Attention decides which value vectors get mixed into this token’s next representation.
`output[1] = sum_i weights[1,i] * V[i]`
head_output_blue = [+0.587, +0.970, +0.680]
Highest attention target: car (56.4%).
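The softmax step for the blue row can be reproduced from the six raw scores listed above. The value vectors are not shown in the example, so this sketch stops at the attention weights and does not attempt to recompute `head_output_blue`.

```python
import math

tokens = ["The", "blue", "car", "hit", "the", "wall"]
blue_scores = [0.307, 0.384, 2.270, 0.703, 0.284, 0.261]  # row "blue" of the table

# Softmax: exponentiate each score, then divide by the total.
exps = [math.exp(score) for score in blue_scores]
total = sum(exps)
weights = [e / total for e in exps]

percentages = [round(100 * w, 1) for w in weights]
top_token = tokens[weights.index(max(weights))]

for token, pct in zip(tokens, percentages):
    print(f"{token:>5}: {pct}%")
print("highest attention target:", top_token)
```

The score for car is only about 1.6 higher than the next-highest score, but the exponential in softmax turns that gap into more than half of the total weight.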
Important takeaway
Multi-head attention gives the model several ways to route information at the same time. Grouped-query attention is a practical modern variant that makes this cheaper during inference.
Interpretation caveat
Attention heads are not little thought modules. They are learned projections that may specialize, overlap, or behave in ways that are hard to summarize cleanly.
Chapter 10
RoPE
Attention needs to know order. RoPE gives position information directly to the attention mechanism.
Attention compares tokens by content. But language also depends on order. “Dog bites man” and “man bites dog” contain the same words, but they do not mean the same thing.
Older transformer explanations often describe positional encodings as vectors added to token embeddings. That works, but many modern decoder-only models use something more integrated with attention: RoPE, or Rotary Position Embedding.
RoPE rotates parts of the query and key vectors based on their token positions. When attention compares a query with a key, the comparison should depend on both content and relative position.
Because RoPE modifies Q and K, it changes the attention scores. It does not directly rotate the value vectors, and it does not decide attention by itself. It changes which query/key pairs line up well.
Relative position through rotation
**Problem.** Attention sees tokens, but it also needs word order. "dog bites man" and "man bites dog" contain the same words, but positions change meaning.
**Naive idea.** One option is to add a position vector to each token. RoPE does something different.
**RoPE idea.** RoPE makes attention position-aware by rotating Q and K vectors according to token position before their dot product is computed. It does not rotate V.
Word Order Matters
"dog bites man" is not the same as "man bites dog"
Same tokens, different positions. RoPE makes Q·K sensitive to that position change.
Same Token, Different Position
Example sentence: The small dog chased the ball.
In this visual, clicking a word temporarily treats that word as relative index0. RoPE is relative in this sense: if you look from a different token, the position offsets change, so the rotations you compare change too.
Click any token to make it the reference frame. That token stays unrotated while all other tokens rotate relative to it.
Relative offset insight
The selected token The is the anchor. Other tokens rotate by their position difference to this anchor. In the dot product, the important angle is theta_m - theta_n, so compatibility depends on relative offset m - n.
In this toy pair, dot(before rotation) = +0.734 and dot(after RoPE rotation) = -0.157. As positions change, relative angle changes, and the query-key dot product changes too.
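The relative-offset property is easy to verify for a single 2-D pair. The sketch below assumes one rotation frequency (`theta = 0.5`) and made-up query and key values. Real RoPE applies the same idea to many dimension pairs with different frequencies.

```python
import math


def rotate_pair(vec, position, theta=0.5):
    """Rotate a 2-D pair by the angle position * theta (one RoPE frequency)."""
    angle = position * theta
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    x, y = vec
    return [x * cos_a - y * sin_a, x * sin_a + y * cos_a]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.8, 0.3]  # hypothetical query pair
k = [0.5, 0.7]  # hypothetical key pair

# Positions 5 and 2: relative offset 3.
score_a = dot(rotate_pair(q, 5), rotate_pair(k, 2))
# Both positions shifted by 100: the offset is still 3.
score_b = dot(rotate_pair(q, 105), rotate_pair(k, 102))
# Positions 5 and 4: relative offset 1.
score_c = dot(rotate_pair(q, 5), rotate_pair(k, 4))

print(f"offset 3: {score_a:.4f}")
print(f"offset 3, shifted: {score_b:.4f}")
print(f"offset 1: {score_c:.4f}")
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset changes it. That is what "depends on m - n" means in practice.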
Multi-frequency pairs
Real vectors have many dimension pairs. Different pairs rotate at different speeds: fast pairs capture nearby offsets, while slow pairs preserve longer-range position patterns.
Connect back to attention
RoPE changes the score matrix before softmax. It does not directly decide attention by itself; it changes which Q/K pairs are compatible at different relative positions. RoPE gives attention a position-dependent bias, and the model still has to learn how to use it.
Important takeaway
RoPE injects position into attention by rotating query and key vectors. It helps the model reason about relative position while computing attention.
Compatibility, not payload
RoPE affects compatibility, not payload. Q and K are rotated; V is not.
Chapter 11
Transformer Block
This is where the pieces become the repeated structure of the model.
A transformer is built by stacking blocks. Each block takes in a sequence of token vectors and returns a sequence of token vectors with the same basic shape. The rows still correspond to token positions. The width is still the model dimension.
What changes is the information inside those vectors.
A modern decoder block usually normalizes the input, applies attention so tokens can exchange information, adds the result back through a residual connection, normalizes again, applies a feed-forward network, and adds that result back too.
The residual stream is the running representation that moves through the network. Attention mixes information between positions. The feed-forward network transforms each position. Normalization helps keep values stable. Residual connections preserve a path for information and gradients through many layers.
Layer by layer, the initially context-free embeddings become rich context-aware representations.
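The order of operations in such a block can be sketched in plain Python. This is a heavily simplified stand-in: attention uses a single head with no learned projections (Q = K = V = the normalized rows), the feed-forward part uses ReLU instead of SwiGLU, RMSNorm has no learned gain, and all weights are random. Only the structure (normalize, attend, add, normalize, transform, add) follows the description above.

```python
import math
import random


def rms_norm(x, eps=1e-6):
    """Scale a vector so its root-mean-square is about 1 (no learned gain here)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def causal_attention(rows):
    """Single-head stand-in: each position mixes itself and earlier positions."""
    d = len(rows[0])
    out = []
    for i, q in enumerate(rows):
        scores = [
            sum(a * b for a, b in zip(q, rows[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * rows[j][c] for j, e in enumerate(exps)) for c in range(d)])
    return out


def feed_forward(x, w_in, w_out):
    """Expand to the hidden width, apply ReLU, project back to the model width."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in w_in]
    return [sum(h * w for h, w in zip(hidden, col)) for col in w_out]


def transformer_block(rows, w_in, w_out):
    attended = causal_attention([rms_norm(r) for r in rows])
    rows = [[a + b for a, b in zip(r, t)] for r, t in zip(rows, attended)]      # residual 1
    transformed = [feed_forward(rms_norm(r), w_in, w_out) for r in rows]
    return [[a + b for a, b in zip(r, t)] for r, t in zip(rows, transformed)]   # residual 2


rng = random.Random(0)
D, HIDDEN = 4, 8  # toy model width and hidden width
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(HIDDEN)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(D)]
x = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(3)]  # 3 token vectors

y = transformer_block(x, w_in, w_out)
print(len(y), len(y[0]))  # same shape as the input
```

The output has the same shape as the input, which is why blocks can be stacked one after another.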
One modern decoder block, end-to-end
This chapter combines what we learned into one full transformer block: normalization, multi-headed attention, residual paths, and a feed-forward network.
Let’s look at an actual example of how all these elements are combined to build one Transformer block in a modern decoder-only model.
Click any block part to inspect its role, input/output dimensions, and jump back to the chapter where that part was introduced in detail.
Modern Decoder Block Dimensions
Reference style: Modern Llama-style decoder block dimensions

| Setting | Value |
| --- | --- |
| Sequence length shown | 8 |
| Model width (d_model) | 4096 |
| Layers | 32 |
| Query heads | 32 |
| KV heads | 8 |
| Head width (d_head) | 128 |
| Q shape per token | 32 x 128 |
| K/V shape per token | 8 x 128 |
| Concat attention output | 4096 |
| FFN hidden width | 14336 |
| Norm | RMSNorm |
| Position encoding | RoPE |
| MLP | SwiGLU |
| Attention | causal + grouped-query attention |
| Block input/output shape | [8 x 4096] |

Block flow: Input X -> RMSNorm 1 -> Causal GQA + RoPE (Q/K/V -> scores -> mix) -> + Residual (skip) -> RMSNorm 2 -> SwiGLU MLP (4096 -> 14336 -> 4096) -> + Residual (skip) -> Y
How This Scales In A Full Model
One block is rarely used alone. Decoder-only Transformers repeat this block many times before the final output projection over the vocabulary. In a Llama-8B-style setup, this is typically around 32 stacked blocks (layers).
Input Tokens -> Transformer Block 1 -> Transformer Block 2 -> ... -> Logits
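The dimensions above are enough for a rough parameter count. This sketch assumes the usual Llama-style layout: separate Q, K, V, and output projections without biases, and a SwiGLU MLP made of three matrices (gate, up, down). It ignores the small RMSNorm weights, the embedding table, and the final vocabulary projection.

```python
d_model = 4096
n_layers = 32
n_query_heads = 32
n_kv_heads = 8
d_head = 128
ffn_hidden = 14336

# Attention projections (weights only, no biases assumed).
q_proj = d_model * n_query_heads * d_head
k_proj = d_model * n_kv_heads * d_head   # smaller thanks to grouped-query attention
v_proj = d_model * n_kv_heads * d_head
o_proj = n_query_heads * d_head * d_model
attention_params = q_proj + k_proj + v_proj + o_proj

# Assumption: SwiGLU uses three matrices (gate, up, down).
mlp_params = 3 * d_model * ffn_hidden

per_block = attention_params + mlp_params
all_blocks = per_block * n_layers

print(f"attention per block: {attention_params:,}")
print(f"MLP per block:       {mlp_params:,}")
print(f"per block:           {per_block:,}")
print(f"all {n_layers} blocks:       {all_blocks:,}")
```

Under these assumptions the stacked blocks account for roughly 7 billion parameters, and the feed-forward part holds about four fifths of each block.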
Important takeaway
A transformer block keeps the sequence shape mostly stable while repeatedly changing what each token vector represents.
Modern decoder details
In Llama-like models, you also see choices such as RMSNorm, RoPE, SwiGLU-style feed-forward layers, causal attention, and grouped-query attention.
Chapter 12
Training Phases
Training is not magic. It is many small prediction errors turned into parameter updates.
From the outside, training often looks like one smooth curve going down. Reality is messier.
At the basic level, pretraining is simple to describe: show the model a lot of text and train it to predict the next token. It makes a prediction, measures the loss, computes gradients, and updates parameters.
Repeat that billions or trillions of times, and the model slowly becomes better at modeling text. But “loss goes down” is not the whole story.
Some patterns are learned early. Others appear much later. A model can improve on training data before it generalizes well. Sometimes better generalization arrives surprisingly late.
For large language models, training is also a scaling problem. Model size, dataset size, data quality, sequence length, optimizer settings, batch size, and compute budget all interact.
How behavior changes across training
Training is often staged, not perfectly smooth: fast fitting first, slower consolidation, and sometimes delayed generalization.
This chart is an illustrative curve, not a claim about one exact production run.
Toy training curve (loss vs optimization steps)
Axes: steps (x) and loss (y), with a step marker. Series: train loss, validation loss.
Auto-detected phase summary
Phase 1: Fit training data
**Train:** Training loss falls quickly.
**Validation:** Validation improves a bit, then slows.
Model memorizes useful local patterns first.
What is being learned in this phase
In large-scale pre-training, the model is mostly learning broad structure: world knowledge, language regularities, code patterns, and reasoning traces from text continuation.
This is why early improvements can look mostly statistical, while later improvements reflect better internal representations. The model is not yet being optimized for assistant behavior such as refusal style or helpful tone.
Where alignment and safety enter
Alignment behavior is primarily shaped after pre-training. Post-training adds objectives such as following instructions, refusing unsafe requests, formatting answers clearly, asking clarifying questions, and staying helpful.
So this chapter is mostly about capability learning dynamics; the next chapter focuses on behavior shaping.
Important takeaway
Pretraining teaches broad capability through next-token prediction. The loss curve is a useful signal, but it is only one view of what the model is learning.
Loss is not the whole story
A lower loss generally means better prediction. It does not automatically mean better reasoning, better honesty, or better assistant behavior.
Chapter 13
Post-Training
Pretraining gives the model capability. Post-training shapes how that capability behaves.
A pretrained language model has learned a huge amount about text. It can continue patterns, imitate styles, answer some questions, write code, and represent many facts and concepts.
But that does not automatically make it a good assistant. A base model is trained to predict likely next tokens. If you ask it a question, it might answer, but it might also continue the prompt, imitate a webpage, produce messy completions, or behave inconsistently.
Post-training teaches the model how we want it to respond. Instruction tuning shows the model examples of prompts and good task-oriented answers. Preference tuning compares possible answers and trains the model toward the ones people prefer: clearer, safer, more useful, better formatted, less rambling.
Different systems use different methods: supervised fine-tuning, RLHF, DPO, constitutional approaches, and many variations. The details differ, but the high-level goal is the same.
From capability to assistant behavior
Pre-training creates broad capability; post-training shapes behavior. The same underlying model can respond very differently depending on which training stage it has gone through.
In practice, we can think of this as: pre-training learns knowledge and patterns, while post-training learns assistant behavior.
Capability vs Behavior
Pre-training
world knowledge, language, code, reasoning patterns
Post-training
follows instructions, refuses unsafe requests, formats answers, asks clarifying questions, uses a helpful tone
Three-stage pipeline
1. Base model (after pre-training)
**Objective:** Predict next token over large text/code corpora.
**Signal:** Web, books, code, and other broad unlabeled text.
Key message: pre-training gives broad latent capability, while instruction and preference tuning mostly steer behavior, format, and alignment.
Alignment and safety are not one switch; they are reinforced through multiple post-training signals, evaluations, and policy constraints.
Example prompt:
Explain why the sky is blue.
Sunlight passes through the atmosphere and shorter blue wavelengths scatter more than longer wavelengths. This process is called Rayleigh scattering and makes the sky appear blue from most viewing angles.
Not every model is trained with RLHF-style preference optimization. Some models stop at supervised instruction tuning, while others add direct preference objectives.
The goal is to make outputs more helpful, safer, and better aligned with human expectations when multiple answers are all technically plausible.
In short: pre-training teaches what the model can say, while preference tuning helps steer what it should say in assistant contexts.
How RLHF-Style Preference Tuning Works
Step 1 · Candidate answers
For one prompt, generate multiple candidate responses from the current model.
Step 2 · Pairwise ranking
Human raters (or policy-based systems) choose which answer is better in pairs. Example: A > B for helpfulness and safety.
Step 3 · Preference objective
Train a preference signal from those comparisons, then optimize the model so preferred responses become more likely.
Mini pairwise example
Prompt: How can I recover a deleted file?
**Answer A:** Gives clear, cautious, platform-specific recovery steps.
**Answer B:** Vague and omits safety checks.
Ranking: A > B (more useful and safer).
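One common way to turn a ranking like A > B into a training signal is a Bradley-Terry-style pairwise loss. The page does not name a specific objective, so treat this as an illustrative sketch: the scores are hypothetical scalar ratings that some scoring model assigns to each answer, and the loss is small when the preferred answer scores higher.

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected)), written in a numerically stable form."""
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical scores for Answer A (preferred) and Answer B (rejected).
loss_agrees = pairwise_preference_loss(2.0, -1.0)     # scorer already ranks A above B
loss_tie = pairwise_preference_loss(0.0, 0.0)         # scorer has no opinion
loss_disagrees = pairwise_preference_loss(-1.0, 2.0)  # scorer ranks B above A
```

Minimizing this loss pushes the preferred answer's score up relative to the rejected one. RLHF-style pipelines and direct preference methods differ in what the scores are and how the model is updated from them.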
Important takeaway
Pretraining mostly teaches what the model can do. Post-training strongly influences how, when, and in what style the model does it.
Assistant behavior
A post-trained assistant is not just a base model with more facts. It is a base model whose behavior has been shaped toward following instructions and user preferences.
Chapter 14
Context and KV Cache
Generating text one token at a time would be painfully wasteful without caching.
Decoder-only language models generate text autoregressively: one token at a time. Each new token depends on the tokens before it. So after generating a token, the model appends it to the context and runs another step to predict the next one.
Naively, this would repeat a lot of work. If the prompt has already been processed, why recompute the same keys and values for all earlier tokens again and again?
The KV cache solves that. During attention, the model computes key and value vectors for each token. These are exactly the things future tokens need when they attend back to previous context. So the model stores them.
During generation, each new token only needs to compute its own new keys and values and attend to the cached previous ones. The cache saves compute, but it uses memory. The longer the context, the larger the KV cache becomes.
It helps to separate two phases: prefill processes the prompt and builds the initial cache; decode generates new tokens one by one while reusing the cache.
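The prefill and decode phases can be sketched with a single attention head in plain Python. Everything here is a simplification: `qkv` stands in for learned projection matrices using fixed elementwise scales, the token vectors are made up, and prefill is written as a loop although real runtimes process the prompt in parallel. The point is only that attending over cached keys and values gives the same result as recomputing them.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all visible positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def qkv(x):
    # Hypothetical stand-in for learned Q, K, V projections.
    return [0.5 * v for v in x], [-1.0 * v for v in x], [2.0 * v for v in x]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

def step_with_cache(x, cache):
    q, k, v = qkv(x)              # only the new token is projected
    cache.keys.append(k)
    cache.values.append(v)
    return attend(q, cache.keys, cache.values)

def step_without_cache(sequence):
    projected = [qkv(x) for x in sequence]   # every earlier token is re-projected
    keys = [k for _, k, _ in projected]
    values = [v for _, _, v in projected]
    return attend(projected[-1][0], keys, values)

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

cache = KVCache()
for x in prompt:                              # prefill: build the initial cache
    step_with_cache(x, cache)
cached_out = step_with_cache(new_token, cache)            # decode: reuse the cache
uncached_out = step_without_cache(prompt + [new_token])   # naive recomputation
```

Both outputs match, while the cached path grows `cache.keys` and `cache.values` by one entry per token, which is the memory cost described above.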
Compute-memory tradeoff during inference
Decoding is autoregressive: each new token is generated after all previous tokens. KV cache changes the cost by reusing key/value tensors from earlier steps instead of recomputing them every time.
Decode setup
Prompt/context length: 10,581 tokens
Generated tokens: 63 tokens
Autoregressive decode loop
Compute reduction from caching
62.8x less repeated attention work in this toy estimate
Without cache
At each step, recompute attention keys/values for the full seen sequence.
Relative compute: 668,619
Memory behavior: lower KV storage, higher repeated compute.
With cache
Reuse stored K/V from previous tokens; compute only for the new token each step.
Relative compute: 10,644
Estimated KV memory: 34.1 MB for 10,644 seen tokens.
These values are illustrative relative estimates. Exact memory and speed depend on architecture, precision, head counts, and runtime implementation.
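The page does not state how it derives these figures, but one simple counting scheme reproduces them: without a cache, each decode step reprocesses every token seen so far; with a cache, each token's keys and values are computed once. The memory helper is a separate, generic estimate with hypothetical model dimensions, so its result is not expected to match the 34.1 MB shown above.

```python
prompt_tokens = 10_581
generated_tokens = 63

# Without a cache: decode step i touches the prompt plus the i tokens generated so far.
without_cache = sum(prompt_tokens + i for i in range(1, generated_tokens + 1))

# With a cache: keys/values are computed once per token, prompt and generated alike.
with_cache = prompt_tokens + generated_tokens

reduction = without_cache / with_cache

def kv_cache_bytes(seen_tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Generic estimate: one key and one value vector per token, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seen_tokens

# Hypothetical configuration, chosen only to show how the terms multiply.
example_bytes = kv_cache_bytes(with_cache, layers=32, kv_heads=8,
                               head_dim=128, bytes_per_value=2)
```

Under this counting, `without_cache` is 668,619, `with_cache` is 10,644, and the ratio rounds to 62.8. The memory estimate scales linearly with every factor, which is why grouped-query attention (fewer KV heads) and lower-precision caches both shrink it.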
Important takeaway
The KV cache is not a summary of the conversation. It is stored attention data that avoids recomputing previous keys and values during generation.
Speed vs memory
KV cache speeds up repeated attention over previous tokens, but it increases memory use as the context grows.
Chapter 15
Quantization
Big models are often limited by memory. Quantization makes them smaller by storing numbers with fewer bits.
Neural networks are mostly numbers. A large language model contains billions of weights, and during inference it also creates intermediate activations and KV-cache tensors. Storing all of that at high precision takes a lot of memory.
Quantization reduces that memory pressure by representing numbers with fewer bits. Instead of storing a weight as a 16-bit or 32-bit floating-point value, we may store an approximation using 8 bits, 4 bits, or another compact format.
The basic trade-off is simple: less precision -> less memory -> often faster or cheaper inference -> some approximation error.
But “4-bit” or “8-bit” is not the whole story. Different quantization methods make different choices. Some quantize only weights. Some also quantize activations. Some protect outlier channels. Some target the KV cache.
This is why two 4-bit models can behave differently. For local inference, quantization can be the difference between a model that does not fit in memory and a model that runs comfortably.
Bit-width vs quality and memory
Quantization stores model weights with fewer bits. The goal is to reduce memory and make local inference more practical, while accepting a small quality trade-off.
Quantization selector
**FP32:** Maximum precision, largest memory footprint.
Bits per value: 32 bits
Stored directly as floating-point values.
Weight Matrix (FP32)
Quantized values at selected precision
+0.18371234  -1.20491236  +0.00712091  +2.91823411  -0.55291337
+0.44204588  -0.99123817  +1.33100214  -0.22345518  +0.07620133
+3.12019843  -2.01444274  +0.55193302  -0.04721129  +1.77231055
-0.80911403  +2.20133044  -1.48320182  +0.19441726  -0.00990127
+0.61544281  -0.33611945  +1.00993218  -2.44211706  +0.43120572
Unique values in this 5×5 matrix: 25
Value range: -2.44211706 to +3.12019843
8B Model Size (Guesstimate)
Tradeoff: lower precision can slightly reduce accuracy or response quality, but it is often the key enabler for running strong models locally on consumer hardware.
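A rough size estimate for the weights alone is parameters times bits per weight, divided by eight to get bytes. The sketch below applies that to an 8-billion-parameter model. It is a back-of-the-envelope figure that ignores quantization metadata, activations, and the KV cache, and the format names are just labels for the bit widths.

```python
def model_size_gb(num_params, bits_per_weight):
    """Weights-only estimate in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

num_params = 8e9
sizes = {name: model_size_gb(num_params, bits)
         for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]}
```

Each halving of the bit width halves the estimate, which is how a model too large for a given machine at 16 bits can fit at 4 bits.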
Why numbers still look like floats in INT8/INT4: the model stores compact integers, then runtime kernels dequantize them back to approximate floating-point values during compute.
This chapter uses simplified estimates and symmetric quantization for intuition; real runtimes also include metadata, activation precision choices, and kernel-specific optimizations.
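Symmetric quantization and the dequantize step can be sketched in a few lines. This is a minimal per-tensor version under simple assumptions: one scale for the whole list, round-to-nearest, and a symmetric integer range such as -127..127 for 8 bits. The input is the first row of the example matrix above; real runtimes typically use per-channel or per-group scales and packed storage.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers using a single scale (zero stays exactly zero)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats; this is what compute kernels do at runtime."""
    return [q * scale for q in quantized]

row = [0.18371234, -1.20491236, 0.00712091, 2.91823411, -0.55291337]

q8, scale8 = quantize_symmetric(row, 8)
q4, scale4 = quantize_symmetric(row, 4)

err8 = max(abs(a - b) for a, b in zip(row, dequantize(q8, scale8)))
err4 = max(abs(a - b) for a, b in zip(row, dequantize(q4, scale4)))
```

With 4 bits only 15 integer levels are available, so small weights such as 0.18 collapse to zero and `err4` is much larger than `err8`. That gap is the approximation error the trade-off above refers to.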
Important takeaway
Quantization is controlled approximation. It reduces memory and often improves practical inference, but the quality depends on what is quantized and how.
A family of trade-offs
Quantization is not one technique. It is a family of compression and inference trade-offs.
Closing
Putting It All Together
You have now followed the full path through a language model.
Text becomes tokens. Tokens become vectors. Attention moves information between positions. Feed-forward layers rewrite each token representation. Transformer blocks repeat that pattern many times. The final representation is projected into logits. Softmax and sampling turn those logits into the next token.
Then the new token is appended, and the process repeats.
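The last two steps, turning logits into a sampled token and appending it, can be sketched as a loop. `toy_logits` is a hypothetical stand-in for the entire transformer stack (it simply favours the token after the last one), so only the softmax, temperature, sampling, and append steps reflect the real pipeline.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(context, vocab_size):
    # Hypothetical model: give the highest score to (last token + 1) mod vocab_size.
    favoured = (context[-1] + 1) % vocab_size
    return [3.0 if t == favoured else 0.0 for t in range(vocab_size)]

def generate(prompt, steps, vocab_size=5, temperature=1.0, seed=0):
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(steps):
        probs = softmax(toy_logits(context, vocab_size), temperature)
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        context.append(next_token)      # the new token becomes part of the context
    return context

output = generate([0, 1], steps=6)
near_greedy = generate([0, 1], steps=3, temperature=0.01)
```

Lowering the temperature sharpens the distribution, so `near_greedy` follows the favoured token at every step, while the default temperature leaves room for occasional less likely picks.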
Training teaches the model to build useful internal representations by predicting text. Post-training shapes those capabilities into assistant-like behavior. During inference, techniques like KV caching and quantization make the whole system practical enough to run at interactive speed.
This guide simplified many details on purpose. Real production LLMs include data pipelines, distributed training, specialized GPU kernels, safety systems, evaluation loops, alignment methods, serving infrastructure, and many engineering trade-offs.
But the core path is now visible:
02 Vectors: exchange information
03 Layers: rewrite representations
04 Final state: becomes a next-token distribution
The black box is still big, but it is no longer sealed.
Final takeaway
The model is no longer just “AI magic”. It is a chain of transformations that can be traced and reasoned about.
What we simplified
Real models use huge datasets, distributed training, mixed precision, specialized kernels, safety systems, and many architecture-specific details.
Where to go next
Watch visual explanations, read illustrated transformer walkthroughs, implement a tiny transformer, experiment with tokenizers, and compare real model configs.
References
- 3Blue1Brown — Neural Networks / Transformers: a visual, math-friendly series explaining neural networks, attention, and transformer internals.
- Jay Alammar — The Illustrated Transformer: a classic visual explanation of the original Transformer architecture and attention flow.
- Andrej Karpathy — Neural Networks: Zero to Hero: a code-first path from tiny neural networks to building a GPT-style model from scratch.
- Stanford CS336 — Language Modeling from Scratch: a modern course on building language models end-to-end: data, tokenization, training, scaling, evaluation, and deployment.
- Attention Is All You Need: the original Transformer paper.