@Potatoloogs: How LLMs Actually Work Inside: From Token to Next-Token – A Complete Overview of Nine Core Mechanisms

X AI KOLs Timeline 06/08/26, 07:44 AM News

llm transformer tutorial deep-learning attention tokenization

Summary

This article walks through the nine core mechanisms inside modern LLMs, from tokenization to next-token prediction (embedding, positional encoding, attention, multi-head attention, feed-forward networks, and others), and compares architectural differences between models.



Cached at: 06/08/26, 01:26 PM

How LLMs Actually Work: From Token to Next-Token — A Complete Walkthrough of Nine Core Mechanisms

a) Tokenization: The model doesn’t read text, it reads integers
· Text is first cut into subword pieces, and each piece is then mapped to an integer ID; modern LLM vocabularies typically hold tens of thousands to a few hundred thousand entries.
· Classic example: ask an LLM to count the R’s in “strawberry” and it often gets the answer wrong. The problem is not that the model can’t count; it never operates on letters, only on token IDs.
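
The two steps above can be sketched with a toy tokenizer. This is a minimal illustration, not how any production tokenizer is implemented: the vocabulary, the IDs, and the greedy longest-match rule are all invented here, and a real tokenizer (for example one trained with BPE) may split "strawberry" differently.

```python
# Toy greedy longest-match subword tokenizer.
# TOY_VOCAB and its IDs are hypothetical; real vocabularies are learned from data.
TOY_VOCAB = {"straw": 101, "berry": 102, "s": 1, "t": 2, "r": 3, "a": 4,
             "w": 5, "b": 6, "e": 7, "y": 8}

def tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, left to right."""
    pieces, ids, i = [], [], 0
    while i < len(text):
        # Try the longest candidate starting at i first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                pieces.append(piece)
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece covers {text[i]!r}")
    return pieces, ids

pieces, ids = tokenize("strawberry")
print(pieces)  # ['straw', 'berry']
print(ids)     # [101, 102]
```

Everything downstream of this step sees only `[101, 102]`. The three R's are spread across two opaque IDs, which is why letter-level questions are awkward for the model.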

b) Embedding + Positional Encoding: Giving integers meaning and order
· Each token ID indexes a row vector in the embedding matrix (typically 4096 dimensions for a 7B model); vectors for semantically similar words cluster together in that space, a structure that emerges from training rather than from manual design.
· Early models encoded position with sinusoidal waves added to the embeddings; modern models have largely shifted to RoPE (Rotary Position Embedding): instead of adding position information to the token vector, it rotates the Query and Key vectors so that relative distance appears directly in the attention computation, with no extra learned parameters.
· Practical implication: even with RoPE, LLMs still suffer from the “lost in the middle” problem: they use information at the beginning and end of a prompt much better than content in the middle. Prompting habits like “put important context first” are genuinely effective.
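
The claim that rotation makes relative distance "appear" in attention can be checked numerically. The sketch below is a simplified RoPE under stated assumptions: it pairs adjacent dimensions and uses base 10000, whereas real implementations differ in how they pair dimensions; the vectors `q` and `k` are made-up values.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # later pairs rotate more slowly
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # hypothetical query vector
k = [1.0, 0.4, -0.7, 0.2]   # hypothetical key vector

# Same relative offset (3 positions apart), very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(abs(near - far) < 1e-9)  # True
```

The attention score depends only on the offset between the two positions, and `rope` has no trainable weights, which matches the "no extra parameters" point above.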

c) Attention: How tokens exchange information
· Each token plays three roles at once: Query (what am I looking for), Key (what can I be matched on), Value (what I pass along when matched).
· Interesting mechanism: in 2022 Anthropic described “induction heads”, attention heads that pick up “A B … A” patterns; on seeing the second A, they push the model to predict that B follows. This is one of the clearest known mechanisms behind in-context learning.
· Attention cost grows quadratically with sequence length, because every token is scored against every other token it can see; this is the fundamental reason long prompts are expensive.
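
A minimal sketch of the Query/Key/Value exchange, assuming a single head with a causal mask and skipping the learned projections that real models apply to produce Q, K, and V. The function names and the toy input `X` are illustrative only. The counter makes the quadratic growth visible.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask."""
    n, d = len(Q), len(Q[0])
    out, score_count = [], 0
    for i in range(n):
        # Token i may only look at positions 0..i.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        score_count += len(scores)
        weights = softmax(scores)
        # Output is the weighted mix of the visible Value vectors.
        out.append([sum(w * V[j][c] for j, w in enumerate(weights))
                    for c in range(len(V[0]))])
    return out, score_count

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # toy token vectors
out, count = causal_attention(X, X, X)
print(out[0])  # [1.0, 0.0]
print(count)   # 10
```

With 4 tokens there are 10 query-key scores; with 8 tokens there are 36. The count is n(n+1)/2 under the causal mask, so doubling the prompt roughly quadruples the work.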

d) Multi-Head Attention: A common misconception
· The common misunderstanding is that each head receives a slice of the token vector. It does not: each head applies its own projection matrices to the full vector, mapping it into a smaller subspace. Heads are different “perspectives” on the same token, not different “slices.”
· Head specialization emerges from training; no one tells a head what to do. Some heads track syntax, some handle pronoun resolution, some pick up positional patterns.
· GQA (Grouped Query Attention): several query heads share a smaller number of key/value heads, which cuts KV cache memory sharply with almost no accuracy loss. LLaMA-2 70B has 64 query heads but only 8 KV heads; Mistral 7B also uses GQA (32 query heads, 8 KV heads).
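
The memory saving from GQA is simple arithmetic. The sketch below assumes an 80-layer model with head dimension 128 and 16-bit cache values, which is the commonly cited LLaMA-2 70B shape but is not stated in the text above; the helper names are hypothetical, and the contiguous grouping of query heads is one common convention.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (the factor 2) stored for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the KV head a given query head reads from under GQA."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Assumed shape: 80 layers, head_dim 128, 2 bytes per cached value.
mha = kv_cache_bytes_per_token(80, 64, 128)  # one KV head per query head
gqa = kv_cache_bytes_per_token(80, 8, 128)   # 8 shared KV heads
print(mha, gqa, mha // gqa)  # 2621440 327680 8

# Query heads 0..7 share KV head 0, heads 8..15 share KV head 1, and so on.
print([kv_head_for(h, 64, 8) for h in (0, 7, 8, 63)])  # [0, 0, 1, 7]
```

Under these assumptions the cache shrinks from about 2.5 MiB to 320 KiB per token, an 8x reduction that scales with every token kept in context.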

e) Feed-Forward Network: The severely underestimated half
· Attention handles communication between tokens; the FFN handles the deeper processing of each token on its own. Both are indispensable.
· Counterintuitive fact: in dense models, most parameters live in the FFN, not in attention.
· The FFN as the model’s “notebook”: researchers have shown that directly editing FFN weights can change a model’s factual knowledge without retraining. With the ROME (Rank-One Model Editing) method, changing “The Eiffel Tower is in Paris” to “in Rome” takes only a rank-one edit to specific FFN weights.
· MoE (Mixture of Experts): each layer holds multiple parallel FFNs, and a router activates only a few of them per token. Mixtral 8x7B has 46.7B total parameters, but each token uses only about 12.9B. This is the core idea behind scaling up parameter count without a proportional increase in inference cost.
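
Two of the points above can be made concrete in a few lines. The first function counts per-layer weights for a dense block, assuming a SwiGLU FFN with three matrices, full multi-head attention, no biases, and the sizes 4096 and 11008 (typical of a 7B LLaMA-style model but not given in the text). The second part is a toy top-2 router: the "experts" and router scores are invented, and the softmax over only the selected experts is one common gating choice, not necessarily what every MoE model does.

```python
import math

def dense_layer_params(d_model, d_ff):
    """Weight counts for one dense transformer layer (biases and norms ignored)."""
    attn = 4 * d_model * d_model   # Wq, Wk, Wv, Wo
    ffn = 3 * d_model * d_ff       # SwiGLU: gate, up, and down projections
    return attn, ffn

attn, ffn = dense_layer_params(4096, 11008)
print(round(ffn / (attn + ffn), 2))  # 0.67

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)
    exps = [math.exp(router_logits[e] - m) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]

def moe_ffn(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for e, gate in top_k_route(router_logits, k):
        y = experts[e](x)          # the other experts are never evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out

# Eight toy "experts": each simply scales its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # made-up router scores
print(top_k_route(logits))                     # [(1, 0.5), (4, 0.5)]
print(moe_ffn([1.0, -2.0], experts, logits))   # [3.5, -7.0]
```

Only two of the eight experts run for this token, so compute per token tracks the active parameters rather than the total, which is the gap between Mixtral's 46.7B and 12.9B figures.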

f) Where the real differences between models lie
· GPT, Claude, Gemini, and LLaMA are largely similar at the architecture level; the differences come from three areas: training data and scale, configuration (number of layers and heads, whether MoE is used), and post-training (instruction fine-tuning, preference alignment, safety controls).
· Between 2023 and 2025, modern transformers converged on several key design choices: Pre-norm, RMSNorm, RoPE, SwiGLU, GQA. Different teams independently arrived at the same decisions.
