@Potatoloogs: How LLMs Actually Work Inside: From Token to Next-Token – A Complete Overview of Nine Core Mechanisms
Summary
This article walks through the nine core mechanisms inside modern LLMs, from tokenization through embedding, positional encoding, attention, multi-head attention, and feed-forward networks to next-token prediction, and compares how different models vary architecturally.
Cached at: 06/08/26, 01:26 PM
How LLMs Actually Work: From Token to Next-Token — A Complete Walkthrough of Nine Core Mechanisms
a) Tokenization: The model doesn’t read text, it reads integers
· Text is first cut into subword pieces, then mapped to integer IDs; modern LLM vocabularies typically have tens of thousands to a few hundred thousand entries.
· Classic example: ask an LLM how many R’s are in “strawberry.” It struggles not because it cannot count, but because it never operates on letters, only on token IDs.
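The split-then-map step can be sketched with a toy tokenizer. This is a minimal illustration, not how any production tokenizer works: the vocabulary below is hypothetical and tiny, and the greedy longest-match rule stands in for learned schemes such as BPE, which would likely split "strawberry" differently.

```python
# Toy vocabulary (hypothetical): subword piece -> integer ID.
TOY_VOCAB = {"str": 0, "aw": 1, "berry": 2, "s": 3, "t": 4, "r": 5,
             "a": 6, "w": 7, "b": 8, "e": 9, "y": 10}

def tokenize(text, vocab):
    """Greedy longest-match split of text into pieces found in vocab."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces

def encode(text, vocab):
    """Map text to the integer IDs that the model actually receives."""
    return [vocab[piece] for piece in tokenize(text, vocab)]

pieces = tokenize("strawberry", TOY_VOCAB)
ids = encode("strawberry", TOY_VOCAB)
print(pieces)  # ['str', 'aw', 'berry']
print(ids)     # [0, 1, 2]
```

The model only ever sees `[0, 1, 2]`. The three R's are hidden inside two of those IDs, which is why letter-level questions are awkward for it.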
b) Embedding + Positional Encoding: Giving integers meaning and order
· Each token ID corresponds to a row vector in the embedding matrix (for a 7B model, typically 4096 dimensions); vectors for semantically similar words naturally cluster in space — this emerges from training, not from manual design.
· Early models used fixed sinusoidal positional encodings; modern models have largely shifted to RoPE (Rotary Position Embeddings). Instead of adding position information into the vector, RoPE rotates the Query and Key vectors by a position-dependent angle, so that relative distance appears naturally in the attention computation, without adding parameters.
· Practical implication: even with RoPE, LLMs still suffer from the “lost in the middle” problem — they utilize information at the beginning and end of a prompt much better than content in the middle. Prompt engineering tricks like “put important context first” are genuinely effective.
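The relative-distance property of RoPE can be checked numerically. The sketch below is a simplified version under stated assumptions: it rotates consecutive pairs of dimensions (some implementations pair dimensions differently), uses the commonly cited base of 10000, and the vectors `q` and `k` are made-up numbers.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]   # hypothetical query vector
k = [1.5, 0.3, -0.7, 1.0]    # hypothetical key vector

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(abs(score_near - score_far) < 1e-9)  # True
```

Because only the offset between the two positions affects the score, no position vector has to be added to the embeddings and no extra parameters are learned.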
c) Attention: How tokens exchange information
· Each token plays three roles simultaneously: Query (what am I looking for), Key (what can I match), Value (what gets passed along when matched).
· Interesting mechanism: Anthropic researchers described “induction heads” in 2022. These are attention heads that identify “A B … A” patterns: when they see the second A, they predict that B follows. This is one of the clearest known mechanisms behind in-context learning.
· Attention cost grows quadratically with sequence length — this is the fundamental reason long prompts are expensive.
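The Query/Key/Value roles and the quadratic cost can be sketched in a few lines. This is a minimal single-head version under simplifying assumptions: the same toy vectors are passed as queries, keys, and values (a real model first applies learned projections), and a causal mask restricts each token to itself and earlier tokens.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(queries[0])
    outputs = []
    scores_computed = 0
    for t, q in enumerate(queries):
        # Token t compares its query against the keys of positions 0..t.
        scores = [dot_qk / math.sqrt(d)
                  for dot_qk in (sum(a * b for a, b in zip(q, keys[s]))
                                 for s in range(t + 1))]
        scores_computed += len(scores)
        weights = softmax(scores)
        # The output is a weighted mix of the matched tokens' values.
        out = [sum(w * values[s][j] for s, w in enumerate(weights))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs, scores_computed

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # 4 toy token vectors
out, n_scores = causal_attention(x, x, x)
print(out[0])    # [1.0, 0.0]: the first token can only attend to itself
print(n_scores)  # 10, i.e. 4 * 5 / 2
```

With n tokens the loop computes n(n+1)/2 scores, so doubling the prompt length roughly quadruples the work.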
d) Multi-Head Attention: A common misconception
· Common misunderstanding: an attention head does not take a slice of the token vector. Each head applies its own projection matrices to the full vector, mapping it into a smaller subspace; it is a different “perspective” on the same token, not a different “slice.”
· Head specialization emerges from training; no one tells each head what to do: some heads track syntax, some handle pronoun resolution, some identify positional patterns.
· GQA (Grouped Query Attention): multiple query heads share a smaller number of key/value heads, dramatically reducing KV cache memory usage with almost no accuracy loss. LLaMA-2 70B has 64 query heads but only 8 KV heads; Mistral 7B uses the same scheme (32 query heads, 8 KV heads).
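The KV cache saving from GQA is simple arithmetic. The sketch below uses the 64 and 8 head counts quoted above; the layer count (80), head dimension (128), and 2 bytes per value (16-bit storage) are assumptions added for illustration, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (the factor 2) are stored for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """In GQA, each group of consecutive query heads shares one KV head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Assumed values, not stated in the text above.
N_LAYERS, HEAD_DIM = 80, 128

mha = kv_cache_bytes_per_token(N_LAYERS, 64, HEAD_DIM)  # one KV head per query head
gqa = kv_cache_bytes_per_token(N_LAYERS, 8, HEAD_DIM)   # 8 shared KV heads
print(mha, gqa, mha // gqa)  # 2621440 327680 8

# Which KV head serves query heads 0, 7, 8 and 63?
print([kv_head_for_query_head(h, 64, 8) for h in (0, 7, 8, 63)])  # [0, 0, 1, 7]
```

Under these assumptions the cache shrinks from about 2.5 MiB to 320 KiB per token, an 8x reduction that matches the ratio of query heads to KV heads.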
e) Feed-Forward Network: The severely underestimated half
· Attention handles communication between tokens; the FFN handles each token’s own deep processing. Both are indispensable.
· Counterintuitive fact: in dense models, most parameters live in the FFN, not in attention.
· FFN as the model’s “notebook”: researchers have shown that directly editing FFN weights can change the model’s factual knowledge without retraining. With the ROME method (Rank-One Model Editing), changing “The Eiffel Tower is in Paris” to “in Rome” requires only a rank-one edit to specific FFN weights.
· MoE (Mixture of Experts): multiple parallel FFNs per layer, with a router activating only a few of them per token. Mixtral 8x7B has 46.7B total parameters, but each token uses only about 12.9B — this is the core idea behind scaling up parameters without linearly increasing inference cost.
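Two of the points above can be made concrete: the FFN's share of parameters in a dense layer, and top-k expert routing. This is a rough sketch under stated assumptions: the sizes 4096 and 11008 are typical of a 7B-class model but are not from the text, the count ignores norms, biases, and embeddings, the router logits are made up, and k=2 with 8 experts mirrors the setup reported for Mixtral.

```python
import math

def dense_layer_params(d_model, d_ff):
    """Rough per-layer weight counts, ignoring norms and biases.

    Assumes four attention projections (Q, K, V, output) without GQA
    and a gated FFN with three weight matrices.
    """
    attention = 4 * d_model * d_model
    ffn = 3 * d_model * d_ff
    return attention, ffn

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    # Softmax over the selected experts only.
    exps = [math.exp(router_logits[e] - router_logits[top[0]]) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]

attn, ffn = dense_layer_params(d_model=4096, d_ff=11008)
print(attn, ffn)                     # 67108864 135266304
print(round(ffn / (attn + ffn), 2))  # 0.67

# One token's router scores over 8 experts (made-up numbers).
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]
chosen = route_top_k(logits, k=2)
print([e for e, _ in chosen])  # [1, 4]
```

Only the chosen experts' FFNs run for that token, and their outputs are combined using the returned weights. The other six experts cost memory but no compute, which is how total and per-token parameter counts can diverge so widely.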
f) Where the real differences between models lie
· GPT, Claude, Gemini, and LLaMA are largely similar at the architecture level; the differences come from three areas: training data and scale, configuration (number of layers, number of heads, whether MoE is used), and post-training (instruction fine-tuning, preference alignment, safety controls).
· Between 2023 and 2025, modern transformers converged on several key designs: pre-norm, RMSNorm, RoPE, SwiGLU, and GQA. Different teams arrived at the same choices independently.
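Two of the converged designs, RMSNorm and pre-norm, are small enough to sketch directly. This is a minimal pure-Python illustration, not any model's actual implementation: the epsilon value is an assumption, and `sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def pre_norm_block(x, sublayer, gain):
    """Pre-norm residual: normalize the sublayer's input, not its output."""
    y = sublayer(rms_norm(x, gain))
    return [a + b for a, b in zip(x, y)]

x = [3.0, -4.0, 0.0, 0.0]
normed = rms_norm(x, gain=[1.0] * 4)
print(round(math.sqrt(sum(v * v for v in normed) / len(normed)), 3))  # 1.0

# With an identity sublayer, the block returns x plus its normalized copy.
out = pre_norm_block(x, lambda h: h, gain=[1.0] * 4)
```

Unlike LayerNorm, RMSNorm does not subtract the mean, and in the pre-norm arrangement the residual path carries `x` through unnormalized.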