From Tokens to States: LLMs as a Special Case of World Models and the Continuous Path Beyond
Summary
This opinion paper argues that large language models are a degenerate special case of world models, not a separate paradigm, and proposes a continuous spectrum from next-token prediction to latent-space architectures like JEPA, examining the data and architecture challenges along this path.
# LLMs as a Special Case of World Models and the Continuous Path Beyond
Source: [https://arxiv.org/html/2606.28127](https://arxiv.org/html/2606.28127)
(Opinion Paper, Draft, June 2026)
###### Abstract
The AI community has framed the relationship between large language models (LLMs) and world models as a dichotomy: LLMs predict tokens; world models simulate reality. LeCun ([2022](https://arxiv.org/html/2606.28127#bib.bib2)) argues that reaching general intelligence requires abandoning autoregressive token prediction in favour of latent-space architectures. This framing is unnecessarily binary. Two claims will be defended. First, LLMs are a degenerate special case of world models: the state space is the set of all token sequences, the only action is appending one token, and world models are therefore a strict generalisation of LLMs, not a replacement. Second, there is a natural continuous spectrum from next-token prediction (NTP) to JEPA, with multi-token prediction, future-summary prediction, and next-latent prediction as intermediate stations already populated by current research. Moving along this spectrum relaxes the LLM constraints one by one. It also progressively surrenders the two practical advantages that make LLMs trainable at scale: internet-scale self-supervised data, and a transformer architecture co-designed for discrete token prediction. Both are examined as open research questions: the *data question* (the cliff from self-supervised text to instrumented action-labelled environments) and the *architecture question* (whether the transformer generalises to continuous-state prediction, or whether a new primitive is needed).
###### Contents
1. [1 Introduction](https://arxiv.org/html/2606.28127#S1)
2. [2 LLMs as a Special Case of World Models](https://arxiv.org/html/2606.28127#S2)
   1. [2.1 World Model Formalism](https://arxiv.org/html/2606.28127#S2.SS1)
   2. [2.2 Formal Embedding of LLMs](https://arxiv.org/html/2606.28127#S2.SS2)
   3. [2.3 LLM Constraints Within the World-Model Class](https://arxiv.org/html/2606.28127#S2.SS3)
   4. [2.4 Empirical Support: World Models Inside LLMs](https://arxiv.org/html/2606.28127#S2.SS4)
3. [3 The Continuous Spectrum](https://arxiv.org/html/2606.28127#S3)
   1. [3.1 Step-by-Step Analysis](https://arxiv.org/html/2606.28127#S3.SS1)
   2. [3.2 The Data Question](https://arxiv.org/html/2606.28127#S3.SS2)
   3. [3.3 The Architecture Question](https://arxiv.org/html/2606.28127#S3.SS3)
   4. [3.4 LeCun’s Critique Revisited](https://arxiv.org/html/2606.28127#S3.SS4)
4. [4 Discussion and Conclusion](https://arxiv.org/html/2606.28127#S4)
5. [References](https://arxiv.org/html/2606.28127#bib)
## 1 Introduction
A *world model* tracks the state of an environment and predicts how it evolves over time. It supports planning by simulating the consequences of actions before committing to them (Ha and Schmidhuber, [2018](https://arxiv.org/html/2606.28127#bib.bib1); LeCun, [2022](https://arxiv.org/html/2606.28127#bib.bib2)). LLMs, trained to predict the next token in a sequence, look like an entirely different object. [LeCun](https://arxiv.org/html/2606.28127#bib.bib2)’s ([2022](https://arxiv.org/html/2606.28127#bib.bib2)) position paper formalises this intuition. His Joint Embedding Predictive Architecture (JEPA) predicts in latent space rather than token space, conditioned on agent actions. He explicitly contrasts “word models” with “world models” and argues that token prediction cannot yield the structured representations needed for planning and reasoning.
This framing is imprecise. “World model” is not the name of a specific architecture; it is a formal class parameterised by a state space, an action space, and a transition function. LLMs are a member of this class with specific, restrictive choices for each parameter. The implication is not that LLMs are good enough, but that the path to more powerful world models is a progressive relaxation of those choices, not a clean break.
The key empirical support for this view comes from mechanistic interpretability: LLMs trained only on token sequences demonstrably build internal world-model representations in their hidden activations. Li et al. ([2024](https://arxiv.org/html/2606.28127#bib.bib3)) and Nanda et al. ([2023](https://arxiv.org/html/2606.28127#bib.bib4)) show that OthelloGPT encodes the full board state linearly in its activations at >99% accuracy, despite seeing only move tokens. Chess-playing LMs replicate this (Karvonen, [2024](https://arxiv.org/html/2606.28127#bib.bib5)). Llama-2 encodes linear representations of geographic space and calendar time (Gurnee and Tegmark, [2024](https://arxiv.org/html/2606.28127#bib.bib6)). These results show that the world model is in the activations, not the tokens; the tokens are the interface, not the representation.
The continuous-spectrum view is also supported by recent architectures: multi-token prediction (Gloeckle et al., [2024](https://arxiv.org/html/2606.28127#bib.bib7)), future-summary prediction (Mahajan et al., [2026](https://arxiv.org/html/2606.28127#bib.bib10)), and next-latent prediction (Teoh et al., [2026](https://arxiv.org/html/2606.28127#bib.bib9)) each populate a point between standard NTP and JEPA. Li et al. ([2026](https://arxiv.org/html/2606.28127#bib.bib15)) empirically confirm that LLMs can function as text-based world models at sufficient scale. No prior work makes the formal containment claim or unifies these architectures under a single spectrum argument.
## 2 LLMs as a Special Case of World Models
### 2.1 World Model Formalism
A world model is a tuple $(S, A, T, \rho_0)$: $S$ is the state space, $A$ the action space, $T: S \times A \to \mathcal{P}(S)$ the transition function, and $\rho_0$ the initial state distribution. Planning proceeds by iteratively querying $T$ to simulate a trajectory $(s_0, a_0, s_1, a_1, \ldots)$ in imagination before committing to real actions (Ha and Schmidhuber, [2018](https://arxiv.org/html/2606.28127#bib.bib1)) (Figure [1](https://arxiv.org/html/2606.28127#S2.F1)).
[Figure 1 diagram: state $s_t$ → transition $T(s_t, a_t)$ → state $s_{t+1}$ → transition $T(s_{t+1}, a_{t+1})$ → state $s_{t+2}$, with actions $a_t$ and $a_{t+1}$ feeding the transitions and a sample step producing each next state.]
Figure 1: A world model iteratively applies the transition function to simulate multi-step trajectories. State and action spaces are left intentionally abstract.
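The planning loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `WorldModel`, `sample`, `imagine`, and the toy `walk_transition` environment are hypothetical and do not come from the cited works, and distributions are represented as finite dictionaries for simplicity.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Dist = Dict[State, float]  # finite distribution: outcome -> probability


@dataclass
class WorldModel:
    """The tuple (S, A, T, rho_0); S and A are implicit in the callables' domains."""
    transition: Callable[[State, Action], Dist]  # T: S x A -> P(S)
    initial: Dist                                 # rho_0


def sample(dist: Dist, rng: random.Random) -> State:
    """Draw one outcome from an {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def imagine(model: WorldModel, policy: Callable[[State], Action],
            horizon: int, rng: random.Random) -> Tuple[List[Tuple[State, Action]], State]:
    """Simulate (s0, a0, s1, a1, ...) by querying T only, never the real environment."""
    state = sample(model.initial, rng)
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        trajectory.append((state, action))
        state = sample(model.transition(state, action), rng)
    return trajectory, state


# Hypothetical toy environment: a walker on the integers that slips
# (stays in place) with probability 0.1.
def walk_transition(state: int, action: int) -> Dist:
    return {state + action: 0.9, state: 0.1}


model = WorldModel(transition=walk_transition, initial={0: 1.0})
trajectory, final_state = imagine(model, policy=lambda s: 1, horizon=5,
                                  rng=random.Random(0))
```

A planner would call `imagine` many times under different candidate policies and compare the resulting trajectories before acting.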
### 2.2 Formal Embedding of LLMs
> Claim: Every autoregressive LLM can be formally embedded in the world-model formalism, making world models a strict generalisation of LLMs: $\text{LLMs} \subset \text{World Models}$.
Let $V$ be the vocabulary (a finite set of tokens). Define:
- $S = V^{*}$: the state space is the set of all finite token sequences (the current context)
- $A = V$: the only “action” is choosing the next token
- $T(s, a) = \delta_{s \cdot a}$: the transition is *deterministic*; appending token $a$ to sequence $s$ yields $s \cdot a$
- The LLM provides $\pi_{\theta}: V^{*} \to \mathcal{P}(V)$: the policy for selecting the next action
Figure [2](https://arxiv.org/html/2606.28127#S2.F2) illustrates this mapping concretely.
[Figure 2 diagram: state $s_t$ = token sequence "The cat sat on the"; the LLM policy $\pi(\cdot \mid s_t)$ selects action $a_t$ = next token "mat"; a deterministic append yields state $s_{t+1}$ = "The cat sat on the mat".]
Figure 2: The LLM as world model: state = token sequence, action = next token, transition = deterministic append.
One subtlety: in a standard world model, $T$ encodes external world dynamics and $\pi$ is the agent. In the LLM case, the transition is trivial (append) and all content is in the policy. The LLM simultaneously acts as world simulator and agent. This conflation is a defining feature of the LLM special case.
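The embedding can be written out directly. The sketch below is illustrative only: `VOCAB` and `toy_policy` are hypothetical stand-ins (a real LLM would supply $\pi_\theta$), and the example sentence is the one used in Figure 2.

```python
from typing import Dict, Tuple

Token = str
State = Tuple[Token, ...]  # an element of V*, i.e. the current context

VOCAB = ("the", "cat", "sat", "on", "mat")  # toy vocabulary V (hypothetical)


def transition(state: State, action: Token) -> Dict[State, float]:
    """T(s, a) = delta_{s.a}: all probability mass on the appended sequence."""
    return {state + (action,): 1.0}


def toy_policy(state: State) -> Dict[Token, float]:
    """Stand-in for pi_theta: V* -> P(V); hand-written, not a trained model."""
    if state and state[-1] == "the" and "on" in state:
        return {"mat": 0.9, "cat": 0.1}
    return {token: 1.0 / len(VOCAB) for token in VOCAB}


s_t = ("the", "cat", "sat", "on", "the")
pi = toy_policy(s_t)
a_t = max(pi, key=pi.get)  # greedy choice of the next action
s_next, prob = next(iter(transition(s_t, a_t).items()))
```

Note that `transition` contains no learned parameters at all; everything the model knows sits in the policy, which is the conflation of simulator and agent described above.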
### 2.3 LLM Constraints Within the World-Model Class
The containment is strict because general world models over continuous state spaces, conditioned on external actions, are not LLMs. Figure [3](https://arxiv.org/html/2606.28127#S2.F3) shows the resulting subset hierarchy. Compared to a general world model, LLMs impose five constraints (Table [1](https://arxiv.org/html/2606.28127#S2.T1)):
Table 1: The five constraints that characterise LLMs as a special case of world models.
The last two constraints are tightly coupled and explain why LLMs have scaled so well: NTP on text requires no labels, no sensors, no instrumented environments; only internet text ($\sim 10^{13}$ tokens). The transformer architecture was co-evolved with this objective, making the pairing exceptionally efficient. These are not bugs, but they are constraints, and relaxing them is what moving up the spectrum means.
[Figure 3 diagram: nested sets, from outermost to innermost: World Models (arbitrary $S$, $A$, $T$) ⊃ Latent Predictors (continuous $S$, multi-step $T$) ⊃ Multi-Token Predictors (token state, $k$-step) ⊃ Standard LLMs (token state, 1-step). Systems placed in the diagram: GPT-4o, Claude, Mistral, Llama, Gemini, DeepSeek-V3, Ling-V2, NextLat, LLM-JEPA, I-JEPA, V-JEPA 2, Genie 3, Cosmos.]
Figure 3: Containment hierarchy as nested sets. LLMs (innermost box) are the most constrained special case. Each outer ring relaxes one constraint. This is a *subset* relation, not a replacement.
### 2.4 Empirical Support: World Models Inside LLMs
This framing predicts that LLMs should develop internal world-model representations in their hidden activations, and they do. OthelloGPT (Li et al., [2024](https://arxiv.org/html/2606.28127#bib.bib3); Nanda et al., [2023](https://arxiv.org/html/2606.28127#bib.bib4)) is the cleanest demonstration: trained only on move tokens, its hidden activations encode the full board state linearly at >99% accuracy. The world model is in the *activations*, not in the tokens; the tokens are the interface ($A = V$), while the latent world state lives in the hidden states (Figure [4](https://arxiv.org/html/2606.28127#S2.F4)).
[Figure 4 diagram: input embedding layer (move tokens → token vectors) → transformer layers connected by the residual stream, whose hidden activations encode world state: board position, piece ownership, legal moves (linearly decodable at >99% accuracy) → output head (hidden state → next move distribution). The embedding and output layers are labelled "token interface"; the transformer layers are labelled "world model lives here".]
Figure 4: OthelloGPT is a standard transformer trained only on move tokens. The token interface (blue) handles discrete move symbols; the transformer layers (green) are where the world model resides. This maps directly to Claim 1: tokens are the action space $A = V$; the latent world state lives in the hidden activations.
The same structure appears at scale. Gurnee and Tegmark ([2024](https://arxiv.org/html/2606.28127#bib.bib6)) show Llama-2 encodes linear representations of geographic space and calendar time. Dong et al. ([2025](https://arxiv.org/html/2606.28127#bib.bib11)) show prompt-level hidden states encode global attributes of the entire future response, not just the next token. The consistent pattern is that LLMs develop internal world representations far richer than their token-level objective requires.
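To make "linearly decodable" concrete, the following sketch trains a linear probe on synthetic stand-in activations. Everything here is an assumption for illustration: the activations are generated artificially (a binary state variable written along one hidden direction, plus bounded noise, then rotated), the perceptron is a convenient choice of linear classifier, and none of it reproduces the probing setup of the cited papers.

```python
import random
from typing import List, Tuple

# Fixed orthogonal matrix (a scaled 4x4 Hadamard matrix). It mixes the
# coordinates so that no single dimension carries the state variable alone.
ROT = [[0.5, 0.5, 0.5, 0.5],
       [0.5, -0.5, 0.5, -0.5],
       [0.5, 0.5, -0.5, -0.5],
       [0.5, -0.5, -0.5, 0.5]]


def fake_activation(label: int, rng: random.Random) -> List[float]:
    """Synthetic hidden state: label (+1/-1) on one axis, bounded noise on the rest, rotated."""
    raw = [float(label)] + [rng.uniform(-0.5, 0.5) for _ in range(3)]
    return [sum(ROT[i][j] * raw[j] for j in range(4)) for i in range(4)]


def make_dataset(n: int, rng: random.Random) -> List[Tuple[List[float], int]]:
    data = []
    for _ in range(n):
        label = rng.choice([-1, 1])
        data.append((fake_activation(label, rng), label))
    return data


def dot(w: List[float], h: List[float]) -> float:
    return sum(wi * hi for wi, hi in zip(w, h))


def train_probe(data: List[Tuple[List[float], int]], max_epochs: int = 100) -> List[float]:
    """Perceptron: a purely linear read-out with no hidden layer."""
    w = [0.0] * 4
    for _ in range(max_epochs):
        mistakes = 0
        for h, y in data:
            if y * dot(w, h) <= 0:  # misclassified (or undecided): update
                w = [wi + y * hi for wi, hi in zip(w, h)]
                mistakes += 1
        if mistakes == 0:
            break
    return w


def accuracy(w: List[float], data: List[Tuple[List[float], int]]) -> float:
    return sum(1 for h, y in data if y * dot(w, h) > 0) / len(data)


rng = random.Random(0)
train, held_out = make_dataset(200, rng), make_dataset(200, rng)
probe = train_probe(train)
probe_accuracy = accuracy(probe, held_out)
```

Because the synthetic data is linearly separable by construction, the probe recovers the hidden variable on held-out samples. The empirical claim in the cited work is that real transformer activations behave similarly for board state, which is a far less trivial finding than this toy case.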
## 3 The Continuous Spectrum
> Claim: There is a natural continuous spectrum running from NTP to JEPA, with each step relaxing exactly one LLM constraint. Moving along this spectrum also progressively sacrifices the two practical advantages that make LLMs trainable at scale: internet-scale self-supervised data, and a well-matched transformer architecture.

| Station | Examples | Predicts | Data | Architecture |
| --- | --- | --- | --- | --- |
| NTP / LLMs | GPT-4o, Claude, Llama | 1 next token | internet text, $\sim 10^{13}$ | transformer ✓✓✓ |
| MTP | DeepSeek-V3, Ling-V2 | $k$ next tokens | internet text, $\sim 10^{13}$ | transformer ✓✓✓ |
| Future Summary | Mahajan et al. 2025 | compressed future | internet text, $\sim 10^{12}$ | transformer ✓✓ |
| Next-Latent | NextLat, LLM-JEPA | next latent state | self-supervised, $\sim 10^{11}$ | transformer? ✓ |
| JEPA | I-JEPA, V-JEPA 2, Cosmos | action → latent | instrumented envs, $\sim 10^{9}$ | open question |

Transitions between successive stations: relax granularity, compress output, latent state, add actions.

Figure 5: The spectrum from LLMs to JEPA. Each node shows the prediction objective, training data scale, and architecture fit. Moving right relaxes one world-model constraint but also degrades both LLM practical advantages.
### 3.1 Step-by-Step Analysis
Figure [5](https://arxiv.org/html/2606.28127#S3.F5) shows the five stations of the spectrum; the following paragraphs examine each transition.
#### NTP → MTP.
This step relaxes the “one token per step” constraint. Gloeckle et al. ([2024](https://arxiv.org/html/2606.28127#bib.bib7)) show that predicting the next $k$ tokens simultaneously via $k$ independent heads improves reasoning and code performance (+15% on MBPP; adopted in DeepSeek-V3). Zhong et al. ([2026](https://arxiv.org/html/2606.28127#bib.bib16)) provide the theoretical mechanism: MTP promotes convergence toward internal belief states via representational contractivity. This step costs nothing in data or architecture: same internet-scale text, same transformer, only $k-1$ extra output heads. It is a near-free upgrade.
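A minimal sketch of how the same text yields MTP training targets is given below. The function names are hypothetical, and the equal-weight sum of per-head cross-entropies in `mtp_loss` is an assumption made for illustration, not a statement of the exact objective used in the cited work.

```python
import math
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Tuple[str, ...], Tuple[str, ...]]


def mtp_examples(tokens: Sequence[str], k: int) -> List[Example]:
    """Return (context, targets) pairs; head i is trained on targets[i],
    the token i + 1 positions ahead of the context. k = 1 recovers NTP."""
    examples = []
    for t in range(1, len(tokens) - k + 1):
        examples.append((tuple(tokens[:t]), tuple(tokens[t:t + k])))
    return examples


def mtp_loss(head_dists: Sequence[Dict[str, float]], targets: Sequence[str]) -> float:
    """Sum of per-head cross-entropies (equal weighting is an assumption)."""
    return -sum(math.log(dist[target]) for dist, target in zip(head_dists, targets))


tokens = ["the", "cat", "sat", "on", "the", "mat"]
ntp = mtp_examples(tokens, k=1)  # one target per context
mtp = mtp_examples(tokens, k=3)  # three targets per context, same text
```

The data source is unchanged between the two calls; only the number of targets read off each position differs, which is why this step preserves the data advantage.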
#### MTP → Future Summary.
This step decouples the prediction target from the token space. Mahajan et al. ([2026](https://arxiv.org/html/2606.28127#bib.bib10)) train an auxiliary head to predict a compressed representation of the long-term future (bag-of-words or reverse-LM embedding), improving maths and reasoning at 3B–8B scale. Training data remains internet text with modest preprocessing; architecture is unchanged.
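The bag-of-words variant of such a target can be illustrated as follows. The window length, the multi-hot encoding, and the function name `future_bag_of_words` are assumptions chosen for the sketch; the cited paper's exact construction may differ.

```python
from typing import List, Sequence


def future_bag_of_words(tokens: Sequence[str], t: int, window: int,
                        vocab: Sequence[str]) -> List[int]:
    """Multi-hot vector marking which vocabulary items occur in tokens[t : t + window].

    The order of the future tokens is discarded, so the target is a compressed
    summary of the future rather than a sequence of exact tokens.
    """
    future = set(tokens[t:t + window])
    return [1 if item in future else 0 for item in vocab]


vocab = ["the", "cat", "sat", "on", "mat"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
target = future_bag_of_words(tokens, t=3, window=3, vocab=vocab)
# target == [1, 0, 0, 1, 1]: "on", "the", "mat" occur in the window
```

The target is still computed from raw text alone, which is the sense in which this step needs only modest preprocessing.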
#### Future Summary → Next-Latent.
This step moves the prediction target fully out of token space. Teoh et al. ([2026](https://arxiv.org/html/2606.28127#bib.bib9)) propose NextLat: a transformer trained to predict its own next latent state, improving planning performance and inference speed (up to 3.3×). Crucially, this step can still train on internet-scale text; supervision comes from the model’s own hidden states. However, the architecture fit weakens. Predicting continuous latent vectors rather than discrete tokens requires diffusion-style output heads or other adaptations, and the transformer’s inductive bias is no longer perfectly matched.
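The change of target from tokens to latents can be sketched as a regression loss over consecutive hidden states. This is a generic illustration, not NextLat's published objective: the linear predictor, the mean-squared-error loss, and the toy two-dimensional hidden states are all assumptions.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def linear_predict(weights: Matrix, h: Vector) -> Vector:
    """Hypothetical latent predictor: one linear map applied to the current hidden state."""
    return [sum(w * x for w, x in zip(row, h)) for row in weights]


def next_latent_loss(hidden_states: Sequence[Vector], weights: Matrix) -> float:
    """Mean squared error between predicted and actual next hidden state,
    averaged over positions and dimensions. No token labels are involved."""
    total, count = 0.0, 0
    for h_t, h_next in zip(hidden_states, hidden_states[1:]):
        pred = linear_predict(weights, h_t)
        total += sum((p - x) ** 2 for p, x in zip(pred, h_next))
        count += len(h_next)
    return total / count


# Toy hidden states that rotate by 90 degrees at every step.
hidden = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
rotate = [[0.0, -1.0], [1.0, 0.0]]   # matches the true latent dynamics
identity = [[1.0, 0.0], [0.0, 1.0]]  # predicts "nothing changes"

loss_good = next_latent_loss(hidden, rotate)
loss_bad = next_latent_loss(hidden, identity)
```

The supervision signal is the sequence of hidden states itself, so the corpus can remain plain text. The output, however, is a continuous vector rather than a softmax over a vocabulary, which is where the architecture fit begins to loosen.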
#### Next-Latent → JEPA.
This is where both practical advantages collapse simultaneously. [LeCun](https://arxiv.org/html/2606.28127#bib.bib2)’s JEPA predicts the latent state of a future observation conditioned on an external agent action. Training now requires paired (observation, action, next observation) sequences from instrumented environments, orders of magnitude scarcer than text ($\sim 10^{9}$ samples vs $\sim 10^{13}$ tokens). The right architecture is also an open question. Existing JEPA models (I-JEPA, V-JEPA 2) work around this by re-discretising inputs as image or video patches, effectively moving back toward the discrete-token end of the spectrum. Whether the transformer generalises to truly continuous action-conditioned dynamics remains unresolved.
### 3.2 The Data Question
The first three stations on the spectrum (NTP, MTP, and Future Summary) all train on internet text, either directly or with modest preprocessing. Even Next-Latent prediction can use internet-scale corpora: the supervision signal comes from the model’s own hidden states, not from external labels. This is a crucial observation: moving to latent-state prediction does not require giving up internet-scale data, only changing the prediction target.
The real data cliff is at the final step. JEPA requires paired (observation, action, next observation) sequences from instrumented environments: robotics rigs, driving simulators, game engines, or video with inferred agent actions. Such data is orders of magnitude scarcer than text ($\sim 10^{9}$ samples vs $\sim 10^{13}$ tokens). V-JEPA 2 (Bardes and others, [2025](https://arxiv.org/html/2606.28127#bib.bib18)) approximates this by treating each video frame transition as an implicit action. This is a workaround: true action-conditioned world models require environments that expose which action caused each transition.
The practical bottleneck therefore *shifts* along the spectrum: at the LLM end it is compute; at the JEPA end it is data collection and environment instrumentation. This asymmetry is underappreciated in debates that frame the gap as purely architectural.
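The difference between the two data regimes shows up already in the shape of a training example. The sketch below contrasts them; the `Transition` record and both helper functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Observation = Tuple[float, ...]


@dataclass(frozen=True)
class Transition:
    """One action-conditioned training example."""
    observation: Observation
    action: str
    next_observation: Observation


def text_examples(tokens: Sequence[str]) -> List[Tuple[Tuple[str, ...], str]]:
    """Self-supervised: every prefix is labelled by the token that follows it."""
    return [(tuple(tokens[:t]), tokens[t]) for t in range(1, len(tokens))]


def env_examples(observations: Sequence[Observation],
                 actions: Sequence[str]) -> List[Transition]:
    """Action-conditioned: one logged action is required per observed transition."""
    if len(actions) != len(observations) - 1:
        raise ValueError("need exactly one logged action per transition")
    return [Transition(o, a, o_next)
            for o, a, o_next in zip(observations, actions, observations[1:])]


text_data = text_examples(["a", "b", "c"])
env_data = env_examples([(0.0,), (1.0,), (2.0,)], ["right", "right"])
```

`text_examples` needs nothing beyond the raw sequence, whereas `env_examples` fails unless the environment has recorded an action for every transition. That extra argument is the instrumentation cost discussed above.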
### 3.3 The Architecture Question
The transformer’s success with text is not accidental; the architecture and the task were co-designed. Self-attention over discrete positional embeddings, parallelised teacher-forcing, and a softmax prediction head are all engineered for sequences of discrete tokens.
So far, the transformer has extended further than expected, but via a recurring workaround: re-discretisation. Vision Transformers (Dosovitskiy et al., [2021](https://arxiv.org/html/2606.28127#bib.bib17)) split images into fixed-size patches and treat each as a token. V-JEPA 2 does the same for video. In both cases the transformer processes a *discretised approximation* of a continuous input, not a truly continuous state. The architecture’s strength remains coupled to the discrete-token assumption.
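Re-discretisation is simple to state in code. The sketch below splits a small grid into non-overlapping square patches and flattens each into one "token" vector; the function name `patchify` and the row-major flattening order are illustrative choices, and real Vision Transformers additionally apply a learned projection and positional embeddings that are omitted here.

```python
from typing import List, Sequence


def patchify(image: Sequence[Sequence[float]], patch: int) -> List[List[float]]:
    """Split an H x W grid into non-overlapping patch x patch tiles.

    Each tile is flattened in row-major order and becomes one token vector,
    so a continuous 2D input turns into a finite sequence of discrete positions.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4 toy "image"
patch_tokens = patchify(image, patch=2)
# patch_tokens == [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

After this step the model sees a sequence of four positions, exactly as it would see four word tokens, which is the sense in which the input has been moved back toward the discrete end of the spectrum.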
At the far end of the spectrum, several alternatives have been proposed: state-space models (Mamba) for explicit continuous-state recurrence; diffusion transformers (DiT) for continuous output prediction; hierarchical architectures for multi-scale temporal planning. The hypothesis advanced here is that what is missing is an analogous *moment of crystallisation*: a single architecture as well-matched to continuous sequential prediction as the transformer is to discrete tokens. The field may be at a similar stage to NLP before 2017: multiple competing approaches, each solving part of the problem, waiting for the unifying abstraction.
The data and architecture gaps interact: a better architecture for continuous-state prediction could reduce the need for action-labelled data by learning more efficiently from self-supervised signals. Progress likely requires advances on both fronts simultaneously.
### 3.4 LeCun’s Critique Revisited
[LeCun](https://arxiv.org/html/2606.28127#bib.bib2)’s ([2022](https://arxiv.org/html/2606.28127#bib.bib2)) argument has two parts: (1) LLMs predict tokens, not world states, so they cannot plan; (2) JEPA is the right alternative paradigm. The spectrum view dissolves both.
#### On (1).
The “tokens, not states” claim conflates the *interface* with the *representation*. As Section [2.4](https://arxiv.org/html/2606.28127#S2.SS4) shows, OthelloGPT and chess models encode rich world-state structure linearly in the hidden activations, not in the tokens. The tokens are the action space $A = V$; the world model is in the hidden states. LeCun’s critique applies to the interface, not the internal representation.
#### On (2).
Chain-of-thought (CoT) and extended thinking (DeepSeek-R1, OpenAI o3) can be understood as LLMs moving along the spectrum’s step-granularity axis *without architectural change*. Generating intermediate reasoning tokens increases the effective planning horizon within the token-state world model. Dong et al. ([2025](https://arxiv.org/html/2606.28127#bib.bib11)) show that even without CoT, the hidden state already encodes a rough plan for the full response. JEPA is not a different paradigm; it is the far end of the same spectrum (Figure [6](https://arxiv.org/html/2606.28127#S3.F6)).
[Figure 6 diagram: a two-axis design space. Axes: state type (discrete to continuous) and planning horizon (single-step to multi-step). Region labels: Standard NTP, Long-horizon token planners, Latent one-step, Full world models. Plotted architectures: Standard LLMs, Chain of Thought, MTP, Future Summary, Next-Latent, LLM-JEPA, JEPA / V-JEPA 2.]
Figure 6: Architectures in the (state type × planning horizon) design space. The dashed diagonal traces the spectrum of Section 3. CoT sits directly above Standard LLMs; it increases planning horizon without changing the state type. The spectrum reveals a densely populated intermediate region that LeCun’s binary framing overlooks.
## 4 Discussion and Conclusion
The preceding sections have argued that LLMs are world models (constrained ones) and that world models are their generalisation. The practical consequence is that the question is not *whether* to abandon LLMs for world models, but *how far along the spectrum* a given task requires. Multi-step reasoning tasks may be well-served by MTP or CoT; long-horizon physical simulation requires the JEPA end. Each step along the spectrum is also a natural initialisation for the next, suggesting an incremental transition rather than a cold start.
The two open questions that define the research frontier are treated in Sections 3.2 and 3.3. The intermediate spectrum steps (MTP through Next-Latent) are particularly attractive because they largely preserve both LLM practical advantages, namely internet-scale self-supervised training and a well-matched transformer architecture, while improving world-model capacity. The true break comes at the JEPA step, where both advantages disappear simultaneously: training data becomes scarce and the right architecture is unknown. What may be needed is not just more data but a new architectural primitive for continuous-state prediction, analogous to what the transformer was for text.
LeCun is right that the ultimate goal (action-conditioned latent-space world models capable of closed-loop planning) is beyond current LLMs. But the path there is gradient ascent, not a discontinuous jump. The open questions are what data and what architecture; the destination is already clear.
This opinion paper was written with the assistance of a world model constrained to the natural-language special case.
## References
- A. Bardes et al. (2025). V-JEPA 2: self-supervised video models enable understanding, prediction, and planning. Meta AI Research. [Link](https://ai.meta.com/research/publications/v-jepa-2/)
- Z. Dong, Z. Zhou, Z. Liu, C. Yang, and C. Lu (2025). Emergent response planning in LLMs. arXiv:2502.06258. [Link](https://arxiv.org/abs/2502.06258)
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021). An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929. [Link](https://arxiv.org/abs/2010.11929)
- F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve (2024). Better & faster large language models via multi-token prediction. arXiv:2404.19737. [Link](https://arxiv.org/abs/2404.19737)
- W. Gurnee and M. Tegmark (2024). Language models represent space and time. arXiv:2310.02207. [Link](https://arxiv.org/abs/2310.02207)
- D. Ha and J. Schmidhuber (2018). World models. arXiv:1803.10122. [Link](https://arxiv.org/abs/1803.10122)
- A. Karvonen (2024). Emergent world models and latent variable estimation in chess-playing language models. arXiv:2403.15498. [Link](https://arxiv.org/abs/2403.15498)
- Y. LeCun (2022). A path towards autonomous machine intelligence. OpenReview preprint, Meta AI / Courant Institute, NYU. Version 2. [Link](https://openreview.net/forum?id=BZ5a1r-kVsf)
- K. Li, A. K. Hopkins, D. Bau, F. Viégas, H. Pfister, and M. Wattenberg (2024). Emergent world representations: exploring a sequence model trained on a synthetic task. arXiv:2210.13382. [Link](https://arxiv.org/abs/2210.13382)
- Y. Li, H. Wang, J. Qiu, Z. Yin, D. Zhang, C. Qian, Z. Li, P. Ma, G. Chen, and H. Ji (2026). From word to world: can large language models be implicit text-based world models? arXiv:2512.18832. [Link](https://arxiv.org/abs/2512.18832)
- D. Mahajan, S. Goyal, B. Y. Idrissi, M. Pezeshki, I. Mitliagkas, D. Lopez-Paz, and K. Ahuja (2026). Beyond multi-token prediction: pretraining LLMs with future summaries. arXiv:2510.14751. [Link](https://arxiv.org/abs/2510.14751)
- N. Nanda, A. Lee, and M. Wattenberg (2023). Emergent linear representations in world models of self-supervised sequence models. arXiv:2309.00941. [Link](https://arxiv.org/abs/2309.00941)
- J. Teoh, M. Tomar, K. Ahn, E. S. Hu, T. Pearce, P. Sharma, A. Krishnamurthy, R. Islam, A. Lamb, and J. Langford (2026). Next-latent prediction transformers learn compact world models. arXiv:2511.05963. [Link](https://arxiv.org/abs/2511.05963)
- Q. Zhong, H. Liao, H. Qin, M. Zhou, R. Mao, W. Chen, and N. Chao (2026). Toward consistent world models with multi-token prediction and latent semantic enhancement. arXiv:2604.06155. [Link](https://arxiv.org/abs/2604.06155)