@snowboat84: https://x.com/snowboat84/status/2064135804092645410

X AI KOLs Timeline 06/09/26, 12:01 AM News

world-models concept-overview yann-lecun feifei-li schmidhuber ai-history deep-learning openai-sora

Summary

This article systematically reviews the evolution of the world model concept from Craik's psychological metaphor in 1943 to the industry explosion in 2024-2026. It details the core ideas and representative works of symbolic AI and deep learning schools (Schmidhuber-Ha, Dreamer series, JEPA, video generation direction), and points out the current state of definition confusion and competition among various schools.

https://t.co/HqI4aCvywd

Cached at: 06/09/26, 12:48 PM

What is a World Model? A Concept Under Contestation

World models are extremely hot right now. Yann LeCun left Meta to found AMI Labs working on this direction, raising a $1.03 billion seed round in March 2026. Fei-Fei Li’s World Labs has accumulated $1.23 billion in funding, releasing its product Marble in late 2025. OpenAI earlier pushed the term “world simulator” into the mainstream with Sora (though Sora was shut down by OpenAI in March 2026). Many believe world models are the next AI direction to replace large language models (LLMs).

But at the same time, the definition, methods, and evaluation standards of the term “world model” are extremely chaotic. Open almost any survey paper from 2024 onward and it will acknowledge this: usage is confused, the literature is fragmented, and each faction defines the term in its own way. OpenAI, with Sora, says “we built a world simulator that generates realistic video.” Yann LeCun, founding AMI Labs, says “that path is fundamentally wrong; we need to predict causality in abstract space.” Fei-Fei Li, launching Marble, says “the world is 3D, you can’t stay in 2D video.” The reinforcement learning community, pointing to DreamerV3, says “who cares, if it’s useful for planning, it’s a good world model.” Same word, but each camp has its own definition and its own evaluation criteria.

This article aims to clarify three things for you: what exactly a world model is, how many schools and definitions exist under the term, and what the methods and representative works of each school are. After reading, the next time you see the term “world model”, you can at least ask: which school are you talking about?

1. Conceptual Origins: From Psychological Metaphor to Neural Network

World model is not a new term that popped up out of nowhere in 2024. It has a conceptual history of over 80 years, spanning three stages: psychology, symbolic AI, and reinforcement learning. Each stage added a new layer of meaning to the term.

1.1 The 1943 Conceptual Prototype: Craik’s “Small-Scale Model”

First, a clarification: Craik did not directly propose the term “world model” in 1943. The phrase he used was “small-scale model of external reality”. This is the earliest prototype of the “world model” concept, but the term itself came later.

In his 1943 book The Nature of Explanation, British psychologist Kenneth Craik wrote a passage that is frequently cited today: If an organism carries a “small-scale model” of external reality and its own possible actions within its head, it can try out various alternatives, decide which is best, react to future situations before they happen, and use past experience to handle the present and future.

Craik was a genius who died young (he died in a cycling accident in 1945 at age 31), but his metaphor of the brain having a “small-scale model of external reality” laid the foundational hypothesis for later cognitive science. The theoretical system of “mental models” in cognitive psychology during the 1970s-1980s basically started from Craik.

[Image: Kenneth Craik]

When Craik proposed this concept, the term “AI” didn’t exist yet (John McCarthy coined “artificial intelligence” in the 1955 proposal for the 1956 Dartmouth workshop), nor did the term “world model”. But Craik gave AI an initial goal: to let machines possess internal models similar to the human brain, capable of predicting the consequences of actions. Today, many papers about world models cite Craik 1943 in the introduction as the conceptual origin.

1.2 The 1960s Symbolic AI Era: Blocks World and SHRDLU

AI’s real attempt to implement a similar “internal world representation” in machines began in the 1960s (the field was then called “knowledge representation”; the term “world model” would not appear until Schmidhuber 1990). The most representative example is Terry Winograd’s SHRDLU system at MIT in 1972, which could understand natural language commands within a strictly limited “blocks world”.

SHRDLU’s world contained only a few types of blocks (red cubes, green triangles, blue pyramids, etc.) and a tabletop. Users conversed with the system in English: “Put the large red block on top of the green triangle.” SHRDLU could parse the command, plan the actions, execute the operation, and even answer questions about what it had just done. At the time, this was an astonishing achievement, giving the AI community over a decade of optimism that “general intelligence is just around the corner.”

[Image: SHRDLU System]

But SHRDLU’s internal world representation was a completely hand-coded symbolic system. Every object’s attributes, every action’s preconditions and results, were written line by line by programmers. This method worked in the blocks world but was almost impossible to scale to the real world. The types of objects, relationships, and physical rules in the real world are too numerous to hand-code.

The “AI winter” from the late 1980s to the early 1990s was largely due to the dead end faced by symbolic AI. The research path of “giving machines internal world representations” entered a long lull as well, only returning to the mainstream under the new name “world model” with the later resurgence of neural networks.

1.3 Deep Learning Enters: Schmidhuber 1990 + Ha 2018

Jürgen Schmidhuber’s 1990 paper Making the World Differentiable marks the starting point for the world model lineage in machine learning. The paper proposed learning a dynamics model of the environment using a neural network, allowing the Agent to “imagine” the future within this learned model.

A dynamics model sounds abstract, but it’s essentially a function: given the current state of an Agent plus the action it wants to take, predict what the next state will become. (In reinforcement learning, an Agent is a program that can act and make decisions in an environment, e.g., a game AI, a robot control system, or an autonomous driving system.) The term “dynamics” originally came from physics, was abstracted into “dynamical systems theory” by the mathematician Poincaré, was then borrowed by cybernetics and optimal control, and finally reached reinforcement learning. Today, its form is not limited to physical equations; neural network approximations also count.
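
To make the “a dynamics model is just a function” idea concrete, here is a minimal sketch. Everything in it is an illustrative assumption rather than anything from Schmidhuber’s paper: the toy one-dimensional environment (`true_step`), the helper name `fit_linear_dynamics`, and the choice of a linear least-squares fit in place of a neural network. The agent never sees the true rule; it only sees logged (state, action, next state) triples and fits a function that predicts the next state.

```python
import random

def true_step(state: float, action: float) -> float:
    """Hidden environment rule (toy assumption); the agent only observes its outputs."""
    return 0.9 * state + 0.5 * action

def fit_linear_dynamics(transitions):
    """Least-squares fit of next_state ~ a*state + b*action from logged transitions."""
    # Normal equations for the two unknowns (a, b), solved with Cramer's rule.
    sxx = sum(s * s for s, u, _ in transitions)
    sxu = sum(s * u for s, u, _ in transitions)
    suu = sum(u * u for s, u, _ in transitions)
    sxy = sum(s * y for s, u, y in transitions)
    suy = sum(u * y for s, u, y in transitions)
    det = sxx * suu - sxu * sxu
    a = (sxy * suu - suy * sxu) / det
    b = (sxx * suy - sxu * sxy) / det
    return lambda state, action: a * state + b * action

rng = random.Random(0)
data = []
for _ in range(200):
    s, u = rng.uniform(-1, 1), rng.uniform(-1, 1)
    data.append((s, u, true_step(s, u)))      # experience gathered by acting randomly

learned_step = fit_linear_dynamics(data)       # the learned "dynamics model"
predicted = learned_step(1.0, 1.0)             # imagine one step without touching the env
```

Once `learned_step` exists, the agent can call it as often as it likes to “imagine” consequences, which is the core of the idea; a neural network would simply replace the linear fit.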

[Image: Jürgen Schmidhuber]

Schmidhuber’s line of thought couldn’t be scaled up in the 1990s due to insufficient algorithms and computing power. It wasn’t until 2018, when he collaborated with David Ha (then at Google Brain, now CEO of Sakana AI) on the paper World Models, that the first complete deep learning implementation of a world model was provided.

Ha & Schmidhuber 2018’s approach was elegant. They compressed visual input into low-dimensional vectors using a VAE (Variational Autoencoder), then used an RNN (Recurrent Neural Network) to predict the next vector. The Agent (a small policy network) acted on these compressed representations and could be trained entirely within this compressed “dream”. They tested the approach in the CarRacing-v0 and VizDoom environments; in the VizDoom experiment, an Agent trained purely by “dreaming” inside the model still performed well when transferred back to the real environment.
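
The “compress, predict, then train in the dream” pipeline can be sketched in a few lines. This is a toy under loud assumptions, not the paper’s implementation: `encode` is a mean over fake pixels standing in for the VAE, `dream_step` is a fixed formula standing in for the trained RNN, the controller is a single gain, and plain random search stands in for whatever optimizer one would really use. The real environment is deliberately given slightly different dynamics to mimic model error.

```python
import random

def encode(observation):
    """Stand-in for the VAE encoder: compress many 'pixels' into one latent number."""
    return sum(observation) / len(observation)

def dream_step(z, action):
    """Stand-in for the RNN: predict the next latent from the current latent and action."""
    return z + 0.1 * action

def real_step(position, action):
    """The real environment, slightly different from the dream (model error)."""
    return position + 0.12 * action

def render(position, rng):
    """The real environment emits noisy, higher-dimensional observations."""
    return [position + rng.gauss(0, 0.01) for _ in range(16)]

def dream_return(gain, z0=1.0, horizon=20):
    """Roll the controller out inside the dream only; cost is distance from target z = 0."""
    z, cost = z0, 0.0
    for _ in range(horizon):
        z = dream_step(z, -gain * z)
        cost += abs(z)
    return -cost

rng = random.Random(0)
# Train the controller purely in the dream: random search over one gain parameter.
best_gain = max((rng.uniform(0.0, 10.0) for _ in range(200)), key=dream_return)

# Transfer: run the dream-trained controller in the real environment.
position = 1.0
for _ in range(20):
    z = encode(render(position, rng))
    position = real_step(position, -best_gain * z)
```

The point of the sketch is the division of labor: the controller never touches `real_step` during training, yet it still drives the real position close to the target because the dream is a good enough approximation.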

This 2018 paper is the tipping point for contemporary world model research. All subsequent major world model lines, whether the Dreamer series, JEPA, Sora, or World Labs, have some connection to this Schmidhuber-Ha lineage. Schmidhuber himself has since publicly argued multiple times that “world models are the key to AGI”, a stance that overlaps with LeCun’s JEPA line.

1.4 The Lull and Explosion: 2018 to 2024

Although the 2018 paper was heavily cited within the RL community, the world model path didn’t truly take off at the time. The entire AI community’s attention for several years was diverted to a completely different direction.

2018 was exactly the year the LLM line began its explosion. In June of that year, OpenAI released GPT-1, and in October, Google released BERT. The capabilities of the Transformer architecture (2017) began to be validated by scaling up. In the following years, almost all resources in AI went to LLMs: GPT-2 (2019), GPT-3 (2020), PaLM (2022), and ChatGPT (November 2022) pushed AI to the center of public discourse. World models were pushed to the periphery. Hafner’s DreamerV1 (2019) and DreamerV2 (2020) continued the line within the small RL circle but went largely unnoticed.

Things started to change after GPT-4 was released in 2023. The marginal returns of LLM scaling began to diminish, and fundamental limitations were exposed: hallucinations, inability to reason, inability to interact with the physical world, inability to continuously learn. Starting in 2022, Yann LeCun publicly argued repeatedly on Twitter, in speeches, and on the Lex Fridman podcast that “LLMs are a dead end,” advocating for building world models. Around the same time, Hafner at Google DeepMind produced DreamerV3 (arXiv Jan 2023), which for the first time used a single set of hyperparameters across 150+ tasks.

The real explosion of the world model concept was in 2024. In February, OpenAI released Sora, publicly positioning it as a “world simulator,” pushing the concept of world models into public view for the first time. In the same month, Meta released V-JEPA, the first large-scale implementation of LeCun’s line. Google DeepMind released Genie 1, and Fei-Fei Li founded World Labs. Four major events in one year turned “world model” from an academic term into an industry focus.

Capital and narratives continued to amplify in 2025-2026. NVIDIA Cosmos at CES in January 2025 brought the hardware giant into the field. Meta’s IntPhys 2 (June 2025) systematized a physics benchmark. DreamerV3 officially landed in Nature in 2025. In November 2025, Yann LeCun left Meta to found AMI Labs (raising a $1.03B seed round in March 2026). Fei-Fei Li’s World Labs released Marble concurrently (raising $1B in February 2026). Sora 2 (Oct 2025) was shut down 5 months later, exposing the unit economics problem of the video generation faction.

So, what happened between 2018 and 2024 can be summarized in one sentence: After LLM scaling hit a ceiling, the AI community needed a new narrative, and the 30+ year old concept of ‘world model’ was ‘rediscovered’. Craik 1943 → Schmidhuber 1990 → Ha 2018 → 6 years of quiet → 2024 explosion. This curve is anything but smooth. The two launches in 1990 and 2018 didn’t take off; it took the LLM hitting a ceiling for this old concept to be unearthed.

2. World Models for Action

World models today are divided into two major camps. The first camp treats world models as “internal tools for the Agent.” The model is primarily used by the Agent itself, operating in abstract or latent space. Its output is an internal signal for the Agent to anticipate consequences, not visuals for humans.

This camp is further divided into two schools: the Decision-Planning School (model-based reinforcement learning) and the Abstract Reasoning School (JEPA / Joint Embedding Predictive Architecture). Their commonality is positioning the world model as a “tool for decision-making.” Their difference lies in the degree of abstraction and specific architecture.

2.1 The Decision-Planning School: DreamerV3 and Model-Based RL

The Decision-Planning School is the oldest and most operational definition of world models. Its proposition: first learn a dynamics model of the environment, then let the Agent “imagine” future trajectories and try things out within this model before deciding on real actions. The essence is to move expensive real-world interaction into simulated mental rehearsal, improving sample efficiency.

Technically: sensory input (images, state vectors) is encoded into a compact latent state, which is then used to predict the next latent state and reward. The policy is trained entirely on these “imagined” sequences. This overall architecture was initially provided by Ha & Schmidhuber 2018, and later systematized by Danijar Hafner at Google DeepMind into the Dreamer series (DreamerV1, V2, V3).
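
The “try things out in imagination before acting” loop can be sketched as follows. Note the swap: Dreamer trains an actor-critic policy on imagined trajectories, whereas this sketch uses a simpler stand-in, an exhaustive search over short action sequences. The model `latent_model`, the target value 3.0, and the three-action set are all hypothetical.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)

def latent_model(z, action):
    """Hypothetical learned model: returns (next latent state, predicted reward)."""
    z_next = z + action
    reward = -abs(z_next - 3.0)        # predicted reward peaks when the latent reaches 3.0
    return z_next, reward

def imagined_return(z, actions):
    """Sum of predicted rewards along one imagined trajectory (no real interaction)."""
    total = 0.0
    for a in actions:
        z, r = latent_model(z, a)
        total += r
    return total

def plan(z, horizon=5):
    """Score every short action sequence in imagination; return the best first action."""
    best_seq = max(product(ACTIONS, repeat=horizon),
                   key=lambda seq: imagined_return(z, seq))
    return best_seq[0]

first_action = plan(0.0)
```

Only `first_action` would be executed in the real environment; everything else happened inside the model, which is where the sample-efficiency gain comes from.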

The current benchmark of this line is DreamerV3. The paper Mastering Diverse Domains through World Models by Hafner, Pasukonis, Ba, and Lillicrap was posted on arXiv in January 2023 and officially published in Nature in 2025. DreamerV3’s most impressive feat is using a single set of hyperparameters to master over 150 different tasks (from Atari games to robot control to Minecraft), completely eliminating the need for manual tuning for each new domain. This was the long-sought “general algorithm” goal in model-based reinforcement learning, and DreamerV3 is the first to achieve it.

DreamerV3 also has a landmark achievement: it was the first reinforcement learning algorithm to mine a diamond in Minecraft (one of the hardest core tasks in the game) without human demonstrations. This result is frequently cited in the RL community because previous methods all required human gameplay videos as demonstrations to learn this long-horizon task.

DreamerV3’s goal and evaluation metrics are also highly operational. This school doesn’t care if the model can “reproduce real physics”; it only cares whether the actions planned by the Agent within the model can complete the task when executed in the real environment. Planning success rate, task completion rate, and sample efficiency are the hard metrics of this school. It’s good enough for planning; it doesn’t require the model’s generated images to be beautiful or physically strictly correct.

This “good-enough-ism” is a characteristic and also a limitation of this school. The Dreamer series works well in structured environments (games, simulators) but still struggles with open worlds, long-horizon tasks, and real physics. Whether starting from the decision-planning school can lead to a general world model is an open question.

2.2 The Abstract Reasoning School: JEPA and LeCun’s AMI Labs

The Abstract Reasoning School is the other inward-looking line, represented by the Joint Embedding Predictive Architecture (JEPA) that Yann LeCun has long promoted.

JEPA’s proposition: predict “how the state will evolve” in an abstract representation space, bypassing pixel-level image generation. Specifically, it encodes the input, performs masked prediction in the representation space, and completely avoids pixel reconstruction.
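
A minimal sketch of “predict in representation space, never reconstruct pixels” might look like this. The encoder and predictor here are hand-written toy functions (assumptions for illustration); in a real JEPA both are learned networks. What the sketch does preserve is where the loss lives: it compares a predicted embedding with a target embedding, and no pixel is ever generated.

```python
def encode(patch):
    """Toy encoder: map a patch (list of pixel values) to a 2-number embedding (mean, range)."""
    return [sum(patch) / len(patch), max(patch) - min(patch)]

def predict(context_embeddings):
    """Toy predictor: guess the masked patch's embedding as the average of the visible ones."""
    n = len(context_embeddings)
    return [sum(e[i] for e in context_embeddings) / n for i in range(2)]

def jepa_loss(patches, masked_index):
    """Squared error between the predicted and actual embedding of the masked patch."""
    context = [encode(p) for i, p in enumerate(patches) if i != masked_index]
    target = encode(patches[masked_index])     # the target is an embedding, not pixels
    guess = predict(context)
    return sum((g - t) ** 2 for g, t in zip(guess, target))

patches = [[0.1, 0.3], [0.2, 0.4], [0.3, 0.5]]   # a smooth brightness ramp
loss = jepa_loss(patches, masked_index=1)
```

On this smooth ramp the masked patch’s embedding is fully predictable from its neighbors, so the loss is essentially zero even though the model never tried to draw the missing pixels.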

LeCun repeatedly emphasizes this as a fundamental divergence from the generative approach. In his view, understanding the world by generating realistic pixels is a dead end. Models waste vast amounts of computation on visual details irrelevant to the task (how leaves rustle, how light reflects), missing the truly important causal structure. JEPA emphasizes ignoring irrelevant details and only capturing the key causal information that influences the evolution of subsequent states.

Under LeCun’s direction, Meta developed two generations of the visual JEPA: V-JEPA and V-JEPA 2. This line has performed more stably than pixel-generative models on intuitive physics benchmarks (like IntPhys 2, discussed in Chapter 4 below).

However, LeCun’s collaboration with Meta reached its end in 2025. On November 19, 2025, he officially announced his departure from Meta to found Advanced Machine Intelligence Labs (AMI Labs) together with Alex LeBrun, dedicated to “physical world models.” In March 2026, AMI Labs closed a $1.03 billion seed round at a pre-money valuation of $3.5 billion, co-led by Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions.

LeCun’s public statement upon leaving Meta was direct: he left because he didn’t want to be constrained by product timelines and wanted to focus on research leading to “human-level AI.” AMI Labs’ bet is clear: the JEPA line is the correct solution for world models, and the generative video line (Sora, Cosmos, etc.) is a dead end. This billion-dollar seed round is the first time the JEPA line has had sufficient capital to operate independently of Meta.

3. World Models for Presentation

The second major camp of world models seeks to create “worlds for humans to see.” Here the model’s finished product is itself a presentation of the world, such as a viewable video or an enterable 3D space, rather than an intermediate variable for internal decision-making.

This camp is also divided into two schools: the Video Generation School (with 2D video as the medium) and the Spatial Intelligence School (with 3D as the medium). Commercially, these two schools have burned the most money in the last year or two; capital clearly has more confidence in “world models for presentation.”

3.1 The Video Generation School: Sora, Cosmos, Genie

The Video Generation School’s proposition: train a model to generate videos that are visually realistic and appear physically plausible, treating the “video model capable of generating coherent worlds” itself as a kind of world simulator.

Technically, diffusion models and autoregressive models are dominant, with training objectives focused on predicting pixels or frames. The model learns the appearance and dynamic rules of the world from massive amounts of video (YouTube, movies, game recordings). This path produced several important works in 2024-2025.

OpenAI’s Sora was announced in February 2024 and officially released in December 2024. Sora uses a diffusion transformer (DiT) architecture, similar in origin to Stable Diffusion and DALL-E 3 but scaled up to video. It can generate up to 60 seconds of high-definition video. Physically imperfect, but visually impactful. At launch, OpenAI’s CTO Mira Murati publicly positioned Sora as a “world simulator,” a key step in OpenAI’s roadmap to AGI. Industry estimates of Sora’s training cost exceed $100 million, trained on millions of hours of video.

But Sora’s commercial story took a sharp turn in 2026. In October 2025, OpenAI released Sora 2 along with a social app. User numbers peaked at around 1 million but quickly dropped to under 500,000. The app burned approximately $1 million per day (high GPU cost for video generation), with a maximum monthly revenue of only about $540,000. On March 24, 2026, OpenAI announced the shutdown of Sora. The web and app experience officially closed on April 26, 2026, and the API will also close on September 24, 2026. OpenAI’s $1 billion partnership with Disney also died; Disney was reportedly notified less than an hour before the shutdown announcement. Sora’s closure was a significant blow to the “video generation as world model” path. It demonstrated that the unit economics of consumer video generation are not viable with current GPU costs, and within OpenAI’s resource allocation, Sora was outcompeted by enterprise-grade, high-margin products like Codex and GPT-5.5. This serves as a cautionary tale for the Cosmos and Genie paths, but NVIDIA and Google’s market positioning differs from OpenAI’s (they don’t directly target consumer video), so they are still running.
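
The unit-economics gap in these figures is easy to work out. The sketch below uses only the two numbers quoted above; the 30-day month is an assumption made for a round estimate.

```python
daily_burn_usd = 1_000_000            # "approximately $1 million per day" (from the text)
peak_monthly_revenue_usd = 540_000    # "about $540,000" at its best month (from the text)
days_per_month = 30                   # assumption, for a round-number estimate

monthly_burn_usd = daily_burn_usd * days_per_month
coverage = peak_monthly_revenue_usd / monthly_burn_usd       # share of cost paid by revenue
monthly_shortfall_usd = monthly_burn_usd - peak_monthly_revenue_usd
```

Under these assumptions, the best month of revenue covered under 2% of a month of compute spend, leaving a shortfall of roughly $29 million per month.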

NVIDIA Cosmos, announced at CES in January 2025, is positioned as “World Foundation Models for Physical AI.” Cosmos is designed to generate training data for robotics and autonomous driving, not for consumer video generation. NVIDIA’s strategy is to integrate Cosmos with Omniverse (a 3D simulation platform) to create an end-to-end “synthetic data plus physics engine” infrastructure. This is an important step for NVIDIA’s transition from a GPU seller to an AI model provider.

Google DeepMind’s Genie series takes a different approach. Genie 1 (February 2024) can generate an interactive 2D game environment from a single image. Genie 2 (December 2024) extends this to 3D environments, generating the next frame based on keyboard and mouse input, allowing users to “play” a non-existent game. The Genie series fuses “world simulator” with “playable game,” technically combining video generation with real-time interaction.
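
The “generate the next frame from keyboard input” loop has a simple shape, sketched below. The function `next_frame` is a hypothetical stand-in for the generative model: here it just moves a single “player” pixel along a one-dimensional strip, whereas a real system would synthesize a full image. The part worth noticing is the feedback: each generated frame becomes the input for the next step.

```python
def next_frame(frame, key):
    """Hypothetical stand-in for the generative model: shift the 'player' pixel by the key."""
    width = len(frame)
    pos = frame.index(1)
    move = {"left": -1, "right": 1}.get(key, 0)
    new_pos = min(max(pos + move, 0), width - 1)   # stay inside the strip
    return [1 if i == new_pos else 0 for i in range(width)]

def play(first_frame, keys):
    """Autoregressive rollout: every output frame is fed back in as the next input."""
    frames = [first_frame]
    for key in keys:
        frames.append(next_frame(frames[-1], key))
    return frames

frames = play([0, 0, 1, 0, 0], ["right", "right", "right", "left"])
```

Because the model only ever sees its own previous output, any error it makes is carried forward, which is one reason long interactive rollouts are hard to keep consistent.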

There are also other video/game world models being released, such as Veo (Google), Wan (Alibaba), Kling (Kuaishou), and Sora 2 (OpenAI 2025). The evaluation metrics for this school are visual quality and physical plausibility, which have also been the most controversial in the past year or so (see Chapter 4 below).

3.2 The Spatial Intelligence School: World Labs and Marble

The Spatial Intelligence School’s proposition: The essence of the world is three-dimensional. AI should directly perceive, generate, and interact within 3D scenes, rather than being confined to 2D video.

The representative of this line is World Labs, founded in 2024 by Fei-Fei Li (former director of Stanford AI Lab, mother of ImageNet). Li uses the term “spatial intelligence” to define this route, arguing that Agents must not only recognize pixels but also understand spatial relationships and be able to reason and act in 3D space.

[Image: World Labs]

World Labs released its first commercial product, Marble, in November 2025. The key difference between Marble and video generators like Sora is that it produces persistent 3D environments that can be downloaded, edited, and freely navigated within. Users input text, images, or video, and Marble generates a complete 3D world that can be exported in three formats: Gaussian splats, mesh, or video.

Marble’s underlying technology, Gaussian splatting, is a 3D representation method proposed by an Inria-led team in 2023. It enables high-quality photorealistic rendering at real-time speeds, which earlier neural radiance field (NeRF) methods struggled to reach. This technical choice gives Marble a good balance between quality and usability.
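
For readers unfamiliar with splatting, the core rendering step is ordinary front-to-back alpha blending of many small semi-transparent blobs. The sketch below shows that blending for a single pixel. It is a simplification under stated assumptions: the `Splat` class is hypothetical, each splat’s opacity at the pixel is given directly, and the projection of 3D Gaussians onto the screen is left out entirely. It says nothing about how Marble itself is implemented.

```python
from dataclasses import dataclass

@dataclass
class Splat:
    depth: float      # distance from the camera along the viewing ray
    color: tuple      # (r, g, b), each in [0, 1]
    alpha: float      # opacity contribution at this pixel, in [0, 1]

def composite(splats):
    """Front-to-back alpha blending of the splats that cover one pixel."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0                        # fraction of light not yet blocked
    for s in sorted(splats, key=lambda s: s.depth):
        weight = s.alpha * transmittance
        pixel = [p + weight * c for p, c in zip(pixel, s.color)]
        transmittance *= 1.0 - s.alpha
    return pixel

pixel = composite([
    Splat(depth=2.0, color=(0.0, 0.0, 1.0), alpha=1.0),   # opaque blue, farther away
    Splat(depth=1.0, color=(1.0, 0.0, 0.0), alpha=0.5),   # half-transparent red, in front
])
```

The half-transparent red splat in front lets half of the blue behind it through, so the pixel comes out as an even red-blue mix.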

Commercially, World Labs has moved quickly. After its founding in 2024 with a $230 million seed round, it completed a $1 billion Series B in February 2026, bringing total funding to $1.23 billion. Investors include NVIDIA, AMD, Fidelity Management, Autodesk, Emerson Collective, and Sea. This scale puts it in the top tier of AI startups, an interesting counterpart to AMI Labs’ $1 billion seed round.

Li’s judgment differs from LeCun’s. She believes that both the video generation route (Sora et al.) and the decision-planning route (Dreamer et al.) only capture one aspect of the world. A true world model must be 3D, interactive, and persistent. Marble is the product embodiment of this judgment.

3.3 Summary: Why the Four Schools are Hard to Compare Directly

Putting the two groups together reveals that the single term “world model” actually covers four different definitions and four different evaluation standards:

Decision-Planning School (DreamerV3): Planning success rate, task completion rate
Abstract Reasoning School (JEPA / AMI Labs): Causal consistency, downstream task performance
Video Generation School (Sora, Cosmos, Genie): Visual quality, physical plausibility
Spatial Intelligence School (World Labs Marble): Fidelity of 3D reconstruction/generation, interactivity

The optimization objectives of the four schools are fundamentally different. The two “inward” schools pursue “good enough for decision-making,” while the two “outward” schools pursue “realistic enough” or “interactive enough.” Rankings vary even within their own evaluation standards. Comparing Dreamer to Sora is like comparing apples to oranges; comparing V-JEPA to Marble is similarly futile.

This is why the discussion about “world models” in 2024-2026 has been so chaotic. People say the same word but refer to fundamentally different things. When reading any article about “world models,” the first thing to do is to identify the author’s school. Judging criteria differ across schools, and scores are not comparable across them.

It must be acknowledged that these 4 schools are only the most mainstream classification, covering the major players (LeCun, Li, OpenAI, Google DeepMind). Strictly speaking, there are finer subdivisions worth noting: the Wayve GAIA series (GAIA-3, with 15 billion parameters, was released in December 2025) and Tesla are building world models for autonomous driving; Google DeepMind’s SIMA and Physical Intelligence’s π0 series target robot embodiment; Yoshua Bengio advocates a Bayesian probabilistic route (another way of “predicting in abstract space,” different from JEPA); and projects like Genesis are building hybrid routes that combine physics engines with generative models. Two academic survey papers from 2025 also use different classification axes than the 4 schools used here. This article focuses on the 4 most mainstream schools to establish a basic landscape; interested readers can investigate the finer directions on their own.

4. Core Debate: Does the Model “Understand” the World, or Just Learn Surface Correlations?

The four schools have different optimization objectives, but there is one debate they all share: do these models capture genuine world dynamics and causality, or just superficial statistical correlations?

This question isn’t just philosophical; it can be empirically tested. Researchers have developed a set of “intuitive physics benchmarks” specifically designed to probe the model’s understanding of physical common sense. These benchmarks originate from the “violation of expectation” paradigm in infant psychology: showing a subject a video that “violates physics” to see if they can detect that “this shouldn’t happen.”

Meta’s IntPhys 2, released in June 2025, is the latest representative of this line. Its authors (Florian Bordes, Quentin Garrido, Justine Kao, Adina Williams, Michael Rabbat, and Emmanuel Dupoux) designed a video benchmark based on “violation of expectation” to test models’ understanding of four physical principles:

Permanence: Objects do not disappear without reason.
Immutability: An object’s color, shape, etc., do not change without reason.
Spatio-Temporal Continuity: Objects only move along continuous trajectories.
Solidity: Solid objects cannot pass through each other.

These four principles are considered in infant psychology to be “core knowledge” that humans begin to establish within months of birth. IntPhys 2 turns them into a series of video comparisons of “possible events vs. impossible events,” asking the model to predict which one “violates physics.”

The results were telling: at the time, all mainstream video models (including V-JEPA 2, Cosmos, and some closed-source models) performed near random chance (around 50%) on IntPhys 2, while humans scored nearly perfectly. Models could generate visually realistic videos but had a very weak understanding of the most basic physical common sense. This was quite a serious conclusion, as it directly challenged the industry’s common assumption that “generative video models understand physics.”
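
Why “around 50%” means “no understanding” becomes clear once the scoring is written down. The sketch below assumes a simple pairwise rule: the model assigns a surprise score to each clip, and a pair counts as correct when the impossible clip gets the higher score. This rule and the two simulated models are assumptions for illustration; the benchmark’s exact protocol may differ.

```python
import random

def pairwise_accuracy(pairs):
    """pairs: list of (surprise_possible, surprise_impossible) scores from a model.
    A pair is correct when the impossible clip is rated as more surprising."""
    correct = sum(1 for possible, impossible in pairs if impossible > possible)
    return correct / len(pairs)

rng = random.Random(0)
# A model whose surprise score is unrelated to physical plausibility.
guessing_model = [(rng.random(), rng.random()) for _ in range(10_000)]
# A model that reliably finds impossible events more surprising.
sensitive_model = [(rng.random(), 1.0 + rng.random()) for _ in range(10_000)]

chance_level = pairwise_accuracy(guessing_model)
ideal_level = pairwise_accuracy(sensitive_model)
```

A model with no physical sensitivity lands near one half by construction, so a reported score close to 50% is indistinguishable from guessing, no matter how realistic the model’s generated videos look.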

Similar benchmarks like Physics-IQ (DeepMind 2025) and VideoPhy (UCLA 2024) have reached similar conclusions: generated videos frequently violate basic physics rules, and what the models capture may be closer to “statistically resembling physics” than to “truly understanding physics.”

However, there are counterexamples. Some empirical studies show that abstract/latent space predictive models (like the V-JEPA line) are more stable on intuitive physics benchmarks than pure pixel-generative models, because the former forces the model to learn at the “state evolution” level, filtering out purely visual “decorative details.” Other studies have observed that as model scale increases, physical understanding improves somewhat. So whether “physical understanding will eventually emerge” remains undecided.

To state the crux objectively: on the two axes of correlation vs. causation and surface realism vs. true dynamics, there is no consensus in the world model field. LeCun’s camp treats benchmarks like IntPhys 2 as a “death sentence for the generative route,” while organizations like OpenAI once saw the same results as a “current technical limitation that scaling can solve.” However, after OpenAI shut down Sora (consumer video) in March 2026, the practical credibility of the “scaling can solve it” position has weakened significantly.

5. Methodology: No “Selected” Architecture Like Transformer Has Emerged Yet

Chapters 2 and 3 described how each of the four schools works. From a methodological perspective, all these architectures are still engineering hodgepodges; there is no repeatedly validated, industry-wide “benchmark architecture.” This is the area where world models are most immature compared to LLMs.

5.1 The Transformer Analogy: From Contention to Decoder-Only Dominance

The original Transformer proposed in 2017’s Attention is All You Need was a complete Encoder-Decoder architecture designed for machine translation: the Encoder reads the source language for understanding, and the Decoder generates the target language. But this complete architecture is not the mainstream form of today’s LLMs.

In 2018, Google’s BERT (Devlin et al.) took the Encoder part of the Transformer, added the masked language modeling task (randomly masking 15% of words for the model to predict), and created the first powerful bidirectional language understanding model. BERT swept NLP benchmarks like GLUE and SQuAD, becoming the hottest architecture in the NLP community from 2018-2020. A host of BERT variants emerged simultaneously: RoBERTa, ALBERT, ELECTRA, XLNet, etc.
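
The masking task is simple to sketch. The version below hides roughly 15% of tokens and records the answers the model would be trained to recover. It is a simplified illustration: the function name and example sentence are made up, and the additional detail in BERT’s recipe (some selected tokens are swapped for random words or left unchanged instead of being masked) is omitted.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; return the masked sequence and the answers."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}   # what the model must predict
    return masked, targets

sentence = ("the agent encodes the current frame then predicts the next latent "
            "state and the reward it expects to receive soon")
tokens = sentence.split()                          # 20 tokens
masked, targets = mask_tokens(tokens)
```

Because the model sees words on both sides of each `[MASK]`, it can use left and right context at once, which is what “bidirectional” refers to.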

In the same year, OpenAI’s GPT-1 took the Decoder part of the Transformer for unidirectional language generation. GPT-1’s reception was far less impressive than BERT’s. But with GPT-2 (2019), GPT-3 (2020), and ChatGPT (2022) scaling up, the Decoder-only line began to overtake: the same generative architecture could be used for both understanding and generation, and in-context learning made fine-tuning unnecessary.
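
The difference between “unidirectional” and “bidirectional” comes down to which positions each token is allowed to look at. A minimal sketch, with `attention_mask` as an illustrative helper name:

```python
def attention_mask(n_tokens, causal):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[(j <= i) if causal else True for j in range(n_tokens)]
            for i in range(n_tokens)]

decoder_mask = attention_mask(4, causal=True)    # GPT-style: only look backwards
encoder_mask = attention_mask(4, causal=False)   # BERT-style: look in both directions
```

With the causal mask, the first token sees only itself and the last token sees everything before it, which is what allows a Decoder-only model to generate text one token at a time.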

Today, almost all mainstream LLMs are Decoder-only: GPT series, Claude, Gemini, Llama, Qwen, DeepSeek, etc. BERT-style Encoder-only models have been relegated to niche tasks like embeddings and classification (Sentence-BERT, E5, etc.). The full Encoder-Decoder architecture (like T5) is rarely used. There were also many intermediate variants (XLNet, ELECTRA, ALBERT, Reformer, Performer, Longformer, etc.), most of which are no longer in use today. Attempts to fundamentally replace the Transformer with state space models (SSMs) such as Mamba (2023-2024) have so far not displaced it.

This is the typical path of a “selected” architecture: fierce contention (Encoder vs Decoder vs full version vs various variants) + massive replication and validation + survival of the fittest + eventual convergence to Decoder-only. The Transformer wasn’t the “correct architecture” crowned by the 2017 paper alone; it was the one that won out among a host of candidates over 5 years. During that time, dozens of Transformer variants were eliminated, and several fundamentally different architectural directions (like Mamba) proved inferior.

5.2 World Models Are Still in the “Patchwork” Stage

Looking back at the main architectures of world models, each is a combination of components from different traditions:

DreamerV3 = VAE (Variational Autoencoder) + RNN + Actor-Critic RL + Imagined Rollouts. Each component comes from a different era and field.
JEPA = Encoder + Predictor + Mask training + Engineering tricks to prevent representation collapse (VICReg, EMA, etc.).
Sora (DiT architecture) = Diffusion + Transformer + Patches, essentially gluing three existing methods together.
Marble = Gaussian Splatting + Diffusion + 3D representation, a combination of methods that only matured in 2023.

Each component was a popular method in its own field over the past few years, picked by engineers and glued together into a new architecture.
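
As one example of how small these glued-on components can be, here is a sketch of an EMA (exponential moving average) target update, one of the collapse-prevention tricks named in the JEPA line above. The weights, the decay value of 0.99, and the function name are illustrative assumptions, not values from any specific JEPA release.

```python
def ema_update(target_weights, online_weights, decay=0.99):
    """Move each target weight a small step toward the matching online weight."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_weights, online_weights)]

target = [0.0, 0.0]        # slowly moving copy used to produce prediction targets
online = [1.0, -2.0]       # weights being trained by gradient descent (held fixed here)
for _ in range(500):
    target = ema_update(target, online)
```

The target copy trails the online weights smoothly instead of jumping with every update, which gives the predictor a slowly moving goal. Why this particular trick is preferred over alternatives is exactly the kind of choice that rests on engineering experience more than theory.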

The key problem is: there is no theory telling you how to glue them. Why does Sora use Diffusion + Transformer + Patches instead of Diffusion + LSTM + Pixels? Why does JEPA use mask + predict instead of contrastive learning? Why does DreamerV3 use an RNN instead of a Transformer for latent state prediction? Behind these choices, there is no “theoretically correct” argument, only the engineering intuition of “these components are hot recently, they work individually, let’s try gluing them together.”

In the Transformer era, this “patchwork by experience” approach was backstopped by massive replication and validation: when a paper came out, hundreds of teams would replicate, modify, and compare it. The world model field currently lacks this ecosystem almost entirely. Sora is closed-source and cannot be replicated externally. Cosmos is partially open-source, but its training cost is estimated at tens of millions of dollars, so few can run it. JEPA is open-source, but it produces abstract representations rather than video, a completely different paradigm from Sora, so the two cannot be compared directly and the community is unsure which to believe. Each company is doing its own thing, and replication work is scarce.

The deeper problem is that no one knows how well any of these models actually work. IntPhys 2 only tests physical common sense. VBench tests video generation quality, while Video-MME tests video understanding. Robotics benchmarks (ManiSkill, RoboCasa) are another set entirely. No cross-architecture comparison is possible: Sora scoring high on VBench doesn’t mean it works for robot control, and V-JEPA scoring slightly higher on IntPhys 2 doesn’t mean it can produce good videos. Each camp defines its own benchmarks, so comparing scores across schools has limited meaning.

So what everyone is doing now is essentially betting on an architecture based on intuition, building it, releasing it, and waiting 6-24 months for market and benchmark feedback. When Sora was released in February 2024, everyone was amazed. About two years later, in March 2026, OpenAI itself shut it down. Marble was released in November 2025, only about half a year ago; it will take 2-3 years to tell if it’s truly good. AMI Labs has about $1 billion in seed funding but no product yet; it will be 2-3 years before we know whether the JEPA path can succeed.

There are also several secondary axes (whether to bake physics engines into the model, whether to use 2D video or a native 3D representation) on which there is no consensus. Every choice is engineering patchwork, not the result of selection. The entire field is currently in a “trial-and-error phase,” far from being able to say which architecture is the correct one.

A direct consequence of “patched-together” architectures is that no one knows whether they are durable. Current world models look passable within short time windows (seconds to tens of seconds), but as the time horizon extends (minutes or more), errors accumulate frame by frame and physical consistency gradually breaks down. Sora’s 60-second videos look decent for the first few seconds, but by 30-40 seconds, object deformations and interpenetration begin to appear. DreamerV3 excels on short-horizon tasks like Atari, but on long-horizon open-ended tasks, the longer it runs, the more its predictions drift. This is a side effect of patched-together architectures never having been selected for long-term stability, and currently no one knows how to solve it fundamentally at the architectural level. This challenge will be revisited as an open problem in Chapter 7.
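The frame-by-frame accumulation described above can be illustrated with a toy autoregressive rollout. This is only a sketch under strong assumptions: the scalar dynamics, the 1% one-step error, and the names `rollout`, `true_step`, and `learned_step` are hypothetical and bear no relation to how Sora or DreamerV3 are actually implemented.

```python
def rollout(step, x0, horizon):
    """Apply `step` autoregressively, feeding each prediction back in as the next input."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(step(xs[-1]))
    return xs

def true_step(x):
    # Hypothetical ground-truth dynamics: the state grows 5% per step.
    return 1.05 * x

def learned_step(x):
    # Hypothetical learned model that is off by 1% on every single step.
    return 1.05 * 1.01 * x

horizon = 100
truth = rollout(true_step, 1.0, horizon)
pred = rollout(learned_step, 1.0, horizon)

# Relative error at each step; the model never sees the true state again after x0.
rel_err = [abs(p - t) / abs(t) for p, t in zip(pred, truth)]

for k in (1, 10, 50, 100):
    print(f"step {k:3d}: relative error = {rel_err[k]:.3f}")
```

Because each prediction becomes the next input, a 1% one-step error compounds to roughly 1.01^100 - 1, or about 170%, after 100 steps. A model that looks accurate when judged one frame at a time can still be unusable over long horizons.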

5.3 LeCun’s EBM Example: Betting on an Architecture Doesn’t Guarantee Success

Here is a very concrete cautionary example.

Since the 1990s, Yann LeCun has advocated the Energy-Based Model (EBM) line of architectures: train an energy function E(x) such that correct data points have low energy and incorrect ones have high energy; at inference time, find the point with the lowest energy. His 2006 “A Tutorial on Energy-Based Learning” systematized EBMs into a complete framework and became a classic text in the field.
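The “inference as energy minimization” idea can be sketched in a few lines. This is a toy under stated assumptions: the energy below is hand-written rather than trained, it takes an input `x` and a candidate answer `y` (a common EBM formulation; the text above writes E(x) for brevity), and the names `energy` and `infer` are hypothetical.

```python
def energy(x, y):
    """Hypothetical hand-written energy: low when y is compatible with x (y close to 2*x)."""
    return (y - 2.0 * x) ** 2

def infer(x, y_init=0.0, lr=0.1, steps=200, eps=1e-5):
    """Inference as minimization: descend E(x, y) over y using a numerical gradient."""
    y = y_init
    for _ in range(steps):
        grad = (energy(x, y + eps) - energy(x, y - eps)) / (2.0 * eps)
        y -= lr * grad
    return y

y_star = infer(3.0)
print("lowest-energy answer for x=3:", y_star)
print("energy at that answer:", energy(3.0, y_star))
```

In a real EBM the energy is a trained network, and the hard part is training it so that incorrect points actually end up with high energy; the minimization loop itself is the easy part.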

LeCun pushed EBMs for over twenty years, betting that this path would eventually become the dominant generative model architecture in deep learning. The result: EBMs did not become mainstream. The generative model landscape was captured by three other paths: GANs (2014), VAEs (2013), and Diffusion Models (starting around 2020). EBMs remained a relatively niche branch (although Score-based generative models are mathematically related to EBMs, the actual architectures are not called EBMs).

When LeCun started promoting JEPA in 2022, he was essentially replacing the “energy” in EBMs with “representation space prediction,” sidestepping the main engineering difficulty of training EBMs (the intractable partition function). In this sense, JEPA is a pragmatic scaling-back of EBMs: the same idea in a more trainable form.
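For readers unfamiliar with the term, the partition function appears when an energy is turned into a normalized probability. The standard Gibbs form is shown below; the notation (E_θ for the energy, Z(θ) for the normalizer) is introduced here for illustration and is not taken from the article.

```latex
p_\theta(x) = \frac{\exp\bigl(-E_\theta(x)\bigr)}{Z(\theta)},
\qquad
Z(\theta) = \int \exp\bigl(-E_\theta(x')\bigr)\, dx'
```

The integral in Z(θ) runs over every possible input. For high-dimensional data such as images or video it has no closed form and is expensive to approximate, which is why maximum-likelihood training of EBMs is hard. A loss defined directly on predicted representations never needs to compute Z(θ).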

This story tells us one thing: even for a deep learning godfather like Yann LeCun, betting on an architecture by intuition is not guaranteed to pay off. He bet on EBMs for over 20 years and lost, eventually switching to JEPA. Among the new architectures in the world model field today (DreamerV3, JEPA, DiT, Marble), which will ultimately prevail like the Transformer and which will fade away like EBMs is currently entirely unknown.

5.4 Choosing an Axis is Essentially Gambling

The central axis choice of “generating pixels vs. abstract representation” is fundamentally a gamble. There is no theoretical proof for which is correct; it’s just different people betting based on experience and intuition.

Sora bet on generating pixels and was shut down in March 2026. It didn’t win. V-JEPA bet on abstract representation, currently performs slightly better on intuitive physics benchmarks, but is far from an overall victory. Marble bet on 3D representation; the product just launched; the results remain to be seen. LeCun bet on EBMs for 20 years and lost; his current bet on JEPA is also uncertain.

The whole field currently resembles the NLP community around 2014