@Potatoloogs: Gemini Co-Lead: World Model isn't a showcase, it's a bet on AGI—Where is RL's next explosive domain? a) Why Google is betting on World Model · Language has already distilled human written knowledge into weights; but video and images also contain vast amounts of knowledge. Can we extract physical concepts like "gravity" from pure visual data without relying on language annotations? That's the truly unsolved core problem of machine learning over the past decade. b) RL post-training: A greenfield, but with structural constraints. c) Memory and continual learning: The answer may not lie in weights. d) Can AI truly "innovate"? The capability Vinyals is most uncertain about. e) Advice for entrepreneurs.

X AI KOLs Timeline 06/08/26, 02:13 AM News

world-model reinforcement-learning agi google gemini post-training memory continual-learning

Summary

Gemini co-lead Vinyals discusses World Model as a key to AGI. He argues that video data contains physical knowledge and that RL post-training has huge potential but faces structural constraints, and he is optimistic about non-parametric memory systems.

Gemini Co-Lead: World Model isn't a showcase, it's a bet on AGI. Where is RL's next explosive domain?

a) Why Google is betting on World Model

· Language has already distilled human written knowledge into weights, but video and images also contain vast amounts of knowledge. Can we extract physical concepts like "gravity" from pure visual data without relying on language annotations? That is the core problem machine learning has left unsolved over the past decade.
· Currently, Omni can take video as input, edit video, and precisely control visual behavior with language; but Vinyals believes the true "GPT moment for video" (understanding that emerges purely from visual data, without language annotation) has not yet arrived.
· The value of World Model isn't just generating cool videos: it can serve as a physical simulator, allowing robots to train at low cost in virtual environments and then transfer to reality. Precise grasping, force feedback, and similar skills are still hard gaps.

b) RL post-training: A greenfield, but with structural constraints

· Vinyals calls post-training a "complete greenfield," not because current models are insufficiently capable, but because far less compute has been invested in it than in game-era RL, so its potential is far from realized.
· Game RL (e.g., Go) naturally has unlimited training data: every move creates a new state, so complexity grows for free. LLM RL does not have this property: the data upper bound is a real constraint, and "how to create a source of infinite complexity" is an unsolved core problem.
· A surprising finding in RL generalization: post-training only on math and code can cause reasoning ability to generalize to completely different domains (e.g., tax problems); but Vinyals still tends to believe that training on a wider distribution matters more for the upper bound.
· Current structural dilemma in RL: for most valuable tasks, it's impossible to write a verifier; but evaluating a solution is often easier than constructing it (analogous to NP problems), which leaves room for "model as judge."

c) Memory and continual learning: The answer may not lie in weights

· Memory has levels: working memory (current context) and episodic memory (retrievable historical experience), similar to L1/L2 cache in computers. Transformers are already strong at working memory, but cross-session accumulation is a weakness.
· A mechanism Vinyals clearly favors: file-system-like non-parametric memory. Write knowledge into files, store it in a structured way, and retrieve it on demand, rather than baking personal memories into weights.
· The practical reason: when serving models at scale, only one set of weights can be deployed; maintaining different weights for each user is nearly infeasible. A file-system-like memory can give "each person their own knowledge base" while sharing common weights.
· He believes a breakthrough in memory will affect AI capabilities no less than the emergence of reasoning models did a year and a half ago.

d) Can AI truly "innovate"? The capability Vinyals is most uncertain about

· Current models can execute, optimize, and reason; but can they truly produce "tasteful new ideas"? Vinyals himself says he's not sure, and it's hard to evaluate.
· On the analogy of Go's Move 37: in science and ML research, he hasn't yet seen a truly stunning original idea from a model; but he believes "it will happen soon."

e) Advice for entrepreneurs

· Building evals and accumulating high-quality data is the most valuable thing you can do even without training your own models: a good eval could even become an industry standard adopted by big companies.
· If you want to build a moat, don't train weights; build domain knowledge bases. As models' "continual learning" capabilities improve, a deeply specialized knowledge base in a vertical domain scales better than periodic re-fine-tuning.
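
The file-system-like memory described in section c) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not anything Vinyals or Google has described: the `FileMemory` class, the one-JSON-file-per-topic layout, and the word-overlap scoring are all hypothetical. The point is only that each user gets a private store on disk while the model weights stay shared and untouched.

```python
import json
import tempfile
from pathlib import Path


class FileMemory:
    """Hypothetical per-user note store; model weights are never modified."""

    def __init__(self, root: Path, user_id: str):
        self.dir = root / user_id
        self.dir.mkdir(parents=True, exist_ok=True)

    def write(self, topic: str, text: str) -> Path:
        # One file per topic keeps the store structured and easy to inspect.
        path = self.dir / f"{topic}.json"
        path.write_text(json.dumps({"topic": topic, "text": text}))
        return path

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        # Toy relevance score (an assumption): words shared with the query.
        words = set(query.lower().split())
        scored = []
        for path in sorted(self.dir.glob("*.json")):
            note = json.loads(path.read_text())
            overlap = len(words & set(note["text"].lower().split()))
            if overlap:
                scored.append((overlap, note["text"]))
        scored.sort(key=lambda pair: -pair[0])
        return [text for _, text in scored[:k]]


root = Path(tempfile.mkdtemp())
alice = FileMemory(root, "alice")
bob = FileMemory(root, "bob")
alice.write("editor", "alice prefers tabs in python files")
bob.write("editor", "bob prefers spaces in python files")

# The same query returns different notes depending on whose store is read.
alice_hits = alice.retrieve("tabs or spaces in python")
bob_hits = bob.retrieve("tabs or spaces in python")
```

A real system would likely replace the word-overlap score with embedding or full-text search, but the separation between shared weights and per-user files would look the same.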

Original Article

Cached at: 06/08/26, 05:18 AM

Cursor Trains Composer 2: Pre-Training Lets Models “Learn Knowledge”, RL Lets Models “Know Who They Are”

a) Why Cursor Trains Its Own Model

Think of a model as a storage hard drive—it has a limited capacity for information.

Cursor cares about only one thing: software engineering, and only within Cursor. Dedicating all weights exclusively to this single task yields better performance and much lower inference costs (Composer is an order of magnitude cheaper than models like Opus).

Another ceiling: prompt engineering has its limits. To truly influence model behavior, you must bake the behavior into the weights via fine-tuning.

b) Composer 2 Training Approach: Two Axes in Parallel

Base model: Kimi 2.5 (a 1-trillion-parameter MoE with 30B activated parameters).

Two steps: large-scale intermediate training, often called mid-training (code tokens, close to pre-training scale) → large-scale RL.

The essential difference between intermediate training and RL:

Intermediate training teaches the model “what code looks like” (next token prediction);
RL teaches the model “to write correct code”: the model takes direct actions within the Cursor harness and learns to invoke tools, navigate the environment, and distinguish between “writing code” and “writing correct code.”
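
To make the gap between “writing code” and “writing correct code” concrete, here is a minimal Python sketch of an outcome-based reward. It is not Cursor's harness: the function `reward_from_tests`, the `solve` naming convention, and the test cases are all hypothetical. Both candidates look like plausible code to a next-token predictor, but only one of them passes the tests.

```python
def reward_from_tests(candidate_source: str, tests: list[tuple[tuple, object]]) -> float:
    """Return the fraction of tests passed by the `solve` function in the source."""
    namespace: dict = {}
    try:
        # Toy only: real systems run untrusted code in a sandbox, never via bare exec.
        exec(candidate_source, namespace)
        solve = namespace["solve"]
    except Exception:
        return 0.0  # code that does not even load earns nothing
    passed = 0
    for args, expected in tests:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on one test counts as a failure for that test
    return passed / len(tests)


# Hypothetical task: add two integers.
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
plausible = "def solve(a, b):\n    return a * b\n"  # looks like code, wrong behavior
correct = "def solve(a, b):\n    return a + b\n"

plausible_reward = reward_from_tests(plausible, tests)
correct_reward = reward_from_tests(correct, tests)
```

The reward depends only on what the code does when run, which is the signal next-token prediction alone does not provide.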

c) The Essence of RL: Telling the Model “Who You Are”

After pre-training, the model has absorbed the full spectrum of human knowledge. But faced with a math problem, it doesn’t know “what kind of person it is”: an expert, or a student still learning?

RL tunes this knob: you are an expert, you must get things right.

SFT = knowledge transfer; RL = sharpening behavior.

Therefore, RL’s applicability extends far beyond “tasks requiring verifiable rewards”: even for summarization or style, you can use an LLM as judge with clear rubrics to guide RL.
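
A rubric-guided reward of this kind might be sketched as follows in Python. This is an illustration under stated assumptions, not Cursor's setup: the rubric items, their weights, and `stub_judge` are hypothetical, and the stub uses keyword checks where a real system would call an LLM.

```python
from typing import Callable

# A judge maps (rubric item, response) to a score in [0, 1].
Judge = Callable[[str, str], float]

# Hypothetical rubric for a summarization task; the weights are assumptions.
RUBRIC = {
    "mentions the main result": 0.5,
    "is at most 20 words": 0.3,
    "avoids first person": 0.2,
}


def stub_judge(item: str, response: str) -> float:
    # Keyword-based stand-in for an LLM judge, used only to keep this runnable.
    words = response.lower().split()
    if item == "mentions the main result":
        return 1.0 if "result" in words else 0.0
    if item == "is at most 20 words":
        return 1.0 if len(words) <= 20 else 0.0
    if item == "avoids first person":
        return 0.0 if {"i", "we"} & set(words) else 1.0
    raise KeyError(item)


def rubric_reward(response: str, judge: Judge, rubric: dict[str, float]) -> float:
    # Weighted average of per-item scores, normalized by the total weight.
    total = sum(rubric.values())
    return sum(w * judge(item, response) for item, w in rubric.items()) / total


good = rubric_reward("The main result is a 2x speedup.", stub_judge, RUBRIC)
weak = rubric_reward("I think we did some things.", stub_judge, RUBRIC)
```

Splitting the judgment into explicit rubric items makes the scalar reward easier to audit than a single holistic score from the judge.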

d) The Core Challenge of RL Infrastructure: The Environment Must Be as Close to Real Production as Possible

The most powerful RL environment is your own product, because that’s where the model will actually work.

A counterintuitive finding: models can tell when they are in a fake environment and behave differently during RL training (they will “cheat,” learning tricks that score high only in the fake environment).

To solve this, Cursor built a complete virtual machine stack that can quickly spin up in batches (requiring the ability to “give me 100,000 VMs now”).

e) The Key Breakthrough for Long-Chain Agents: Training “Self-Summarization” into the RL Loop

Two difficulties with long-chain RL: i. Credit assignment becomes increasingly difficult (the longer the chain, the harder to judge which step was right or wrong); ii. The context window is limited.

Cursor’s solution: directly train “self-summarization” into the RL loop.

The model jointly learns two things: to generate good summaries, and to follow its own summary and continue the task.

Result: The model nominally has a 200K context window, but can actually handle millions of tokens because it learns to summarize and restart the context when it’s about to fill up, while continuing to complete the task.
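
The summarize-and-restart behavior can be sketched as a simple control loop in Python. This shows only the loop, not the training: in the approach described above the summarizer is learned inside the RL loop, whereas `keep_last_tokens` here is a hypothetical stand-in, and the function names and the tiny 16-token limit are assumptions chosen for readability.

```python
def run_with_summaries(steps, context_limit, summarize):
    """Process `steps` (lists of tokens), compacting the context before it overflows."""
    context: list[str] = []
    total_seen = 0
    resets = 0
    for step in steps:
        if len(context) + len(step) > context_limit:
            # Replace the whole context with a short summary, then keep going.
            context = summarize(context)
            resets += 1
        context.extend(step)
        total_seen += len(step)
    return context, total_seen, resets


def keep_last_tokens(context, n=4):
    # Hypothetical stand-in for a learned summary: keep only the last n tokens.
    return context[-n:]


# 10 steps of 6 tokens each, run under a 16-token context limit.
steps = [[f"t{i}_{j}" for j in range(6)] for i in range(10)]
final_context, total_seen, resets = run_with_summaries(steps, 16, keep_last_tokens)
```

In this toy run the loop processes 60 tokens without the context ever exceeding 16, which mirrors, at a much smaller scale, how a nominal 200K window can cover a far longer task.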

Similar Articles

@LaurenceMister: Has Gemini completely lost its mind?

@FeitengLi: Just said this morning: The intelligence of embodied intelligence should copy the homework of LLM + RL + Agentic. Here it is: Agentic VLA crushes the models of leading embodied companies across the board https://x.com/FeitengLi/status/205909864717506193...

@jakevin7: An interesting thing. The DeepSeek V4 technical report conducted a comprehensive evaluation of all major LLMs, concluding that Gemini 3.1 Pro has the strongest world knowledge among all models. Not GPT, not Claude, but Gemini. But when people use Gemini...

World Labs' Fei-Fei Li on Creating Large World Models

World Models Explained: What Every AI Is Missing
