@dotey: https://x.com/dotey/status/2053351712149135385
Summary
NVIDIA's Jim Fan spoke at Sequoia AI Ascent 2026, declaring the VLA architecture obsolete and proposing World Action Models (WAM) as a new paradigm for robotics. He introduced key technologies including DreamZero, EgoScale, and the neural simulator Dream Dojo.
The Endgame of Robotics: Nvidia’s Jim Fan Declares the End of the VLA Era and the Rise of WAM
Jim Fan is the head of Nvidia’s Robotics and AI research group (GEAR Lab). Over the past few years, he has championed the GR00T humanoid robot foundation model, which relies on the VLA (Vision-Language-Action) architecture. In a recent 20-minute speech at Sequoia AI Ascent 2026, titled “Robotics’ End Game,” his first major announcement was that the VLA approach is obsolete—including GR00T, which he himself promoted just six months ago.
The new paradigm replacing it is called the World Action Model (WAM), with Nvidia’s DreamZero (released in February) as its flagship example. He describes this approach as “bottom-up isomorphism”: replicating the three steps taken by Large Language Models (LLMs)—pretraining, alignment, and reinforcement learning—but substituting video world models for language models and replacing teleoperation data with first-person human video. His ultimate goal is to enable robots to design and manufacture their next generation by 2040. He is 95% confident in this prediction.
Jim Fan @DrJimFan · May 8
I promise this will be the best 20 min you spend today! Robotics: Endgame, the sequel to my last year’s Sequoia AI Ascent talk, “Physical Turing Test”. I laid out the roadmap for solving Physical AGI as a simple parallel to the LLM success story. Be a good scientist, copy…
Source: Sequoia Capital AI Ascent 2026, released April 30, 2026.
Original Video: https://www.youtube.com/watch?v=3Y8aq_ofEVs
Key Takeaways
- The End of the VLA Route: Jim publicly declared the VLA route obsolete. The new paradigm is the World Action Model (WAM), represented by DreamZero (14 billion parameters).
- Farewell to Teleoperation Data: Teleoperation has a low physical ceiling. It is predicted to drop to near zero within a year or two, replaced by sensorized human data.
- Neural Scaling Laws: EgoScale was pretrained using 21,000 hours of first-person human video. The team discovered neural scaling laws for dexterous manipulation (R^2 = 0.998).
- Neural Simulators: Dream Dojo uses 44,000 hours of human video to train a neural simulator that completely bypasses traditional physics engines.
- Countdown to the Endgame: A prediction that the robotics endgame (physical automated research) will be achieved by 2040, with 95% confidence.
From the DGX-1 Signature to “Bottom-Up Isomorphism”
Jim opened with an anecdote. In the summer of 2016, at what was then OpenAI’s office, Jensen Huang walked in wearing his signature leather jacket, carrying a large metal tray engraved with: “To Elon and the OpenAI team, for the future of computing and humanity.” It was the world’s first DGX-1.
Jim was OpenAI’s first intern at the time and rushed to sign it. “At that moment, I had no idea what I was signing.” Andrej Karpathy was signing nearby as well. The machine is now housed in the Computer History Museum. Jim added that he felt as old as a dinosaur.
Note: Jim Fan (Fan Linxi) is Nvidia’s Director of Robotics and AI and Distinguished Scientist, leading the GEAR Lab and the GR00T humanoid robot project. During his internship at OpenAI in 2016, his mentors were Ilya Sutskever and Andrej Karpathy. He later completed his PhD at Stanford under Fei-Fei Li.
This story served to introduce his core framework. Quoting Ilya’s line, “If you believe in deep learning, deep learning believes in you,” Jim noted that LLMs reached their current state in just three leaps over six years: GPT-3 pretraining, InstructGPT supervised fine-tuning, and o1-style reinforcement learning, with automated research as the endgame.
He made a decision: copy the playbook, rename it, and call it “The Great Parallel” (Bottom-Up Isomorphism). Instead of simulating the next state of a string of text, simulate the next state of the physical world. Use action fine-tuning to converge the model on the specific behaviors a robot needs, and let reinforcement learning cover the final mile.
If you can’t beat them, join them.
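The recipe maps cleanly onto three stages. Below is a minimal Python sketch of the pipeline shape, purely illustrative: every function name is a stand-in rather than Nvidia’s training code, and the 21,000/54-hour split simply echoes the EgoScale figures discussed later.

```python
# Hypothetical sketch of the "bottom-up isomorphism" recipe.
# None of these names come from Nvidia's code; they only mirror
# the three LLM-style stages described in the talk.

def pretrain_world_model(videos):
    """Stage 1: learn to predict the next state of the physical
    world from large-scale egocentric human video."""
    return {"stage": "pretrained", "video_hours": len(videos)}

def action_finetune(world_model, robot_trajectories):
    """Stage 2: collapse "all possible futures" into action
    trajectories a specific robot embodiment can execute."""
    world_model["action_head"] = f"fit on {len(robot_trajectories)}h"
    return world_model

def reinforcement_learn(policy, environment_steps):
    """Stage 3: RL covers the final mile, e.g. inside a neural
    simulator rather than a hand-built physics engine."""
    policy["rl_steps"] = environment_steps
    return policy

wam = reinforcement_learn(
    action_finetune(pretrain_world_model(videos=range(21_000)),
                    robot_trajectories=range(54)),
    environment_steps=1_000_000,
)
print(wam)
```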
What’s Wrong with VLA? Parameters Are All Stacked on Language
For the past three years, the mainstream architecture in robotics has been VLA (Vision-Language-Action). Nvidia’s own GR00T and Physical Intelligence’s π0 both fall into this category.
Jim pointed out a structural problem: these models should actually be called LVA, because the majority of parameters are stacked on language. Language is the first-class citizen, vision comes second, and action is relegated to the bottom.
VLAs excel at encoding knowledge and nouns, but struggle with physics and verbs. The center of gravity is in the wrong place.
He cited the classic demo from the original RT-2 paper: asking a robot to push a Coke can next to a photo of Taylor Swift. The model had never seen Taylor Swift, but it could generalize to the task. However, what generalized was the noun (recognizing Taylor Swift), not the verb (how to push, what angle to use, how much force to apply).
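To make the “parameters are stacked on language” point concrete, here is a back-of-the-envelope tally for a hypothetical VLA. The component sizes are invented but typical, using the rough 12 · layers · d_model² approximation for transformer blocks; none of these numbers describe GR00T or π0.

```python
# Back-of-the-envelope parameter tally for a hypothetical VLA.
# Sizes are illustrative only, using the rough 12 * layers * d_model^2
# approximation for transformer blocks (embeddings ignored).

def transformer_params(layers, d_model):
    return 12 * layers * d_model ** 2

components = {
    "language backbone": transformer_params(layers=32, d_model=4096),  # ~6.4B
    "vision encoder":    transformer_params(layers=24, d_model=1024),  # ~0.3B
    "action head":       2 * 4096 * 256,                               # ~2M MLP
}

total = sum(components.values())
for name, p in components.items():
    print(f"{name:18s} {p/1e9:6.2f}B  ({100 * p / total:5.2f}%)")
```

With these illustrative sizes, the language backbone holds about 95% of the parameters and the action head well under 0.1%, which is exactly the imbalance being criticized.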
From AI Junk Videos to DreamZero
If VLA isn’t the answer, what is the next pretraining paradigm? The answer turned out to be video models, which learn internally to simulate the next state of the physical world.
How do you make these world models useful? Perform action fine-tuning. Collapse the superposition of “all possible futures” into a single action trajectory that is meaningful for a real robot.
Nvidia’s answer is DreamZero. This is a new type of policy model that “dreams” a few seconds into the future before executing an action, then acts based on that dream. DreamZero decodes both the next frame and the next action simultaneously. Here, vision and action finally become true “first-class citizens.”
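The description suggests a control loop along these lines: dream a short horizon, decode an action chunk alongside it, execute only the first action, then re-plan. The sketch below is a stand-in under that reading, not DreamZero’s actual architecture or API.

```python
import numpy as np

# Hypothetical WAM-style control loop: the policy "dreams" a short
# future and decodes an action chunk jointly, executes the first
# action, then re-plans. The model is a noise stub, not DreamZero.

rng = np.random.default_rng(0)

def wam_forward(obs_frames, horizon=8):
    """Stand-in for a joint frame/action decoder: returns imagined
    future frames and an action chunk over the same horizon."""
    h, w, c = obs_frames[-1].shape
    dreamed_frames = rng.random((horizon, h, w, c))   # imagined futures
    action_chunk = rng.uniform(-1, 1, (horizon, 7))   # e.g. a 7-DoF arm
    return dreamed_frames, action_chunk

def step_env(frame, action):
    """Stand-in for the real robot/environment."""
    return np.clip(frame + 0.01 * action.mean(), 0, 1)

frame = rng.random((64, 64, 3))
for t in range(3):                       # three re-planning cycles
    dreams, actions = wam_forward([frame])
    frame = step_env(frame, actions[0])  # execute one action, replan
    print(f"cycle {t}: dreamed {len(dreams)} frames, executed 1 action")
```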
Jim frankly admitted that DreamZero is not currently 100% reliable for every task. “It’s roughly at the GPT-2 stage—the direction is correct, but the performance is not yet stable or reliable enough.” He named this new architecture WAM (World Action Models).
Let us take a moment to mourn our dear VLA. It has fulfilled its historical mission. Rest in peace. Long live the World Action Models.
Note: The DreamZero paper (arXiv 2602.15922) was released in February 2026. With 14 billion parameters, it is based on the Wan2.1 video diffusion model. It has a key limitation: the 14B model needs roughly a 38x speedup from system-level optimization, plus GB200 hardware, to sustain closed-loop control at 7Hz, so the deployment threshold is very high.
The Data Revolution: From Teleoperation to “Data Collection Without Robot Involvement”
The past three years were the golden age of teleoperation. But teleoperation has a hard ceiling: 24 hours per day per robot.
“When I say 24 hours a day, I’m fooling myself. In reality, you’re lucky to get 3 hours of productive work a day, and even that depends on whether the ‘God of Robotics’ is feeling generous that day, because these machines throw tantrums and break down every day.”
So how do we break through? Wear the robot’s end-effector directly on a human hand to collect data, completely bypassing the robot body itself.
Nvidia’s solution is DexUMI, an exoskeleton device. Robot policies trained on exoskeleton data can run fully autonomously, with zero teleoperation data in the training set.
Robots are happy because they finally don’t have to participate in data collection.
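Mechanically, the appeal is that the recorded human trajectory is the action label: because the human wears the robot’s own end-effector, no robot body (and no teleoperator) sits in the loop. A hedged sketch, with field names that are assumptions rather than DexUMI’s actual data format:

```python
from dataclasses import dataclass

# Illustration of exoskeleton-style data collection: the human wears
# the robot's end-effector, so the logged wrist pose and gripper
# width serve directly as action labels. Field names are assumptions,
# not DexUMI's actual data format.

@dataclass
class ExoFrame:
    wrist_pose: tuple      # (x, y, z, roll, pitch, yaw), meters/radians
    gripper_width: float   # meters
    rgb_path: str          # wrist-mounted camera image

def to_training_pair(prev: ExoFrame, curr: ExoFrame):
    """Observation at time t, action = commanded pose at t+1."""
    return prev.rgb_path, (*curr.wrist_pose, curr.gripper_width)

log = [
    ExoFrame((0.40, 0.00, 0.20, 0, 0, 0), 0.08, "frame_000.png"),
    ExoFrame((0.41, 0.00, 0.19, 0, 0, 0), 0.06, "frame_001.png"),
    ExoFrame((0.42, 0.01, 0.18, 0, 0, 0), 0.02, "frame_002.png"),
]
dataset = [to_training_pair(a, b) for a, b in zip(log, log[1:])]
print(f"{len(dataset)} (obs, action) pairs, zero teleoperation")
```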
EgoScale: 21,000 Hours of Human Video and Scaling Laws
Nvidia introduced EgoScale: 99.9% of the training data comes from human first-person (egocentric) video.
Pretraining used 21,000 hours of in-the-wild human data, with zero robot data. The action fine-tuning stage used only 50 hours of high-precision motion capture glove data and 4 hours of teleoperation data—combined, this accounts for less than 0.1% of the total training volume.
The most important discovery was the neural scaling law for dexterous manipulation. There is an extremely clear log-linear relationship between the compute hours invested in pretraining and the optimal validation loss, with an R^2 value of a staggering 0.998.
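A log-linear law of this kind can be checked with an ordinary least-squares fit of validation loss against log compute. The data points below are invented placeholders to show the procedure, not EgoScale’s published measurements:

```python
import numpy as np

# Fit validation loss against log10(compute) and report R^2.
# The data points are invented placeholders, not EgoScale's numbers.

compute_hours = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 2.1e4])
val_loss      = np.array([0.92, 0.81, 0.70, 0.60, 0.49, 0.42])

x = np.log10(compute_hours)
slope, intercept = np.polyfit(x, val_loss, deg=1)
pred = slope * x + intercept

ss_res = np.sum((val_loss - pred) ** 2)
ss_tot = np.sum((val_loss - val_loss.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"loss ~ {slope:.3f} * log10(compute) + {intercept:.3f}, R^2 = {r2:.3f}")
```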
Jim ranked all data strategies by scalability: teleoperation sits in the least scalable corner, while first-person video, if it can spin up an FSD (Full Self-Driving)-style data flywheel, could reach 10 million hours within a year.
Dream Dojo: A Neural Simulator Without a Physics Engine
Just as LLM labs spend heavily on millions of coding environments for reinforcement learning (RL), robotics needs RL environments at scale, but sim-to-real (or real-to-sim-to-real) with traditional simulators falls short.
The further solution is Dream Dojo: ditch the physics engine entirely and transform a video world model into a complete neural simulator. The input is continuous action signals; the output is the next RGB frame and sensor states in real-time. No physical equations, no graphics engine—it is purely data-driven.
Not a single pixel in what you see is “real” in the traditional sense.
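In RL terms, Dream Dojo turns the world model into the environment itself. Below is a minimal gym-style sketch under that reading, with the learned next-frame predictor stubbed out by noise; nothing here is Nvidia’s actual interface.

```python
import numpy as np

# A neural simulator as an RL environment: continuous actions in,
# a *generated* RGB frame and predicted sensor state out. No physics
# equations, no graphics engine; the "dynamics" here are a stub for
# a learned video model.

class DreamEnv:
    """Gym-style wrapper around a hypothetical video world model."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.frame = self.rng.random((64, 64, 3))

    def _world_model(self, frame, action):
        # Stand-in for the learned next-frame predictor.
        drift = 0.01 * np.tanh(action).mean()
        noise = 0.001 * self.rng.standard_normal(frame.shape)
        return np.clip(frame + drift + noise, 0.0, 1.0)

    def step(self, action):
        self.frame = self._world_model(self.frame, action)
        sensors = {"proprio": np.tanh(action)}  # predicted, not measured
        reward, done = 0.0, False               # task-specific in practice
        return self.frame, sensors, reward, done

env = DreamEnv()
for _ in range(5):
    obs, sensors, reward, done = env.step(np.random.uniform(-1, 1, 7))
print("rolled out 5 steps with no physics engine")
```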
“Compute equals environment equals data. Or, as a certain wise man said: ‘The more you buy, the more you save.’ This message has been approved by my boss.”
The Endgame Roadmap: Three Achievements Before 2040
Jim compared the remaining path for robotics to three tech-tree achievements that must be unlocked:
- Physical Turing Test: Within 2–3 years, you won’t be able to tell whether a task is being performed by a human or a robot.
- Physical APIs: Use software and large models to orchestrate robot configurations, building “dark factories” and automated scientific labs.
- Physical Automated Research: Robots begin to design, improve, and manufacture the next generation of robots themselves.
As for the timeline, he drew a parallel to AI taking 14 years from AlexNet (2012) to Agents (2026). Add another 14 years, and you land exactly on 2040.
Our generation was born too late to explore the Earth during the Age of Discovery, and too early to reach for the stars to explore the universe. But we were born at just the right time to conquer the robotics challenge.
Five Questions Answered Quickly
Q: Is VLA really dead?
A: In terms of the keynote narrative, yes. However, Nvidia’s latest GR00T N1.7 (April 2026) paper still explicitly mentions “VLA models.” The paradigm shift has not yet been completed internally.
Q: Can DreamZero be used in production environments now?
A: No. Jim himself said it is “roughly at the GPT-2 stage.” The paper reveals that the 14B model runs closed-loop control at only 7Hz and requires GB200 hardware.
Q: Will teleoperation really be eliminated?
A: Jim predicts it will drop to near zero within a year or two. However, wearing capture devices to do chores is not an everyday necessity the way driving is, and the industry’s vast existing teleoperation infrastructure won’t become obsolete overnight.
Q: What does the scaling law for dexterous manipulation mean?
A: If R^2 = 0.998 holds true, it means that increasing human video data will lead to a predictable improvement in robot dexterity. This is the core empirical evidence of the entire presentation.
Q: What does Nvidia stand to gain from this?
A: WAMs and neural simulators have extremely high compute demands. Jim’s “the more you buy, the more you save” line openly acknowledges the commercial angle: a paradigm shift toward compute-hungry models naturally helps sell chips.
Finally: Three Open Questions Worth Tracking
- How DreamZero crosses the “GPT-2 Stage”: Whether its performance can be made stable and reliable over the next 12–18 months will determine the true power of this paradigm.
- Nvidia’s Internal Switch Away from the VLA Paradigm: Watch for substantive architectural change in their product updates. If the next generation is still VLA, the speech was more conceptual marketing than roadmap.
- Who Carries the First-Person Video Data Flywheel: Nvidia itself lacks a consumer hardware entry point, so watch who (e.g., Apple, Meta) can actually spin this data flywheel to the scale of tens of millions of hours.