@seclink: https://x.com/seclink/status/2067970118873993482
Summary
Today's mainstream, purely data-driven approaches to robotics suffer from low data efficiency and poor generalization. The newly proposed neurosymbolic physical intelligence paradigm breaks tasks down into two steps: world modeling and planning. It needs only 1-10 demonstrations to learn a new task, and it generalizes far better than traditional end-to-end solutions, offering a more reliable path to general-purpose robots.
Cached at: 06/20/26, 06:20 PM
Why AI Robots Learn New Tasks Far Worse Than Ordinary Humans
📌 Source: https://www.youtube.com/watch?v=3W36pd50Wqw
⚡ Key insight upfront: Robots need to be like humans: think first, then act—not practice millions of times to learn one skill.
The current mainstream, purely data-driven approach to robotic physical intelligence has two inherent flaws: low data efficiency and poor generalization. Learning a new task takes hundreds of times more samples than a human needs.
The newly proposed neurosymbolic physical intelligence paradigm decomposes tasks into “world modeling + planning.” It can learn a new task with only 1–10 demonstrations, with generalization capabilities far exceeding traditional end-to-end methods.
In the era of large models, the neurosymbolic approach is still irreplaceable—it’s a reliable path toward general-purpose robots.
Have you noticed a strange contrast? Generative AI can already write code, draw images, and pass professional exams, yet AI robots learn a new task hundreds of times more slowly than ordinary humans. Folding a box takes 100 hours of training data; performing simple excavator operations takes 200 collected demonstrations. Swap in an object that was not seen during training and the robot fails completely. Behind this contrast lies a fundamental limitation of the current mainstream approach to robotic physical intelligence.
In a recent academic talk, Jiajun, currently a researcher at Amazon and soon to join U Penn as an assistant professor, proposed a new paradigm for robotic physical intelligence. The core idea is to learn from how humans learn: people never pick up a new task through millions of trial-and-error attempts. Instead, they watch a demonstration, mentally “rehearse” the entire process, and act once they understand it. The paradigm has been validated in multiple tests, with performance far surpassing traditional methods, and it may be the direction general-purpose robots should take.
💻 The current mainstream approach has inherent flaws
Core conclusion: Pure data-driven end-to-end methods are inherently incapable of achieving general-purpose robots.
What is currently called physical intelligence in robotics refers to AI that can perceive, understand language, and execute actions in the physical world. It is the core capability for robots to actually perform tasks. Many tech companies on the US West Coast follow the mainstream approach of equating intelligence with “fitting a function to a dataset”: directly training models to output the next action from historical observations, with all capability learned purely from data.
The flaws of this approach are critical:
First, extremely low data efficiency and poor generalization. A task that a human can learn at a glance takes a robot hundreds of hours of training and hundreds of demonstrations. Switching to an object or scene not seen during training causes immediate failure.
Second, no compositionality. Traditional methods learn only isolated actions that cannot be assembled into complex tasks. A complex task requires each action to satisfy the constraints of the actions that follow: when picking up a cup and then hanging it, for example, the grasp location must be compatible with how the cup will be hung. The traditional approach cannot achieve this kind of matching.
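The cup example can be made concrete with a small sketch. It is only an illustration of coupling two steps through a shared constraint, not the method from the talk; the part names and the `compatible` rule are assumptions made up for this example.

```python
from itertools import product

# Hypothetical discrete options, invented for illustration.
GRASPS = ["handle", "rim", "body"]   # where the gripper holds the cup
HANG_CONTACTS = ["handle"]           # part of the cup that goes onto the hook


def compatible(grasp, hang_contact):
    # Assumed rule: the gripper must not cover the part that slides onto the hook.
    return grasp != hang_contact


def joint_choices(grasps, hang_contacts):
    """Enumerate (grasp, hang_contact) pairs that satisfy the coupling constraint."""
    return [(g, h) for g, h in product(grasps, hang_contacts) if compatible(g, h)]


choices = joint_choices(GRASPS, HANG_CONTACTS)
print(choices)
```

Choosing the grasp in isolation would allow a handle grasp that makes the later hanging step impossible; choosing the pair jointly rules it out before any motion is executed.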
🧠 New paradigm: Robots “think first,” then act
Core conclusion: Decompose tasks into “world modeling + planning,” use neurosymbolic methods for generalization, and learn new tasks with just a few samples.
The core of the proposed paradigm is to align with human intelligence. Humans do not perceive the world as raw pixels; they abstract objects and their properties, mentally simulate the outcomes of different actions before acting, and then derive a task plan. Applied to robots, this means decomposing physical intelligence into two steps, world modeling and motion planning, and using neurosymbolic concepts for abstraction. Put simply, the approach combines the learning ability of neural networks with the compositional abstraction of symbolic systems, breaking states and actions into composable modules that make planning easier for robots.
The new paradigm also changes how actions are modeled. Instead of learning actions directly, it formulates action generation as a constrained optimization problem: first define the set of conditions (constraints) the task must satisfy, then find the optimal solution that meets all of them. Rules that humans have already studied thoroughly, such as rigid-body physics and geometric constraints, can be supplied by existing mature models rather than learned from scratch from data. Only the parts specific to the current task need to be learned.
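As a rough illustration of what "composable modules" can mean on the symbolic side, the sketch below describes actions by preconditions and effects and searches for a plan with breadth-first search. This is a classical STRIPS-style toy, not the system described in the talk; the predicates and operator names are assumptions invented for the example, and in a neurosymbolic system the predicates would be grounded by learned perception models.

```python
from collections import deque
from typing import NamedTuple


class Operator(NamedTuple):
    name: str
    preconditions: frozenset   # facts that must hold before the action
    add_effects: frozenset     # facts the action makes true
    del_effects: frozenset     # facts the action makes false


def plan(initial, goal, operators):
    """Breadth-first search over symbolic states; returns operator names or None."""
    start, goal = frozenset(initial), frozenset(goal)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:          # every goal fact holds in this state
            return steps
        for op in operators:
            if op.preconditions <= state:
                nxt = (state - op.del_effects) | op.add_effects
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, steps + [op.name]))
    return None


# Hypothetical operators for a "hang the mug" task.
OPERATORS = [
    Operator("pick(mug)",
             frozenset({"hand_empty", "on_table(mug)"}),
             frozenset({"holding(mug)"}),
             frozenset({"hand_empty", "on_table(mug)"})),
    Operator("hang(mug, hook)",
             frozenset({"holding(mug)", "hook_free"}),
             frozenset({"hanging(mug)", "hand_empty"}),
             frozenset({"holding(mug)", "hook_free"})),
]

steps = plan({"hand_empty", "on_table(mug)", "hook_free"}, {"hanging(mug)"}, OPERATORS)
print(steps)  # ['pick(mug)', 'hang(mug, hook)']
```

Because each operator only states what it needs and what it changes, new operators can be added and recombined without touching the search procedure.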
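A minimal sketch of the "define constraints, then optimize" idea, under simplifying assumptions: candidates are points on a coarse 2D grid, the constraints are hand-written predicates, and the solver is a brute-force filter followed by `min`. The obstacle, the workspace limit, and the cost function are all invented for illustration; a real system would use continuous optimizers and learned or physics-based constraint models.

```python
import math


def solve(constraints, cost, candidates):
    """Return the lowest-cost candidate that satisfies every constraint, or None."""
    feasible = [c for c in candidates if all(check(c) for check in constraints)]
    return min(feasible, key=cost) if feasible else None


# Candidate placement positions on an 11 x 11 grid over a 1 m x 1 m table.
candidates = [(i / 10, j / 10) for i in range(11) for j in range(11)]

obstacle, radius = (0.5, 0.5), 0.2   # hypothetical circular obstacle
start = (0.45, 0.45)                 # hypothetical current gripper position

constraints = [
    lambda p: math.dist(p, obstacle) >= radius,  # known geometry: stay clear of the obstacle
    lambda p: p[0] >= 0.3,                       # task-specific condition (could be learned)
]


def cost(p):
    # Prefer placements close to where the gripper already is.
    return math.dist(p, start)


best = solve(constraints, cost, candidates)
print(best)
```

The geometric constraint comes from an existing model of the scene, while only the task-specific condition would need to be learned from demonstrations, which mirrors the division of labor described above.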
The reported experimental results are striking: after a single demonstration of hanging a coat hanger, the robot generalizes to hanging a mug and entirely new 3D-printed letters, with success rates exceeding 90%. By contrast, the traditional pure end-to-end method achieves a 0% success rate, and other improved methods perform far below the new paradigm. With just one demonstration, the robot can learn to combine three actions (push, rotate, and lift) to complete a long-horizon task, and it can automatically compose skills for new objects. The approach even extends to natural-language instructions such as “set the breakfast table but don’t put snacks,” completing variable long-horizon placement tasks.
🤖 In the era of large models, this approach is still irreplaceable
Core conclusion: Even if large models are very powerful, practical robots are inherently compositional systems, so the value of the neurosymbolic framework cannot be replaced.
With large models being so popular now, many people think we can just use large models end-to-end for robots, so why bother with neurosymbolic methods? The talk gives a very clear judgment:
First, any robot system that can be practically deployed is inherently compositional. To perform tasks, a robot needs perception, tracking, planning, and control—multiple different modules. It cannot be made into a pure end-to-end black box. The neurosymbolic framework provides a principled way to integrate these modules.
Second, large models have already learned a lot of common-sense semantic knowledge for us, so we don’t need to have robots learn it from scratch. Common-sense constraints can be directly obtained from pre-trained large language models and vision-language models, and then injected into the planning stage as conditions. This yields far higher data efficiency than learning from scratch. User preferences and additional requirements can also be added directly as constraints without retraining the entire model.
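The point about adding preferences without retraining can be sketched as follows. The planner function stays fixed and only the list of constraints changes. Everything here is hypothetical: `commonsense_constraints` is a hardcoded stand-in for whatever a pre-trained language or vision-language model would return, and the item catalog is invented for the example.

```python
# Hypothetical scene inventory.
ITEMS = {
    "bowl": {"category": "dishware"},
    "spoon": {"category": "cutlery"},
    "cereal": {"category": "food"},
    "chips": {"category": "snack"},
    "knife_block": {"category": "storage"},
}


def commonsense_constraints():
    # Stand-in for constraints obtained from a pre-trained model; hardcoded here.
    return [lambda name, info: info["category"] != "storage"]


def select_items(items, constraints):
    """Fixed planning step: keep the items that satisfy every constraint."""
    return sorted(n for n, info in items.items()
                  if all(c(n, info) for c in constraints))


base = select_items(ITEMS, commonsense_constraints())


def no_snacks(name, info):
    # User preference expressed as one more constraint.
    return info["category"] != "snack"


with_pref = select_items(ITEMS, commonsense_constraints() + [no_snacks])
print(base)       # ['bowl', 'cereal', 'chips', 'spoon']
print(with_pref)  # ['bowl', 'cereal', 'spoon']
```

The "no snacks" preference changes the outcome by appending one predicate; `select_items` itself is never modified or retrained.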
Multiple experiments support the advantages of this method. For a dishwashing task, only 10 demonstrations of washing a single plate are needed to generalize to washing two plates and placing them in a dish rack, with the robot autonomously reasoning about the correct order so that earlier plates do not block later ones. For a book-placing task, the training set contains at most two aligned books per scene, yet the model generalizes to more books and entirely new obstacles. According to the talk, these are tasks that end-to-end models (which directly output actions from sensor data without modular training) fundamentally cannot achieve.
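The ordering problem in the dish-rack example can be illustrated with a toy search. The blocking rule below is an assumption made up for this sketch (slots are numbered from the front of the rack, and an occupied slot blocks every slot behind it); the actual system's geometry reasoning is not described in this summary.

```python
from itertools import permutations


def blocked(slot, occupied):
    # Assumed rule: slot 0 is at the front; an occupied slot blocks the slots behind it.
    return any(o < slot for o in occupied)


def valid_order(assignment):
    """assignment maps plate -> slot. Return a placement order with no blocked step, or None."""
    for order in permutations(sorted(assignment)):
        occupied = set()
        for plate in order:
            if blocked(assignment[plate], occupied):
                break
            occupied.add(assignment[plate])
        else:
            # Loop finished without hitting a blocked placement.
            return list(order)
    return None


order = valid_order({"plate_a": 0, "plate_b": 1})
print(order)  # ['plate_b', 'plate_a']
```

Under this rule the plate destined for the back slot has to go in first, which is the kind of ordering constraint a single-plate demonstration never shows explicitly.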
🧩 A clear path to general-purpose robots
Core conclusion: Combining models with different expertise and allowing the system to autonomously iterate—this approach has already reached the threshold of general intelligence.
The ultimate goal of this framework is to achieve general-purpose physical intelligent robots capable of few-shot learning (learning new tasks with very few examples) and generalization (transferring learned abilities to unseen scenes and tasks). These are the human-level capabilities that general-purpose robots must possess.
Its core logic is very clear: let models with different expertise do what they are good at—vision-language models understand task goals and semantics, future-state prediction modules verify physical feasibility, diffusion models (a class of generative AI models that can produce multiple results satisfying constraints, used here to generate robot motion trajectories) generate multiple feasible motion trajectories and poses, and finally neurosymbolic reasoning integrates all results to select the optimal solution before execution.
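The division of labor described above can be sketched as a pipeline of stubs. None of these functions call a real model: `propose_goal` stands in for a vision-language model, `sample_trajectories` for a diffusion model, `feasible` for a future-state prediction module, and `select` for the final reasoning step. All names, numbers, and the 1-D "lift height" trajectory representation are assumptions for illustration.

```python
import random


def propose_goal(instruction):
    # Stand-in for a vision-language model: maps an instruction to a target height (m).
    return {"target_height": 0.30}


def sample_trajectories(goal, n, rng):
    # Stand-in for a diffusion model: n candidate lift trajectories with noisy end heights.
    trajs = []
    for _ in range(n):
        end = goal["target_height"] + rng.uniform(-0.1, 0.1)
        trajs.append([end * t / 4 for t in range(5)])  # 5 waypoints from 0 up to end
    return trajs


def feasible(traj, max_height=0.35):
    # Stand-in for future-state prediction: reject trajectories above a height limit.
    return max(traj) <= max_height


def select(trajs, goal):
    # Stand-in for the integrating step: keep feasible candidates, pick the closest to the goal.
    ok = [t for t in trajs if feasible(t)]
    if not ok:
        return None
    return min(ok, key=lambda t: abs(t[-1] - goal["target_height"]))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
goal = propose_goal("lift the mug")
candidates = sample_trajectories(goal, 16, rng)
chosen = select(candidates, goal)
print(chosen)
```

Each stub can be replaced by a stronger model without changing the others, which is the practical benefit of keeping the modules separate.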
In the long term, this framework supports autonomous continuous iterative learning by robots: starting from basic small skills, robots can explore the physical world by themselves, acquire new experience, and continuously upgrade their own models without requiring humans to constantly provide new training data.
The research team has also open-sourced a closed-loop robotic agent programming framework called Retriever following this approach, for use by the entire field. It has already been validated on multiple real-world robotic tasks.
💡 Key quotes
- Robots should be like humans: think first, then act, not practice millions of times to learn a skill.
- Don’t learn everything from scratch from data; physical laws and common sense already exist, so just use them.
- For general-purpose robots to reach human level, they need to learn from one demonstration and work in new scenarios.
- Any practical robot system is inherently compositional; it cannot be a pure end-to-end black box.
- In the era of large models, the neurosymbolic approach still has irreplaceable research and application value.
- We are far from fully exploiting the existing knowledge within large models.