@seclink: https://x.com/seclink/status/2067970118873993482

X AI KOLs Following 06/19/26, 01:57 PM Papers

Summary

Current mainstream pure data-driven robot solutions suffer from low data efficiency and poor generalization. The newly proposed neuro-symbolic physical intelligence paradigm breaks down tasks into two steps: world modeling and planning. It requires only 1-10 demonstrations to learn new tasks, and its generalization ability far exceeds traditional end-to-end solutions, providing a more reliable path for general-purpose robots.

https://t.co/UTUKOWQpXd

Cached at: 06/20/26, 06:20 PM

Why AI Robots Learn New Tasks Far Worse Than Ordinary Humans

📌 Source: https://www.youtube.com/watch?v=3W36pd50Wqw

⚡ Key insight upfront: Robots need to be like humans: think first, then act—not practice millions of times to learn one skill.
The current mainstream, purely data-driven approach to robotic physical intelligence has inherent flaws: low data efficiency and poor generalization. Learning a new task takes a robot hundreds of times more samples than a human needs.
The newly proposed neurosymbolic physical intelligence paradigm decomposes tasks into “world modeling + planning.” It can learn a new task with only 1–10 demonstrations, with generalization capabilities far exceeding traditional end-to-end methods.
In the era of large models, the neurosymbolic approach is still irreplaceable—it’s a reliable path toward general-purpose robots.

Have you noticed a strange contrast? Generative AI can already write code, draw images, and pass professional exams, but AI robots learning a new task are hundreds of times slower than ordinary humans. Folding a box requires 100 hours of training data; operating simple excavator actions requires collecting 200 demonstrations. If you change to a new object not seen during training, the robot fails completely. Behind this contrast lies a fundamental limitation in the current mainstream approach to robotic physical intelligence.

In a recent academic talk, Jiajun, currently a researcher at Amazon and soon to join U Penn as an assistant professor, proposed a new paradigm for robotic physical intelligence. The core idea is to learn from humans: humans never learn a new task through millions of trial-and-error attempts. Instead, they watch a demonstration, mentally “rehearse” the entire process, and act once they understand it. According to the talk, the paradigm has been validated in multiple tests, with performance far surpassing traditional methods. Perhaps this is the direction general-purpose robots should take.

💻 The current mainstream approach has inherent flaws

Core conclusion: Pure data-driven end-to-end methods are inherently incapable of achieving general-purpose robots.

What is currently called physical intelligence in robotics refers to AI that can perceive, understand language, and execute actions in the physical world. It is the core capability for robots to actually perform tasks. Many tech companies on the US West Coast follow the mainstream approach of equating intelligence with “fitting a function to a dataset”: directly training models to output the next action from historical observations, with all capability learned purely from data.

The flaws of this approach are critical:
First, extremely low data efficiency and poor generalization. A new task that a human can learn in one look requires hundreds of hours of training and hundreds of demonstrations for a robot. Changing to a new object or scene not seen during training causes immediate failure.
Second, no compositionality. Traditional methods learn only isolated actions that cannot be assembled into complex tasks. A complex task requires each action to satisfy the constraints of the actions that follow it. For example, when a robot picks up a cup and then hangs it, the grasp location must be compatible with how the cup will hang. The traditional approach has no mechanism for enforcing this kind of matching.

🧠 New paradigm: Robots “think first,” then act

Core conclusion: Decompose tasks into “world modeling + planning,” use neurosymbolic methods for generalization, and learn new tasks with just a few samples.

The core of the proposed paradigm is to align with human intelligence. Humans do not perceive the world as raw pixels; they abstract it into objects and their properties, mentally simulate the outcomes of different actions, and only then derive a task plan. Applied to robots, this means decomposing physical intelligence into two steps, world modeling and motion planning, and using neurosymbolic concepts as the abstraction. Simply put, the approach combines the learning ability of neural networks with the abstract, compositional ability of symbolic systems, breaking states and actions into composable modules that make planning easier for the robot.
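To make the idea of composable symbolic states and actions concrete, here is a minimal sketch of a planner over hand-written facts. It is not the method from the talk: the `Action` type, the fact names (`at_wall`, `at_edge`, `handle_exposed`, `lifted`), and the use of breadth-first search are all assumptions chosen for illustration, and in a real system the preconditions and effects would be learned or grounded by neural modules rather than typed in by hand.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]  # facts that must hold before the action
    add_effects: FrozenSet[str]    # facts the action makes true
    del_effects: FrozenSet[str]    # facts the action makes false


def plan(initial: FrozenSet[str], goal: FrozenSet[str],
         actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first search over symbolic states; returns a shortest plan or None."""
    queue = deque([(initial, [])])
    visited = {initial}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:  # every goal fact holds in this state
            return steps
        for act in actions:
            if act.preconditions <= state:
                nxt = (state - act.del_effects) | act.add_effects
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, steps + [act.name]))
    return None


# Hypothetical skills loosely echoing a push / rotate / lift task.
ACTIONS = [
    Action("push", frozenset({"at_wall"}), frozenset({"at_edge"}), frozenset({"at_wall"})),
    Action("rotate", frozenset({"at_edge"}), frozenset({"handle_exposed"}), frozenset()),
    Action("lift", frozenset({"handle_exposed"}), frozenset({"lifted"}), frozenset()),
]

result = plan(frozenset({"at_wall"}), frozenset({"lifted"}), ACTIONS)
print(result)  # ['push', 'rotate', 'lift']
```

Because each skill is described only by what it needs and what it changes, adding a new skill means adding one entry to the list; the search composes it with the others without any retraining.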

The new paradigm also changes how actions are modeled. Instead of learning actions directly, it formulates action generation as a constrained optimization problem: first define the set of conditions (constraints) the task must satisfy, then find the optimal solution that meets all of them. Rules that humans already understand well, such as rigid-body physics and geometric constraints, can be taken from existing mature models rather than learned from scratch from data. Only the parts specific to the current task need to be learned.
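The constrained-optimization framing can be illustrated with a toy placement problem: pick a point on a table that satisfies every constraint and is as close as possible to a preferred target. This is only a sketch under stated assumptions. The table size, the obstacle, the function names, and the brute-force grid search are all made up for illustration; a real system would use a proper solver or a learned sampler, and the talk does not specify which.

```python
import math
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]
Constraint = Callable[[Point], bool]


def inside_table(p: Point) -> bool:
    # Known geometric constraint: the placement must lie on a 1.0 x 1.0 table.
    return 0.0 <= p[0] <= 1.0 and 0.0 <= p[1] <= 1.0


def clear_of(obstacle: Point, radius: float) -> Constraint:
    # Known geometric constraint: stay at least `radius` away from an obstacle.
    return lambda p: math.dist(p, obstacle) >= radius


def solve(constraints: List[Constraint], cost: Callable[[Point], float],
          n: int = 10) -> Optional[Point]:
    """Grid search: lowest-cost candidate that satisfies every constraint."""
    candidates = [(i / n, j / n) for i in range(n + 1) for j in range(n + 1)]
    feasible = [p for p in candidates if all(c(p) for c in constraints)]
    return min(feasible, key=cost) if feasible else None


target = (0.5, 0.5)  # where we would ideally place the object
constraints = [inside_table, clear_of((0.6, 0.5), 0.25)]
best = solve(constraints, cost=lambda p: math.dist(p, target))
print(best)  # (0.3, 0.5)
```

The constraints here are written down, not learned, which mirrors the point about reusing known geometry; only the cost or a task-specific constraint would need to come from demonstrations.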

The experimental results of this approach are very impressive: with only one demonstration of hanging a coat hanger, the robot can generalize to hang a mug and hang entirely new 3D-printed letters, with success rates exceeding 90%. In contrast, the traditional pure end-to-end method achieves 0% success rate, and other improved methods perform far below the new paradigm. With just one demonstration, the robot can learn to combine three actions—push, rotate, and lift—to complete a long-horizon task, and can automatically compose skills for new objects. It can even extend to natural language instruction scenarios, such as “set the breakfast table but don’t put snacks,” to complete variable long-horizon placement tasks.

🤖 In the era of large models, this approach is still irreplaceable

Core conclusion: Even if large models are very powerful, practical robots are inherently compositional systems, so the value of the neurosymbolic framework cannot be replaced.

With large models being so popular now, many people think we can just use large models end-to-end for robots, so why bother with neurosymbolic methods? The talk gives a very clear judgment:

First, any robot system that can be practically deployed is inherently compositional. To perform tasks, a robot needs perception, tracking, planning, and control—multiple different modules. It cannot be made into a pure end-to-end black box. The neurosymbolic framework provides a principled way to integrate these modules.

Second, large models have already learned a lot of common-sense semantic knowledge for us, so we don’t need to have robots learn it from scratch. Common-sense constraints can be directly obtained from pre-trained large language models and vision-language models, and then injected into the planning stage as conditions. This yields far higher data efficiency than learning from scratch. User preferences and additional requirements can also be added directly as constraints without retraining the entire model.
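The sketch below illustrates why adding a preference as a constraint needs no retraining: constraints are plain predicates that are combined at planning time. Everything in it is hypothetical. The item list and categories are invented, and `commonsense_constraints` is a hard-coded stand-in for constraints that would, in the approach described, be obtained from a pre-trained language or vision-language model; no real model API is called.

```python
from typing import Callable, Dict, List

Item = Dict[str, str]
Constraint = Callable[[Item], bool]

# Hypothetical scene inventory; names and categories are made up.
ITEMS: List[Item] = [
    {"name": "bowl", "category": "tableware"},
    {"name": "spoon", "category": "tableware"},
    {"name": "cereal", "category": "breakfast_food"},
    {"name": "chips", "category": "snack"},
    {"name": "sponge", "category": "cleaning"},
]


def commonsense_constraints() -> List[Constraint]:
    # Stand-in for constraints queried from a pre-trained model.
    # Hard-coded assumption: cleaning tools do not belong on a breakfast table.
    return [lambda item: item["category"] != "cleaning"]


def select_items(items: List[Item], constraints: List[Constraint]) -> List[str]:
    """Keep only the items that satisfy every active constraint."""
    return [it["name"] for it in items if all(c(it) for c in constraints)]


base = commonsense_constraints()
without_pref = select_items(ITEMS, base)

# User preference ("don't put snacks") is appended as one more predicate.
no_snacks: Constraint = lambda item: item["category"] != "snack"
with_pref = select_items(ITEMS, base + [no_snacks])

print(without_pref)  # ['bowl', 'spoon', 'cereal', 'chips']
print(with_pref)     # ['bowl', 'spoon', 'cereal']
```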

Multiple experiments verify the advantages of this method: for a dishwashing task, only 10 demonstrations of washing a single plate are needed to generalize to washing two plates and placing them in a dish rack, with the robot autonomously reasoning the correct order to avoid earlier plates blocking later ones. For a book-placing task, the training set contains at most two books aligned in a scene, yet the model generalizes to more books and entirely new obstacles. These are all tasks that end-to-end models (which directly output actions from sensor data without modular training) fundamentally cannot achieve.
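The ordering behavior in the dishwashing example can be pictured as a search over placement orders under a blocking constraint. The sketch below assumes a simplified rack model that is not taken from the talk: slots are numbered from front (0) to back, and a filled slot blocks access to every slot behind it. The function names and the exhaustive permutation search are illustrative only.

```python
from itertools import permutations
from typing import Optional, Sequence, Tuple


def is_valid_order(order: Sequence[int]) -> bool:
    """Assumed rule: a filled slot blocks every slot with a higher index."""
    filled = set()
    for slot in order:
        if any(f < slot for f in filled):
            return False  # an earlier plate sits in front of this slot
        filled.add(slot)
    return True


def find_order(slots: Sequence[int]) -> Optional[Tuple[int, ...]]:
    """Return the first placement order that never gets blocked, if any."""
    for order in permutations(slots):
        if is_valid_order(order):
            return order
    return None


print(find_order([0, 1]))     # (1, 0)
print(find_order([0, 1, 2]))  # (2, 1, 0)
```

The constraint is stated once and applies to any number of plates, which is one way to read the claim that single-plate demonstrations can extend to two plates.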

🧩 A clear path to general-purpose robots

Core conclusion: Combining models with different expertise and allowing the system to autonomously iterate—this approach has already reached the threshold of general intelligence.

The ultimate goal of this framework is to achieve general-purpose physical intelligent robots capable of few-shot learning (learning new tasks with very few examples) and generalization (transferring learned abilities to unseen scenes and tasks). These are the human-level capabilities that general-purpose robots must possess.

Its core logic is clear: let each model do what it is good at. Vision-language models interpret task goals and semantics. Future-state prediction modules verify physical feasibility. Diffusion models, a class of generative models that can produce multiple results satisfying given constraints, generate multiple feasible motion trajectories and poses. Finally, neurosymbolic reasoning integrates all of these results and selects the best solution before execution.
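The division of labor described above follows a generate, verify, select pattern, sketched here with toy stand-ins. None of these functions correspond to real models: `propose_goal` replaces a vision-language model with a constant, `sample_trajectories` replaces a diffusion model with noisy straight lines, and `is_feasible` replaces future-state prediction with a step-size check. The one-dimensional trajectory and all parameter values are assumptions made for illustration.

```python
import random
from typing import List, Optional

Trajectory = List[float]  # 1-D waypoints, a toy stand-in for a motion trajectory


def propose_goal() -> float:
    # Stand-in for a vision-language model interpreting the task goal.
    return 1.0


def sample_trajectories(goal: float, n: int, rng: random.Random) -> List[Trajectory]:
    # Stand-in for a diffusion model: noisy straight-line paths from 0 toward the goal.
    return [[goal * t / 4 + rng.gauss(0, 0.2) for t in range(5)] for _ in range(n)]


def is_feasible(traj: Trajectory, max_step: float = 0.5) -> bool:
    # Stand-in for future-state prediction: reject paths with overly large jumps.
    return all(abs(b - a) <= max_step for a, b in zip(traj, traj[1:]))


def select_best(goal: float, candidates: List[Trajectory]) -> Optional[Trajectory]:
    """Keep feasible candidates, then pick the one that ends closest to the goal."""
    feasible = [t for t in candidates if is_feasible(t)]
    if not feasible:
        return None
    return min(feasible, key=lambda t: abs(t[-1] - goal))


goal = propose_goal()
candidates = sample_trajectories(goal, n=20, rng=random.Random(0))
best = select_best(goal, candidates)
print(best)
```

Each stage can be swapped out independently, for example replacing the sampler with a stronger generative model, without touching the verification or selection logic.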

In the long term, this framework supports autonomous continuous iterative learning by robots: starting from basic small skills, robots can explore the physical world by themselves, acquire new experience, and continuously upgrade their own models without requiring humans to constantly provide new training data.

The research team has also open-sourced a closed-loop robotic agent programming framework called Retriever following this approach, for use by the entire field. It has already been validated on multiple real-world robotic tasks.

💡 Key quotes

Robots should be like humans: think first, then act—not practice millions of times to learn a skill.
Don’t learn everything from scratch from data; physical laws and common sense already exist—just use them.
For general-purpose robots to reach human level, they need to learn from one demonstration and work in new scenarios.
Any practical robot system is inherently compositional; it cannot be a pure end-to-end black box.
In the era of large models, the neurosymbolic approach still has irreplaceable research and application value.
We are far from fully exploiting the existing knowledge within large models.

Similar Articles

@seclink: https://x.com/seclink/status/2057093284330430533

@seclink: https://x.com/seclink/status/2067968283492712846

@seclink: If you want to get started with robotics, you can study NVIDIA's open-source materials. NVIDIA Isaac Sim™ is an open-source application on NVIDIA Omniverse for developing, simulating, …

@seclink: 5. Open-Source Acceleration of Robot World Models - NVIDIA Cosmos 3 + Isaac GR00T: Physical AI Foundation Models - AGIBOT Genie Sim 3.0: The First Fully Open-Source Robot Simulation Platform (Complete Open Source of Code, Data, and Assets) - VLA (Vision-…

@seclink: Robot World Models (New Dimension, 0 Deduplication = New Information) Core Projects: - Awesome-WAM (OpenMOSS): Comprehensive Paper List of World Action Models, including DreamDojo (General-Purpose Robot World Model Learned from Human Videos) - awe…
