Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
Summary
Cortex 2.0 introduces a plan-and-act control framework that uses visual latent space trajectory generation to enable reliable long-horizon robotic manipulation in complex industrial environments, outperforming reactive Vision-Language-Action models.
Source: https://huggingface.co/papers/2604.20246
Abstract
Cortex 2.0 enables reliable long-horizon robotic manipulation through plan-and-act control that generates and evaluates future trajectories in visual latent space, outperforming reactive Vision-Language-Action models in complex industrial settings.
Industrial robotic manipulation demands reliable long-horizon execution across embodiments, tasks, and changing object distributions. While Vision-Language-Action models have demonstrated strong generalization, they remain fundamentally reactive: by optimizing the next action given the current observation without evaluating potential futures, they are brittle to the compounding failure modes of long-horizon tasks. Cortex 2.0 shifts from reactive control to plan-and-act by generating candidate future trajectories in visual latent space, scoring them for expected success and efficiency, then committing only to the highest-scoring candidate. We evaluate Cortex 2.0 on a single-arm and a dual-arm manipulation platform across four tasks of increasing complexity: pick and place, item and trash sorting, screw sorting, and shoebox unpacking. Cortex 2.0 consistently outperforms state-of-the-art Vision-Language-Action baselines, achieving the best results across all tasks. The system remains reliable in unstructured environments characterized by heavy clutter, frequent occlusions, and contact-rich manipulation, where reactive policies fail. These results demonstrate that world-model-based planning can operate reliably in complex industrial environments.
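To make the control loop concrete, the Python sketch below illustrates the plan-and-act pattern the abstract describes: sample candidate action sequences, roll each one forward through a world model in latent space, score the resulting trajectories for expected success and efficiency, and commit only to the best candidate's first action. This is a minimal illustration, not the paper's implementation: the encoder, the toy linear latent dynamics, the goal-distance scoring, and all names (encode, rollout, score, plan_and_act) are assumptions made for this example.

# Illustrative sketch of a plan-and-act loop; not the authors' code.
# All components here are stand-ins chosen for this example.
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 16      # assumed size of the visual latent state
HORIZON = 8          # assumed planning horizon (steps per candidate)
NUM_CANDIDATES = 32  # assumed number of sampled candidate trajectories


def encode(observation: np.ndarray) -> np.ndarray:
    """Stand-in visual encoder: maps an observation to a latent state."""
    return observation[:LATENT_DIM]


def rollout(z0: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Stand-in world model: rolls the latent state forward under an
    action sequence and returns the trajectory of latent states."""
    traj = [z0]
    for a in actions:
        # Toy linear dynamics; the real system uses a learned visual world model.
        traj.append(0.9 * traj[-1] + 0.1 * a)
    return np.stack(traj)


def score(traj: np.ndarray, actions: np.ndarray) -> float:
    """Stand-in critic: combines expected success (closeness of the final
    latent to a goal latent) with an efficiency penalty on action effort."""
    goal = np.ones(LATENT_DIM)
    success = -np.linalg.norm(traj[-1] - goal)
    efficiency = -0.01 * float(np.square(actions).sum())
    return success + efficiency


def plan_and_act(observation: np.ndarray) -> np.ndarray:
    """Generate candidate futures in latent space, score them, and commit
    only to the first action of the highest-scoring candidate."""
    z0 = encode(observation)
    candidates = rng.normal(size=(NUM_CANDIDATES, HORIZON, LATENT_DIM))
    scores = [score(rollout(z0, acts), acts) for acts in candidates]
    best = candidates[int(np.argmax(scores))]
    return best[0]  # execute the first action, then replan


if __name__ == "__main__":
    obs = rng.normal(size=64)
    action = plan_and_act(obs)
    print("committed action:", action[:4], "...")

A deployed controller would re-encode the new observation and replan at every step (receding-horizon control); everything beyond the generate-score-commit structure here is a placeholder.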
Get this paper in your agent:
hf papers read 2604.20246
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
HY-World 2.0 is a multi-modal world model framework that generates high-fidelity 3D Gaussian Splatting scenes from text, images, and videos through specialized modules for panorama generation, trajectory planning, and scene composition, achieving state-of-the-art performance among open-source approaches.
MolmoAct2: Action Reasoning Models for Real-world Deployment
Allen AI releases MolmoAct2, an open-weight Vision-Language-Action model designed for real-world robotic deployment, featuring new datasets, an open action tokenizer, and adaptive reasoning to reduce latency.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World introduces a self-evolving training framework for general agent intelligence that autonomously discovers real-world environments and tasks via the Model Context Protocol, enabling continuous learning. Agent-World-8B and 14B models outperform strong proprietary models across 23 challenging agent benchmarks.
Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine-Tuning, and On-Device Optimizations
NXP and Hugging Face demonstrate techniques for deploying Vision-Language-Action (VLA) models on embedded robotic platforms, covering dataset recording best practices, VLA fine-tuning, and on-device optimizations including quantization and asynchronous inference scheduling for the i.MX 95 processor.
Plan online, learn offline: Efficient learning and exploration via model-based control
OpenAI proposes POLO (Plan Online, Learn Offline), a framework combining model-based control with value function learning and coordinated exploration to enable efficient learning on complex control tasks like humanoid locomotion and dexterous manipulation with minimal real-world experience.