Actionable World Representation

Hugging Face Daily Papers 05/18/26, 12:00 AM Papers

world-models object-representation neural-architecture point-clouds rgb-d-video digital-twin policy-learning

Summary

WorldString is a neural architecture that models object state manifolds from point clouds or RGB-D video streams, serving as a foundational component for physical world models with differentiable structure for policy learning integration.

Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with an emphasis on modeling the physical world. Within the scope of physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states via either video generation or dynamic scene reconstruction, none explicitly models this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Conveniently, its fully differentiable structure enables future integration with policy learning and neural dynamics.

Original Article

Cached at: 05/19/26, 06:30 AM

Source: https://huggingface.co/papers/2605.18743

Abstract

Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with an emphasis on modeling the physical world. Within the scope of physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states via either video generation or dynamic scene reconstruction, none explicitly models this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Conveniently, its fully differentiable structure enables future integration with policy learning and neural dynamics.

Similar Articles

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision

WALL-WM: Carving World Action Modeling at the Event Joints

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

World Machine: Towards Generative World Modeling for Time-Series
