Actionable World Representation
Summary
WorldString is a neural architecture that models object state manifolds from point clouds or RGB-D video streams, serving as a foundational component for physical world models with differentiable structure for policy learning integration.
Source: https://huggingface.co/papers/2605.18743
Abstract
Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with an emphasis on modeling the physical world. Within the scope of physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Its fully differentiable structure also enables seamless future integration with policy learning and neural dynamics.
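To make the ideas of a "state manifold" and a "differentiable structure" concrete, here is a toy sketch. It is not the WorldString architecture, which the abstract does not specify. It assumes a hypothetical 2D hinged lid whose only state is an opening angle `theta`, with known point correspondences between the rest pose and the observation. Because every point position is a differentiable function of `theta`, the state can be recovered from an observed point set by plain gradient descent. All names (`lid_points`, `loss_and_grad`, `fit_state`) are illustrative assumptions.

```python
import math

def lid_points(theta, rest_points):
    """Map a state (hinge angle) to a 2D point set by rotating the
    rest-pose points about the hinge at the origin. The set of outputs
    over all theta is this toy object's one-dimensional state manifold."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in rest_points]

def loss_and_grad(theta, rest_points, observed):
    """Mean squared point distance (known correspondences, an assumption)
    and its analytic derivative with respect to theta."""
    c, s = math.cos(theta), math.sin(theta)
    loss, grad = 0.0, 0.0
    for (x, y), (ox, oy) in zip(rest_points, observed):
        px, py = c * x - s * y, s * x + c * y      # predicted point
        dpx, dpy = -s * x - c * y, c * x - s * y   # d(predicted point)/d(theta)
        rx, ry = px - ox, py - oy                  # residual
        loss += rx * rx + ry * ry
        grad += 2.0 * (rx * dpx + ry * dpy)
    n = len(rest_points)
    return loss / n, grad / n

def fit_state(rest_points, observed, theta0=0.0, lr=0.5, steps=200):
    """Recover the state by gradient descent on the point loss."""
    theta = theta0
    for _ in range(steps):
        _, g = loss_and_grad(theta, rest_points, observed)
        theta -= lr * g
    return theta

# Usage: observe the lid opened to 0.7 rad, then estimate the angle.
rest = [(0.2 * i, 0.0) for i in range(1, 6)]
observed = lid_points(0.7, rest)
estimate = fit_state(rest, observed)
print(f"estimated angle: {estimate:.4f} rad")
```

The same gradient path that recovers the state here is what would let a policy or dynamics model be trained through an object representation; a learned model would replace the hand-written rotation with a neural decoder and would not assume known correspondences.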
Similar Articles
Light-WAM: Efficient World Action Models with State-Fusion Action Decoding
Light-WAM is a lightweight world action model for efficient robot manipulation that uses a compact video backbone and downsampled latent space for future-video supervision, achieving high performance with low inference latency.
Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision
This paper demonstrates that training a world model through random physical exploration leads to latent representations that encode spatial semantic structure (direction and position) without any linguistic supervision, highlighting physical geometry as the organizing principle.
WALL-WM: Carving World Action Modeling at the Event Joints
WALL-WM advances video-action learning by using semantic events as learning units instead of fixed action chunks, enabling more flexible and scalable vision-language-action training and inference.
World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis
This paper introduces World-Language-Action (WLA) models, embodied foundation models that jointly predict textual subtasks, subgoal images, and robot actions from text, images, and robot states, achieving state-of-the-art multi-task and long-horizon learning in simulated and real-world environments.
World Machine: Towards Generative World Modeling for Time-Series
World Machine proposes a transformer-based generative world modeling architecture for time series that uses latent states to adapt to varying context lengths, addressing the quadratic memory cost of traditional transformers. Experiments on a synthetic dataset validate its feasibility and show improvements over conventional transformers.