GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Hugging Face Daily Papers 06/03/26, 12:00 AM Papers

Summary

GRAIL generates diverse humanoid manipulation and locomotion data using 3D assets and video foundation models, enabling effective sim-to-real transfer for humanoid robot control with high real-world success rates.

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.
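
To illustrate why known camera parameters and metric depth reduce depth ambiguity, here is a minimal sketch of the standard pinhole camera model. It is not GRAIL's code: the `PinholeCamera` class, the function names, and the intrinsics values are all hypothetical and chosen only for illustration. The sketch shows that a single pixel is consistent with many 3D points along one ray, and that a known depth value selects exactly one of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PinholeCamera:
    """Hypothetical intrinsics: focal lengths and principal point, in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def project(cam: PinholeCamera, point: tuple[float, float, float]) -> tuple[float, float]:
    """Project a camera-frame 3D point (meters) to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (cam.fx * x / z + cam.cx, cam.fy * y / z + cam.cy)


def backproject(cam: PinholeCamera, u: float, v: float, depth_m: float) -> tuple[float, float, float]:
    """Lift a pixel to a camera-frame 3D point using a known metric depth."""
    x = (u - cam.cx) / cam.fx * depth_m
    y = (v - cam.cy) / cam.fy * depth_m
    return (x, y, depth_m)


# Illustrative values only; these are assumptions, not numbers from the paper.
cam = PinholeCamera(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Without depth, the pixel (420, 240) is ambiguous: both points below
# lie on the same viewing ray and project to the same pixel.
near = backproject(cam, 420.0, 240.0, depth_m=1.0)
far = backproject(cam, 420.0, 240.0, depth_m=2.0)
print(project(cam, near), project(cam, far))

# With depth known from the scene, the 3D point is determined uniquely.
print(far)
```

When the scene is specified in advance, as the abstract describes, the depth and intrinsics in a computation like this are given rather than estimated, which is one plausible reading of why the setup "better conditions" reconstruction.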

Original Article

Cached at: 06/04/26, 03:41 AM

Source: https://huggingface.co/papers/2606.05160 Authors:

Get this paper in your agent:

hf papers read 2606.05160

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation

DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
