GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors
Summary
GRAIL generates diverse humanoid manipulation and locomotion data using 3D assets and video foundation models, enabling effective sim-to-real transfer for humanoid robot control with high real-world success rates.
Source: https://huggingface.co/papers/2606.05160
Abstract
GRAIL generates diverse humanoid manipulation and locomotion data through 3D asset composition and video foundation models, enabling effective sim-to-real transfer for robot control.
Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.
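The abstract's point about reduced depth ambiguity can be illustrated with a standard pinhole camera model. The sketch below is not GRAIL's code: the `PinholeCamera` class, the `project` and `backproject` helpers, and all numeric values are hypothetical, chosen only to show why a pixel alone does not determine a 3D point while a pixel plus known intrinsics and metric depth does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PinholeCamera:
    """Hypothetical pinhole intrinsics: focal lengths and principal point, in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def project(cam: PinholeCamera, point: tuple) -> tuple:
    """Project a camera-frame 3D point (metres, z > 0) to pixel coordinates."""
    x, y, z = point
    return (cam.fx * x / z + cam.cx, cam.fy * y / z + cam.cy)


def backproject(cam: PinholeCamera, u: float, v: float, depth_m: float) -> tuple:
    """Lift pixel (u, v) with known metric depth to a camera-frame 3D point."""
    x = (u - cam.cx) * depth_m / cam.fx
    y = (v - cam.cy) * depth_m / cam.fy
    return (x, y, depth_m)


# Illustrative values only.
cam = PinholeCamera(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Two points on the same viewing ray: one at 2 m, one at 4 m.
near = (0.4, 0.0, 2.0)
far = (0.8, 0.0, 4.0)

# Both land on the same pixel, so the image alone cannot tell them apart.
pixel_near = project(cam, near)
pixel_far = project(cam, far)

# With the metric depth known in advance, the 3D point is recovered uniquely.
recovered = backproject(cam, pixel_near[0], pixel_near[1], depth_m=2.0)
```

Here `near` and `far` project to the same pixel, which is the ambiguity a monocular reconstruction has to resolve. Supplying the depth picks out `near` again.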
Similar Articles
OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation
OASIS is a simulation-data-driven framework for humanoid loco-manipulation that uses 3D generative models and hierarchical visuomotor policies. It achieves better zero-shot performance than real-robot training by leveraging domain randomization in simulation.
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI introduces a framework that turns text-conditioned synthetic videos into physically plausible dexterous robot control via a hybrid 3D-2D tracking reward, enabling zero-shot generalization to unseen objects.
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
PhyMotion proposes a physics-grounded reward system that evaluates kinematic plausibility, contact consistency, and dynamic feasibility of human motion in generated videos, achieving stronger correlation with human judgment and improving motion realism in RL-based post-training.
VideoMDM: Towards 3D Human Motion Generation From 2D Supervision
VideoMDM trains 3D human motion priors from 2D poses using a diffusion framework with 2D reprojection loss and 3D motion regularizers, achieving near-3D supervised performance without requiring 3D ground truth.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines particularly in long-horizon tasks and fine-grained manipulation.