Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Hugging Face Daily Papers 06/02/26, 12:00 AM Papers

humanoid motion-tracking zero-shot transformer pre-training whole-body-control scaling

Summary

Humanoid-GPT is a GPT-style Transformer pre-trained on a billion-scale motion corpus, achieving zero-shot generalization for whole-body motion tracking across unseen motions and tasks.

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.
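The abstract names causal attention as the core structural choice but gives no architectural details. The following is a minimal sketch of what causal self-attention over a sequence of motion frames means, not the paper's implementation. The frame feature vectors, the identity query/key/value projections, and the function names `causal_attention_weights` and `causal_attention` are assumptions made for illustration only.

```python
import math


def causal_attention_weights(queries, keys):
    """Attention weights in which frame t attends only to frames s <= t."""
    dim = len(queries[0])
    weights = []
    for t, q in enumerate(queries):
        # Scaled dot-product scores against the current and past frames only.
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        # Future frames (s > t) get exactly zero weight: this is the causal mask.
        weights.append([e / total for e in exps] + [0.0] * (len(keys) - t - 1))
    return weights


def causal_attention(frames):
    """Single-head self-attention with identity projections (a simplification)."""
    weights = causal_attention_weights(frames, frames)
    dim = len(frames[0])
    return [
        [sum(w * frames[s][d] for s, w in enumerate(row)) for d in range(dim)]
        for row in weights
    ]


# Hypothetical 2-D features for four consecutive motion frames.
frames = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [2.0, -1.0]]
out = causal_attention(frames)

# Replacing the last frame must not change the outputs for earlier frames.
perturbed = frames[:3] + [[9.0, 9.0]]
out_perturbed = causal_attention(perturbed)
```

Because each output depends only on the current and earlier frames, a model built this way can be run step by step at control time without access to future reference motion, which is the property the causal mask is meant to illustrate here.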

Cached at: 06/03/26, 03:35 AM

Source: https://huggingface.co/papers/2606.03985

Published on Jun 2

#3 Paper of the day

Abstract

Humanoid-GPT is a GPT-style Transformer with causal attention trained on a billion-scale motion corpus that achieves zero-shot generalization to unseen motions and control tasks through scalable pre-training on diverse motion data.

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.

Similar Articles

Gait2Hip-60: A Unified Deep Learning Benchmark for Predicting Hip Muscle Forces and Joint Moments from Multi-Cadence Gait Kinematics

Image GPT

HumanNet: Scaling Human-centric Video Learning to One Million Hours

Better language models and their implications

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
