Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking
Summary
Humanoid-GPT is a GPT-style Transformer pre-trained on a billion-scale motion corpus, achieving zero-shot generalization for whole-body motion tracking across unseen motions and tasks.
Source: https://huggingface.co/papers/2606.03985 Published on Jun 2
#3 Paper of the day
Abstract
Humanoid-GPT is a GPT-style Transformer with causal attention trained on a billion-scale motion corpus that achieves zero-shot generalization to unseen motions and control tasks through scalable pre-training on diverse motion data.
We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.
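The abstract does not give architectural details beyond "GPT-style Transformer with causal attention". As a rough illustration of what causal attention means for a sequence of motion frames, the sketch below computes single-head self-attention in which frame t can only attend to frames 0..t. It is not the paper's implementation: the function name `causal_self_attention`, the use of raw frame vectors as queries, keys, and values (no learned projections), and the toy two-dimensional frames are all assumptions made for illustration.

```python
import math


def causal_self_attention(frames):
    """Single-head causal self-attention over a list of frame vectors.

    Illustrative assumption: queries, keys, and values are the raw frame
    vectors themselves (a real model would use learned projections).
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t, query in enumerate(frames):
        # Causal mask: frame t may only attend to frames 0..t.
        visible = frames[: t + 1]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in visible]
        # Numerically stable softmax over the visible scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the attention-weighted average of the visible frames.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, visible)) for i in range(dim)]
        )
    return outputs


# Toy "motion" of three 2-D frames (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_self_attention(frames)
```

Because of the mask, changing a later frame leaves the outputs for earlier frames unchanged, which is the property that lets a model of this kind be trained to predict the next step from past context only.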
Similar Articles
Gait2Hip-60: A Unified Deep Learning Benchmark for Predicting Hip Muscle Forces and Joint Moments from Multi-Cadence Gait Kinematics
This paper introduces Gait2Hip-60, a benchmark dataset and deep learning framework for predicting hip muscle forces and joint moments from gait kinematics, comparing LSTM, Transformer, and Mamba models. Transformer achieved the best performance, with moderate zero-shot generalization to pathological gait.
Image GPT
OpenAI's Image GPT (iGPT) applies GPT-2 transformers to pixel sequences for image generation and classification, demonstrating that the same architecture used for language can learn coherent visual features in an unsupervised manner and achieve competitive performance on image classification benchmarks.
HumanNet: Scaling Human-centric Video Learning to One Million Hours
HumanNet is a large-scale human-centric video dataset with one million hours of annotated footage, designed to train vision-language-action models. It demonstrates that egocentric human video can effectively replace robot data for embodied intelligence tasks.
Better language models and their implications
OpenAI introduces GPT-2, a 1.5 billion parameter transformer-based language model trained on 40GB of internet text that achieves state-of-the-art performance on language modeling benchmarks and demonstrates zero-shot capabilities in reading comprehension, translation, question answering, and summarization. Due to safety concerns, only a smaller model and technical paper are released publicly rather than the full trained model.
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
PhyMotion proposes a physics-grounded reward system that evaluates kinematic plausibility, contact consistency, and dynamic feasibility of human motion in generated videos, achieving stronger correlation with human judgment and improving motion realism in RL-based post-training.