Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Hugging Face Daily Papers

Summary

Humanoid-GPT is a GPT-style Transformer pre-trained on a billion-scale motion corpus, achieving zero-shot generalization for whole-body motion tracking across unseen motions and tasks.

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.

Cached at: 06/03/26, 03:35 AM

Source: https://huggingface.co/papers/2606.03985

Published on Jun 2

#3 Paper of the day

Abstract

Humanoid-GPT is a GPT-style Transformer with causal attention trained on a billion-scale motion corpus that achieves zero-shot generalization to unseen motions and control tasks through scalable pre-training on diverse motion data.

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.
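
The abstract names causal attention but does not describe it further. As a rough illustration only, the sketch below shows what a causal mask means for a sequence of motion frames: the output at frame t depends on frames 0..t and never on later ones. This is not the authors' implementation. The function name `causal_attention`, the single-head setup, the absence of learned projections, and the toy frame values are all assumptions made for the example.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of T vectors (lists of floats), one per frame.
    Output frame t is a weighted average of values[0..t] only.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Score the query against frames 0..t; later frames are never scored,
        # which has the same effect as masking them out.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy sequence: 4 frames with 2 features each (hypothetical numbers).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out = causal_attention(frames, frames, frames)

# Editing the last frame must leave the outputs for earlier frames unchanged.
edited = [row[:] for row in frames]
edited[3] = [9.0, 9.0]
out_edited = causal_attention(edited, edited, edited)
assert out[:3] == out_edited[:3]
```

The final assertion checks the property that makes a model of this kind usable frame by frame: changing a future frame leaves every earlier output unchanged.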

Similar Articles

Image GPT

OpenAI Blog

OpenAI's Image GPT (iGPT) applies GPT-2 transformers to pixel sequences for image generation and classification, demonstrating that the same architecture used for language can learn coherent visual features in an unsupervised manner and achieve competitive performance on image classification benchmarks.

HumanNet: Scaling Human-centric Video Learning to One Million Hours

Hugging Face Daily Papers

HumanNet is a large-scale human-centric video dataset with one million hours of annotated footage, designed to train vision-language-action models. It demonstrates that egocentric human video can effectively replace robot data for embodied intelligence tasks.

Better language models and their implications

OpenAI Blog

OpenAI introduces GPT-2, a 1.5 billion parameter transformer-based language model trained on 40GB of internet text that achieves state-of-the-art performance on language modeling benchmarks and demonstrates zero-shot capabilities in reading comprehension, translation, question answering, and summarization. Due to safety concerns, only a smaller model and technical paper are released publicly rather than the full trained model.