Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Abstract

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention, trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers, which are constrained by scarce data and a trade-off between agility and generalization, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that the model establishes a new performance frontier on both fronts.