Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Abstract

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention, trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers, which are constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that the model establishes a new performance frontier, combining robust zero-shot generalization to unseen tasks with tracking of highly dynamic, complex motions.
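
As an illustration of the causal attention named above, the following is a minimal sketch of single-head scaled dot-product attention with a causal mask over a sequence of motion frames. It is not the Humanoid-GPT implementation: the function name `causal_attention`, the list-of-vectors input layout, and the toy frame values are all assumptions made for this example, and the learned projections, multiple heads, and stacked layers of a real GPT-style model are omitted.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of T vectors (lists of floats), one per frame.
    Output frame t is a weighted average of values[0..t] only.
    Hypothetical helper for illustration, not taken from the paper.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Causal mask: score only frames 0..t, never future frames.
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy example: 3 frames with 2-dimensional features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_attention(frames, frames, frames)
```

Because frame t attends only to frames 0 through t, changing a later frame leaves every earlier output unchanged. This is the property that lets a causally masked model be run step by step on a stream of incoming reference frames.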