Humanoid Locomotion as Next Token Prediction

February 29, 2024
作者: Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik
cs.AI

Abstract

We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.
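To make the modality-aligned objective concrete, below is a minimal PyTorch sketch, not the authors' code: observation and action vectors are projected into a shared token space, interleaved into one causal sequence, and each token's hidden state is decoded back into the next token of its own modality. An `act_valid` flag illustrates how action-free trajectories (e.g., from video) can still contribute to the observation loss. All module names, dimensions, and hyperparameters here are illustrative assumptions, not values from the paper.

```python
# Sketch of modality-aligned next-token prediction over interleaved
# observation/action tokens. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SensorimotorGPT(nn.Module):
    def __init__(self, obs_dim=64, act_dim=32, d_model=256,
                 n_layers=4, n_heads=4, max_steps=512):
        super().__init__()
        # Per-modality projections into a shared token space.
        self.obs_in = nn.Linear(obs_dim, d_model)
        self.act_in = nn.Linear(act_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, 2 * max_steps, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Modality-aligned heads: an observation token predicts the next
        # observation; an action token predicts the next action.
        self.obs_out = nn.Linear(d_model, obs_dim)
        self.act_out = nn.Linear(d_model, act_dim)

    def forward(self, obs, act):
        # obs: (B, T, obs_dim), act: (B, T, act_dim),
        # interleaved along time as o_1, a_1, o_2, a_2, ...
        B, T, _ = obs.shape
        tokens = torch.stack([self.obs_in(obs), self.act_in(act)], dim=2)
        tokens = tokens.reshape(B, 2 * T, -1) + self.pos[:, : 2 * T]
        # Causal mask so each token attends only to its past.
        mask = torch.triu(torch.full((2 * T, 2 * T), float("-inf")), diagonal=1)
        h = self.backbone(tokens, mask=mask).reshape(B, T, 2, -1)
        return self.obs_out(h[:, :, 0]), self.act_out(h[:, :, 1])

# Training step (sketch): the prediction at step t is regressed against the
# same modality at step t+1. For trajectories without recorded actions,
# act_valid zeroes out the action term so the data still trains the model.
model = SensorimotorGPT()
obs, act = torch.randn(8, 16, 64), torch.randn(8, 16, 32)
act_valid = 1.0  # set to 0.0 for action-free (e.g., video-derived) data
pred_obs, pred_act = model(obs, act)
loss = F.mse_loss(pred_obs[:, :-1], obs[:, 1:]) \
     + act_valid * F.mse_loss(pred_act[:, :-1], act[:, 1:])
loss.backward()
```

In this sketch the interleaving is what lets a single causal backbone condition on both modalities while the separate output heads keep the prediction targets modality-aligned, matching the formulation described in the abstract at a high level.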