
Humanoid Locomotion as Next Token Prediction

February 29, 2024
Authors: Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik
cs.AI

Abstract

We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.
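The modality-aligned objective described above lends itself to a short illustration. Below is a minimal sketch, not the authors' released code, of how a causal transformer could be trained on interleaved observation/action tokens so that each input token predicts the next token of its own modality, with an `action_mask` dropping the action loss for trajectories that have no actions (e.g. video-derived data). The dimensions, the tiny transformer configuration, the continuous-token MSE objective, and the `SensorimotorTransformer` / `modality_aligned_loss` names are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch (assumed, not the authors' implementation) of modality-aligned
# next-token prediction over interleaved observation/action trajectories.
import torch
import torch.nn as nn


class SensorimotorTransformer(nn.Module):
    def __init__(self, obs_dim=32, act_dim=8, d_model=128, n_layers=4, n_heads=4, max_len=512):
        super().__init__()
        self.obs_in = nn.Linear(obs_dim, d_model)   # embed observation tokens
        self.act_in = nn.Linear(act_dim, d_model)   # embed action tokens
        self.pos = nn.Embedding(max_len, d_model)   # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.obs_head = nn.Linear(d_model, obs_dim)  # predicts the next observation token
        self.act_head = nn.Linear(d_model, act_dim)  # predicts the next action token

    def forward(self, obs, act):
        # obs: (B, T, obs_dim), act: (B, T, act_dim); interleave as o1, a1, o2, a2, ...
        B, T, _ = obs.shape
        tokens = torch.stack([self.obs_in(obs), self.act_in(act)], dim=2).reshape(B, 2 * T, -1)
        tokens = tokens + self.pos(torch.arange(2 * T, device=obs.device))
        # Causal mask: each position may only attend to earlier positions.
        causal = torch.triu(torch.full((2 * T, 2 * T), float("-inf"), device=obs.device), diagonal=1)
        h = self.backbone(tokens, mask=causal)
        h_obs, h_act = h[:, 0::2], h[:, 1::2]        # hidden states at obs / act positions
        return self.obs_head(h_obs), self.act_head(h_act)


def modality_aligned_loss(model, obs, act, action_mask):
    # Modality-aligned prediction: o_t predicts o_{t+1}, a_t predicts a_{t+1}.
    # action_mask (B,) is 0 for trajectories with missing actions (e.g. video-only),
    # so their action loss is masked out while the observation loss is still used.
    obs_pred, act_pred = model(obs, act)
    obs_loss = ((obs_pred[:, :-1] - obs[:, 1:]) ** 2).mean()
    act_err = ((act_pred[:, :-1] - act[:, 1:]) ** 2).mean(dim=(1, 2))
    act_loss = (act_err * action_mask).sum() / action_mask.sum().clamp(min=1)
    return obs_loss + act_loss
```

Masking the action loss per trajectory is what lets a single model consume both complete sensorimotor trajectories and action-free video trajectories within the same training batch.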