Humanoid Locomotion as Next Token Prediction

Abstract

We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way: for each input token, we predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, such as video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training, such as walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.
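
To make the modality-aligned formulation concrete, here is a minimal sketch in plain Python of how prediction targets could be paired within an interleaved trajectory. It is an illustration under stated assumptions, not the authors' implementation: the token layout (alternating `"obs"` and `"act"` tokens), the use of `None` to stand for a missing modality, and the helper names `aligned_targets` and `loss_mask` are all hypothetical. The idea shown is only that each token is paired with the next token of the same modality, and that positions whose target is missing contribute no loss term.

```python
from typing import List, Optional, Tuple

# Hypothetical token representation: (modality, value).
# A value of None stands for a missing modality, e.g. actions in video-only data.
Token = Tuple[str, Optional[float]]


def aligned_targets(tokens: List[Token]) -> List[Optional[int]]:
    """For each position, return the index of the next token of the same
    modality, or None if no later token of that modality exists."""
    next_seen = {}  # modality -> index of the nearest later token
    targets: List[Optional[int]] = [None] * len(tokens)
    # Walk backward so next_seen always holds the closest later occurrence.
    for i in range(len(tokens) - 1, -1, -1):
        modality = tokens[i][0]
        targets[i] = next_seen.get(modality)
        next_seen[modality] = i
    return targets


def loss_mask(tokens: List[Token], targets: List[Optional[int]]) -> List[bool]:
    """True where a loss term would be computed: the target exists and its
    value is present (assumption: missing targets are simply skipped)."""
    return [t is not None and tokens[t][1] is not None for t in targets]


# A complete trajectory: observations and actions interleaved.
full = [("obs", 0.1), ("act", 1.0), ("obs", 0.2),
        ("act", 2.0), ("obs", 0.3), ("act", 3.0)]

# A video-like trajectory: same layout, but the actions are missing.
video = [(m, v if m == "obs" else None) for m, v in full]

print(aligned_targets(full))                      # target index per position
print(loss_mask(video, aligned_targets(video)))   # which positions are trained
```

In this sketch the video-like trajectory still supplies training signal for the observation tokens, which is the sense in which data with missing modalities remains usable.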
