WorldVLA: Towards Autoregressive Action World Model

Abstract

We present WorldVLA, an autoregressive action world model that unifies action and image understanding and generation. WorldVLA integrates a Vision-Language-Action (VLA) model and a world model in a single framework. The world model predicts future images by leveraging both action and image understanding, with the aim of learning the underlying physics of the environment to improve action generation. Meanwhile, the action model generates subsequent actions based on image observations, which aids visual understanding and in turn helps the world model's visual generation. We demonstrate that WorldVLA outperforms standalone action and world models, highlighting the mutual enhancement between the world model and the action model. In addition, we find that the performance of the action model deteriorates when it generates sequences of actions autoregressively. This phenomenon can be attributed to the model's limited generalization capability for action prediction, which causes errors in earlier actions to propagate to subsequent ones. To address this issue, we propose an attention mask strategy that selectively masks prior actions during the generation of the current action, which yields a significant performance improvement on the action chunk generation task.
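
The attention mask strategy can be illustrated with a minimal sketch. The abstract does not specify the token layout, so the following is an assumption rather than the authors' implementation: each token carries a hypothetical segment label (0 for text or image context, k for the k-th action in a chunk), context tokens use an ordinary causal mask, and action tokens may attend to context and to earlier tokens of their own action but not to any prior action. The function name `build_action_mask` and the labelling scheme are invented for illustration.

```python
from typing import List


def build_action_mask(segments: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    `segments` is a hypothetical labelling (an assumption of this sketch):
    0 marks a context token (text or image), and k >= 1 marks a token that
    belongs to the k-th action of the chunk.
    """
    n = len(segments)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: keys at positions > i stay masked
            if segments[i] == 0:
                # Context queries keep the plain causal mask.
                mask[i][j] = True
            else:
                # Action queries see context and their own action only,
                # so tokens of earlier actions are hidden from them.
                mask[i][j] = segments[j] == 0 or segments[j] == segments[i]
    return mask


if __name__ == "__main__":
    # Two context tokens followed by two actions of two tokens each.
    segments = [0, 0, 1, 1, 2, 2]
    for row in build_action_mask(segments):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this layout, the tokens of the second action never attend to the tokens of the first, so an error in an earlier action cannot reach later ones through attention; every action in the chunk is conditioned on the text and image context alone.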

