WorldVLA: Towards Autoregressive Action World Model

June 26, 2025
Authors: Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, Hao Chen
cs.AI

Abstract

We present WorldVLA, an autoregressive action world model that unifies action and image understanding and generation. WorldVLA integrates a Vision-Language-Action (VLA) model and a world model in a single framework. The world model predicts future images by leveraging both action and image understanding, with the purpose of learning the underlying physics of the environment to improve action generation. Meanwhile, the action model generates subsequent actions based on image observations, aiding visual understanding and, in turn, the visual generation of the world model. We demonstrate that WorldVLA outperforms standalone action and world models, highlighting the mutual enhancement between the world model and the action model. In addition, we find that the performance of the action model deteriorates when generating sequences of actions in an autoregressive manner. This phenomenon can be attributed to the model's limited generalization capability for action prediction, which allows errors from earlier actions to propagate to subsequent ones. To address this issue, we propose an attention mask strategy that selectively masks prior actions during the generation of the current action, yielding a significant performance improvement on the action chunk generation task.
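To make the attention mask strategy concrete, below is a minimal PyTorch sketch of how such a mask could be built for a flattened token sequence laid out as [observation tokens, action 1 tokens, action 2 tokens, ...]. The function name, sequence layout, and token counts are illustrative assumptions for this sketch, not the paper's released implementation.

```python
import torch

def action_chunk_attention_mask(num_obs_tokens: int,
                                num_actions: int,
                                tokens_per_action: int) -> torch.Tensor:
    """Build a boolean attention mask (True = may attend) for a sequence
    laid out as [obs tokens, action_1 tokens, action_2 tokens, ...].

    Plain causal masking would let action k attend to actions 1..k-1,
    so an error in an early action can propagate through the chunk.
    The strategy sketched here instead blocks attention to earlier
    action tokens, so each action is conditioned only on the image
    observation prefix and its own preceding tokens.
    """
    seq_len = num_obs_tokens + num_actions * tokens_per_action
    # Start from an ordinary lower-triangular causal mask.
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
    for k in range(num_actions):
        start = num_obs_tokens + k * tokens_per_action
        end = start + tokens_per_action
        # Block attention from action k's tokens to all earlier action
        # tokens, while keeping the observation prefix visible.
        mask[start:end, num_obs_tokens:start] = False
    return mask

# Example: 4 observation tokens, a chunk of 3 actions with 2 tokens each.
m = action_chunk_attention_mask(num_obs_tokens=4, num_actions=3, tokens_per_action=2)
print(m.int())
```

Under this masking, each action in the chunk is generated as if it were the first, which matches the abstract's claim that withholding (possibly erroneous) earlier actions prevents error propagation during autoregressive action chunk generation.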