3D-VLA: A 3D Vision-Language-Action Generative World Model

March 14, 2024
Authors: Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan
cs.AI

Abstract

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.
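The abstract only sketches the architecture at a high level: a 3D-based LLM backbone, a set of interaction tokens for engaging with the embodied environment, and embodied diffusion models aligned to the LLM to predict goal images and point clouds. As a rough illustration of the kind of design this implies, below is a minimal PyTorch sketch, not the authors' implementation: a toy transformer stands in for the 3D-based LLM, its vocabulary is extended with invented interaction tokens, and simple MLPs stand in for the diffusion decoders that would be conditioned on the hidden states at those tokens. All names, token choices, and dimensions here are assumptions made for illustration.

```python
# Hypothetical sketch of a 3D-VLA-style pipeline. Not the authors' code:
# the backbone, token set, sizes, and MLP "decoders" are all invented stand-ins.
import torch
import torch.nn as nn

VOCAB = 32000
# Interaction tokens appended to the base vocabulary (assumed token names).
INTERACTION_TOKENS = {
    "<obj>": VOCAB, "</obj>": VOCAB + 1,
    "<goal_img>": VOCAB + 2, "<goal_pcd>": VOCAB + 3, "<act>": VOCAB + 4,
}
FULL_VOCAB = VOCAB + len(INTERACTION_TOKENS)
D_MODEL = 512


class Tiny3DLLM(nn.Module):
    """Toy stand-in for the 3D-based LLM backbone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(FULL_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, FULL_VOCAB)

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))
        return hidden, self.lm_head(hidden)


class GoalDecoder(nn.Module):
    """Stand-in for an embodied diffusion decoder aligned to the LLM:
    maps the hidden state at a query token to a goal image / point cloud.
    A real system would run a diffusion sampler conditioned on this state."""
    def __init__(self, out_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(D_MODEL, D_MODEL), nn.GELU(), nn.Linear(D_MODEL, out_dim)
        )

    def forward(self, cond):
        return self.proj(cond)


if __name__ == "__main__":
    llm = Tiny3DLLM()
    img_decoder = GoalDecoder(out_dim=3 * 32 * 32)   # toy 32x32 RGB goal image
    pcd_decoder = GoalDecoder(out_dim=1024 * 3)      # toy 1024-point goal cloud

    # A prompt that ends with query tokens requesting a goal image and point cloud.
    prompt = torch.randint(0, VOCAB, (1, 16))
    query = torch.tensor([[INTERACTION_TOKENS["<goal_img>"],
                           INTERACTION_TOKENS["<goal_pcd>"]]])
    ids = torch.cat([prompt, query], dim=1)

    hidden, logits = llm(ids)
    goal_img = img_decoder(hidden[:, -2])  # condition on the <goal_img> hidden state
    goal_pcd = pcd_decoder(hidden[:, -1])  # condition on the <goal_pcd> hidden state
    print(goal_img.shape, goal_pcd.shape)  # (1, 3072) and (1, 3072)
```

The point of the sketch is only the routing: special interaction tokens let the language model mark where generative predictions are needed, and the hidden states at those positions are what the (here heavily simplified) goal-image and point-cloud decoders condition on.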
