GEM: Generative Supervision Advances Embodied Intelligence

Abstract

Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantics emphasized by standard text-guided pre-training paradigms and the low-level spatial and physical knowledge needed for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this gap. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, with gains in both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset that mixes grounding, reasoning, and planning data and pairs it with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, shows markedly stronger task execution than prior approaches in both simulation environments and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/
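To make the idea of training a depth map generation objective jointly with the main model more concrete, here is a minimal sketch of one way such a combined loss could be formed. It is an illustration under stated assumptions, not the GEM implementation: the abstract does not specify the loss functions or their weighting, so the cross-entropy language term, the L1 depth term, the function names, and the `depth_weight` parameter are all hypothetical.

```python
import math

def token_cross_entropy(logits, target):
    """Cross-entropy of one target index under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def depth_l1(pred, gt):
    """Mean absolute error between predicted and reference depth values."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def joint_loss(token_logits, token_targets, depth_pred, depth_gt, depth_weight=0.5):
    """Language loss plus a weighted depth-generation loss.

    Assumption: the additive form and the 0.5 default weight are
    illustrative choices, not values taken from the paper.
    """
    text = sum(
        token_cross_entropy(l, t) for l, t in zip(token_logits, token_targets)
    ) / len(token_targets)
    depth = depth_l1(depth_pred, depth_gt)
    return text + depth_weight * depth, text, depth

# Toy example: one text token over a two-word vocabulary, two depth pixels.
total, text, depth = joint_loss(
    token_logits=[[0.0, 0.0]],
    token_targets=[0],
    depth_pred=[1.0, 2.0],
    depth_gt=[1.0, 3.0],
)
```

In a sketch like this, both terms are averaged before being combined, so the hypothetical `depth_weight` controls how strongly the depth supervision influences the shared model relative to the language objective.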