GEM: Generative Supervision Boosts Embodied Intelligence

Abstract

Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action (VLA) frameworks. However, a significant gap remains between the high-level semantic understanding emphasized by standard text-guided pre-training and the low-level spatial and physical knowledge needed to execute tasks in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this gap. We integrate a depth map generation task directly into the VLM pre-training phase. Training this generative objective jointly with the main model yields substantial improvements in embodied intelligence, enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a large-scale dataset that mixes grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments show that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits superior task execution in both simulation and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/
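
As a rough illustration of what training a depth generation objective jointly with the main model could look like, the sketch below adds a weighted depth term to a standard next-token text loss. Everything here is an assumption made for illustration: the abstract does not specify the depth loss, the weighting, or the architecture, so the L1 depth term, the `depth_weight` parameter, and all function names are hypothetical and may differ from the released GEM code.

```python
import math
from typing import Sequence


def token_cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one token, computed with a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def depth_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and reference depth values.

    Assumption: L1 regression stands in for whatever generative depth
    loss the actual model uses.
    """
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    token_logits: Sequence[Sequence[float]],
    token_targets: Sequence[int],
    depth_pred: Sequence[float],
    depth_target: Sequence[float],
    depth_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Text loss plus a weighted depth-supervision term."""
    text = sum(
        token_cross_entropy(l, t) for l, t in zip(token_logits, token_targets)
    ) / len(token_targets)
    depth = depth_l1(depth_pred, depth_target)
    return text + depth_weight * depth


if __name__ == "__main__":
    # Toy example: one token over a 2-word vocabulary, two depth pixels.
    loss = joint_loss([[0.0, 0.0]], [0], [1.0, 2.0], [1.0, 3.0])
    print(f"joint loss: {loss:.4f}")
```

In a sketch like this, a single scalar weight controls how strongly the low-level spatial signal shapes the shared representation relative to the text objective; how GEM actually balances the two is not stated in the abstract.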