GEM: Generative Supervision Supports Embodied Intelligence

Abstract

Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this gap. We propose integrating a depth map generation task directly into the VLM pre-training phase. Training this generative objective jointly with the main model substantially improves embodied intelligence, enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a large-scale dataset that mixes grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits strong task execution in both simulation environments and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/.
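
The joint training described above can be pictured as a weighted sum of two losses. The following is a minimal sketch, not the authors' implementation: it assumes a next-token cross-entropy term for the text output and a mean absolute error term for the generated depth map, combined with a hypothetical weight `depth_weight`. The function names, the choice of an L1 depth loss, and the weight value are illustrative assumptions; the abstract does not specify any of them.

```python
import math


def text_cross_entropy(logits, target_ids):
    """Mean cross-entropy over tokens; logits is a list of per-token score lists."""
    total = 0.0
    for token_logits, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(token_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in token_logits))
        total += log_z - token_logits[target]
    return total / len(target_ids)


def depth_l1(pred_depth, true_depth):
    """Mean absolute error between two depth maps given as 2D lists."""
    diffs = [
        abs(p - t)
        for pred_row, true_row in zip(pred_depth, true_depth)
        for p, t in zip(pred_row, true_row)
    ]
    return sum(diffs) / len(diffs)


def joint_loss(logits, target_ids, pred_depth, true_depth, depth_weight=0.5):
    """Assumed joint objective: text loss plus a weighted depth-generation loss."""
    text_term = text_cross_entropy(logits, target_ids)
    depth_term = depth_l1(pred_depth, true_depth)
    return text_term + depth_weight * depth_term
```

In a sketch like this, setting `depth_weight` to 0 recovers text-only training, which makes the contribution of the depth term easy to isolate in an ablation.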