Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

Abstract

The remarkable progress of Multimodal Large Language Models (MLLMs) has drawn increasing attention to extending them to physical entities such as legged robots. This typically requires MLLMs not only to acquire multimodal understanding abilities, but also to integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless, existing methods struggle to unify these capabilities because of their fundamental differences. In this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control in the real world. VeBrain reformulates robotic control into common text-based MLLM tasks in the 2D visual space, thus unifying the objectives and mapping spaces of different tasks. A novel robotic adapter is then proposed to convert textual control signals from MLLMs into motion policies of real robots. From the data perspective, we further introduce VeBrain-600k, a high-quality instruction dataset covering the various capabilities of VeBrain. For VeBrain-600k, we spent hundreds of hours collecting, curating, and annotating the data, and we adopt multimodal chain-of-thought (CoT) to mix the different capabilities into a single conversation. Extensive experiments on 13 multimodal benchmarks and 5 spatial intelligence benchmarks demonstrate the superior performance of VeBrain over existing MLLMs such as Qwen2.5-VL. When deployed to legged robots and robotic arms, VeBrain shows strong adaptability, flexibility, and compositional capabilities compared to existing methods. For example, compared to Qwen2.5-VL, VeBrain not only achieves a substantial gain of +5.6% on MMVet, but also excels in legged robot tasks with an average gain of +50%.
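To make the idea of a robotic adapter more concrete, the following is a minimal sketch of how a textual control signal expressed in 2D image space might be turned into a motion command. It is an illustration under stated assumptions, not the method described in the paper: the signal format (`<skill> (<x>, <y>)` in pixel coordinates), the `MotionCommand` type, the `adapt` function, and the steering heuristic are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class MotionCommand:
    """Hypothetical low-level command; not the paper's actual policy format."""
    skill: str        # named action to execute, e.g. "move_to"
    yaw_rate: float   # normalized turn rate; positive means turn right
    forward: float    # normalized forward speed


# Assumed text format emitted by the MLLM: "<skill> (<x>, <y>)" in pixel coordinates.
_SIGNAL = re.compile(r"^\s*(\w+)\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*$")


def adapt(signal: str, width: int, height: int) -> MotionCommand:
    """Convert a textual control signal into a motion command (illustrative only)."""
    match = _SIGNAL.match(signal)
    if match is None:
        raise ValueError(f"unrecognized control signal: {signal!r}")
    skill = match.group(1)
    x, y = int(match.group(2)), int(match.group(3))
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("target point lies outside the image")
    # Horizontal offset of the target from the image center, scaled by half the width.
    yaw_rate = (x - width / 2) / (width / 2)
    # Assumption: targets higher in the image are farther away, so move faster.
    forward = 1.0 - y / height
    return MotionCommand(skill=skill, yaw_rate=yaw_rate, forward=forward)


if __name__ == "__main__":
    print(adapt("move_to (480, 120)", width=640, height=480))
```

The point of the sketch is only that, once control is phrased as text over 2D image coordinates, the model's output can be parsed and mapped to robot motion by a separate component; a real adapter would need far more than this heuristic.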
