Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

May 30, 2025
Authors: Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, Shenglong Ye, Lewei Lu, Jingbo Wang, Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, Xizhou Zhu
cs.AI

Abstract

The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing attention to extending them to physical entities such as legged robots. This typically requires MLLMs not only to grasp multimodal understanding abilities, but also to integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless, existing methods struggle to unify these capabilities due to their fundamental differences. In this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control in the real world. VeBrain reformulates robotic control into common text-based MLLM tasks in the 2D visual space, thus unifying the objectives and mapping spaces of different tasks. A novel robotic adapter is then proposed to convert textual control signals from MLLMs into motion policies for real robots. From the data perspective, we further introduce VeBrain-600k, a high-quality instruction dataset encompassing the various capabilities of VeBrain. For VeBrain-600k, we spent hundreds of hours collecting, curating, and annotating the data, and adopted multimodal chain-of-thought (CoT) to mix the different capabilities into a single conversation. Extensive experiments on 13 multimodal benchmarks and 5 spatial intelligence benchmarks demonstrate the superior performance of VeBrain over existing MLLMs such as Qwen2.5-VL. When deployed on legged robots and robotic arms, VeBrain shows strong adaptability, flexibility, and compositional capability compared to existing methods. For example, compared with Qwen2.5-VL, VeBrain not only achieves a substantial gain of +5.6% on MMVet, but also excels in legged robot tasks with average gains of +50%.
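The abstract describes a pipeline in which the MLLM emits a textual control signal grounded in the 2D visual space and a robotic adapter converts it into a motion policy for a real robot. The following is a minimal, hypothetical Python sketch of that flow; the signal format and every name here (`parse_control_signal`, `RoboticAdapter`, `MotionCommand`) are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of the control flow described in the abstract.
# The textual signal format and adapter interface are assumptions for
# illustration only, not VeBrain's actual API.

import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class MotionCommand:
    """A low-level command an assumed motion policy could consume."""
    action: str                      # e.g. "walk_to", "grasp"
    target_xy: Tuple[float, float]   # 2D keypoint in normalized image coordinates


def parse_control_signal(mllm_output: str) -> Optional[MotionCommand]:
    """Parse a text-based control signal such as 'grasp <point>(0.62, 0.41)</point>'.

    The tag format is an assumption; the paper only states that control is
    reformulated as text-based MLLM tasks in the 2D visual space.
    """
    match = re.search(r"(\w+)\s*<point>\(([\d.]+),\s*([\d.]+)\)</point>", mllm_output)
    if match is None:
        return None
    action, x, y = match.group(1), float(match.group(2)), float(match.group(3))
    return MotionCommand(action=action, target_xy=(x, y))


class RoboticAdapter:
    """Assumed adapter mapping textual control signals to low-level motion policies."""

    def __init__(self, policies: Dict[str, Callable[[Tuple[float, float]], dict]]):
        # policies: action name -> callable taking a 2D target, returning a policy command
        self.policies = policies

    def execute(self, mllm_output: str) -> dict:
        command = parse_control_signal(mllm_output)
        if command is None or command.action not in self.policies:
            raise ValueError(f"Unrecognized control signal: {mllm_output!r}")
        # Delegate to the matching low-level policy (e.g. locomotion or grasping).
        return self.policies[command.action](command.target_xy)


if __name__ == "__main__":
    adapter = RoboticAdapter(policies={"grasp": lambda xy: {"gripper_target": xy}})
    print(adapter.execute("grasp <point>(0.62, 0.41)</point>"))
```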