

RoboBrain 2.0 Technical Report

July 2, 2025
Authors: BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, Yingbo Tang, Xiangqi Xu, Wei Guo, Yaoxu Lyu, Yijie Xu, Jiayu Shi, Cheng Chi, Mengdi Zhao, Xiaoshuai Hao, Shanyu Rong, Zhengliang Cai, Bolun Zhang, Shuyi Zhang, Huaihai Lyu, Mengfei Du, Lingfeng Zhang, Xi Feng, Xiaodan Liu, Yance Jiao, Chenrui He, Mengsi Lyu, Zhuo Chen, Yulong Ao, Xue Sun, Zheqi He, Jingshu Zheng, Xi Yang, Donghai Shi, Kunchang Xie, Bochao Zhang, Shaokai Nie, Chunlei Men, Yonghua Lin, Zhongyuan Wang, Tiejun Huang, Shanghang Zhang
cs.AI

Abstract
We introduce RoboBrain 2.0, our latest generation of embodied vision-language foundation models, designed to unify perception, reasoning, and planning for complex embodied tasks in physical environments. It comes in two variants: a lightweight 7B model and a full-scale 32B model, featuring a heterogeneous architecture with a vision encoder and a language model. Despite its compact size, RoboBrain 2.0 achieves strong performance across a wide spectrum of embodied reasoning tasks. On both spatial and temporal benchmarks, the 32B variant achieves leading results, surpassing prior open-source and proprietary models. In particular, it supports key real-world embodied AI capabilities, including spatial understanding (e.g., affordance prediction, spatial referring, trajectory forecasting) and temporal decision-making (e.g., closed-loop interaction, multi-agent long-horizon planning, and scene graph updating). This report details the model architecture, data construction, multi-stage training strategies, infrastructure, and practical applications. We hope RoboBrain 2.0 advances embodied AI research and serves as a practical step toward building generalist embodied agents. The code, checkpoints, and benchmarks are available at https://superrobobrain.github.io.