
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

June 4, 2025
Authors: Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang
cs.AI

Abstract

Spatial referring is a fundamental capability that lets embodied robots interact with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches still fall short of accurately understanding complex 3D scenes and dynamically reasoning about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored to spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2× prior datasets), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.
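The abstract does not spell out the form of the "metric-sensitive process reward" used during RFT. As a rough illustration only, the sketch below shows one plausible shape for such a reward: a point-prediction reward that decays smoothly with the metric (Euclidean) distance between a predicted 2D point and the ground-truth location, so closer predictions always score higher than farther ones. The function name, the normalized-coordinate convention, and the `sigma` decay scale are assumptions for illustration, not details from the paper.

```python
import math

def point_reward(pred_xy, gt_xy, sigma=0.05):
    """Hypothetical metric-sensitive reward for a predicted 2D point.

    Both points are assumed to be in normalized image coordinates
    ([0, 1] x [0, 1]); `sigma` controls how quickly the reward decays
    with distance. This is an illustrative sketch, not the reward
    actually used by RoboRefer.
    """
    dx = pred_xy[0] - gt_xy[0]
    dy = pred_xy[1] - gt_xy[1]
    dist = math.hypot(dx, dy)
    # Reward in (0, 1]: 1.0 for an exact hit, decaying smoothly with
    # metric distance so that "closer" is always rewarded over "farther".
    return math.exp(-dist / sigma)

# Example: a prediction offset by ~3% of the image diagonal from the target.
print(point_reward((0.52, 0.40), (0.50, 0.42)))  # ~0.57
```

A smooth, distance-graded reward like this (as opposed to a binary hit/miss signal) is what makes a process reward "metric-sensitive": intermediate reasoning steps that move the predicted point closer to the target receive partial credit.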