RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
June 4, 2025
Authors: Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang
cs.AI
Abstract
Spatial referring is a fundamental capability for embodied robots to interact with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches still struggle to accurately understand complex 3D scenes and to dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored to spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x the size of prior datasets), covering 31 spatial relations (vs. 15 previously) and supporting complex reasoning processes of up to 5 steps. In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks on diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.
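
The abstract names metric-sensitive process reward functions for spatial referring but does not spell them out. As a rough illustration only, the Python sketch below shows one way a metric-sensitive point reward could behave, assuming the policy outputs a 2D image point and the reward should grow continuously as the prediction approaches the ground truth; the function name point_reward, the Gaussian decay, and the sigma parameter are illustrative assumptions, not the paper's actual design.

    import numpy as np

    def point_reward(pred_xy: np.ndarray, gt_xy: np.ndarray,
                     image_diag: float, sigma: float = 0.05) -> float:
        """Hypothetical reward that decays smoothly with the normalized
        Euclidean distance between the predicted and ground-truth points,
        so closer predictions earn strictly higher reward (metric-sensitive),
        unlike a binary hit/miss signal."""
        dist = float(np.linalg.norm(pred_xy - gt_xy)) / image_diag  # scale-free
        return float(np.exp(-(dist / sigma) ** 2))

    # Example: a prediction 20 px from the target on an image with an
    # 800 px diagonal earns a reward of about 0.78.
    pred = np.array([320.0, 240.0])
    gt = np.array([336.0, 252.0])
    print(point_reward(pred, gt, image_diag=800.0))

A smooth, distance-graded signal of this kind, as opposed to a binary hit/miss reward, is plausibly what "metric-sensitive" refers to: it gives the RFT stage graded feedback on how far off each intermediate prediction is, rather than only whether it landed inside the target region.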