RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
June 4, 2025
Authors: Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang
cs.AI
Abstract
Spatial referring is a fundamental capability for embodied robots to interact with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches still struggle to accurately understand complex 3D scenes and to dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored to spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x the size of prior datasets), covering 31 spatial relations (vs. 15 previously) and supporting complex reasoning processes of up to 5 steps. In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks on diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.
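
The abstract names metric-sensitive process reward functions for spatial referring but does not spell them out. As a rough illustration only, the Python sketch below shows one way a metric-sensitive point reward could behave, assuming the policy outputs a 2D image point and the reward should grow continuously as the prediction approaches the ground truth; the function name point_reward, the Gaussian decay, and the sigma parameter are illustrative assumptions, not the paper's actual design.

    import numpy as np

    def point_reward(pred_xy: np.ndarray, gt_xy: np.ndarray,
                     image_diag: float, sigma: float = 0.05) -> float:
        """Hypothetical reward that decays smoothly with the normalized
        Euclidean distance between the predicted and ground-truth points,
        so closer predictions earn strictly higher reward (metric-sensitive),
        unlike a binary hit/miss signal."""
        dist = float(np.linalg.norm(pred_xy - gt_xy)) / image_diag  # scale-free
        return float(np.exp(-(dist / sigma) ** 2))

    # Example: a prediction 20 px from the target on an image with an
    # 800 px diagonal earns a reward of about 0.78.
    pred = np.array([320.0, 240.0])
    gt = np.array([336.0, 252.0])
    print(point_reward(pred, gt, image_diag=800.0))

A smooth, distance-graded signal of this kind, as opposed to a binary hit/miss reward, is plausibly what "metric-sensitive" refers to: it gives the RFT stage graded feedback on how far off each intermediate prediction is, rather than only whether it landed inside the target region.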