RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Abstract

Spatial referring is a fundamental capability for embodied robots interacting with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches still fall short of accurately understanding complex 3D scenes and dynamically reasoning about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.

