RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
December 15, 2025
Authors: Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang
cs.AI
Abstract
Spatial tracing, a fundamental embodied interaction ability for robots, is inherently challenging because it requires multi-step, metric-grounded reasoning combined with complex spatial referring and real-world metric measurement. Existing methods struggle with this compositional task. To this end, we propose RoboTracer, the first 3D-aware vision-language model (VLM) to achieve both 3D spatial referring and measuring, using a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step, metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, which supervise key intermediate perceptual cues so that spatial traces are generated accurately. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs that spans outdoor, indoor, and tabletop scenes and supports complex reasoning processes of up to 9 steps. We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
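To make the notion of a metric-grounded spatial trace concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: the abstract does not specify the trace format or the reward, so the representation (an ordered list of 3D points in meters), the function names `trace_length` and `metric_reward`, and the `tolerance_m` parameter are all assumptions introduced here.

```python
import math
from typing import List, Tuple

# Assumed convention (not from the paper): a trace is an ordered list of
# (x, y, z) waypoints expressed in meters.
Point3D = Tuple[float, float, float]


def trace_length(trace: List[Point3D]) -> float:
    """Total path length of an ordered 3D trace, in meters."""
    return sum(math.dist(a, b) for a, b in zip(trace, trace[1:]))


def metric_reward(pred: List[Point3D], ref: List[Point3D],
                  tolerance_m: float = 0.05) -> float:
    """Hypothetical metric-sensitive score in (0, 1].

    Returns 1.0 when the predicted trace matches the reference exactly and
    decays exponentially with the mean per-waypoint error in meters.
    This is a stand-in for illustration, not the reward used by RoboTracer.
    """
    if not pred or len(pred) != len(ref):
        raise ValueError("traces must be non-empty and of equal length")
    mean_err = sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(ref)
    return math.exp(-mean_err / tolerance_m)


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.1), (0.4, 0.1, 0.1)]
    predicted = [(0.0, 0.0, 0.0), (0.2, 0.02, 0.1), (0.4, 0.1, 0.13)]
    print(f"reference length: {trace_length(reference):.3f} m")
    print(f"reward: {metric_reward(predicted, reference):.3f}")
```

Because the score depends on errors measured in meters rather than pixels, a prediction with the right shape but the wrong scale is penalized. That property is the point of the sketch, whatever form the actual reward takes.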