RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
December 15, 2025
Authors: Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang
cs.AI
Abstract
Spatial tracing, a fundamental embodied interaction ability for robots, is inherently challenging because it requires multi-step metric-grounded reasoning combined with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that is the first to achieve both 3D spatial referring and measuring, using a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, which supervise key intermediate perceptual cues so that spatial traces are generated accurately. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs that spans outdoor, indoor, and tabletop scenes and supports complex reasoning processes of up to 9 steps. We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
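The abstract does not give the form of the metric-sensitive process rewards, so the following is only an illustrative sketch of the general idea: score each intermediate 3D prediction by its error in metres against a reference, then average over the reasoning steps. The function names, the exponential decay, and the `scale_m` parameter are assumptions made for illustration and are not taken from RoboTracer.

```python
import math
from typing import Sequence, Tuple

# Hypothetical representation: one 3D point (x, y, z) in metres per reasoning step.
Point3D = Tuple[float, float, float]


def point_reward(pred: Point3D, ref: Point3D, scale_m: float = 0.1) -> float:
    """Return a reward in (0, 1] that decays with Euclidean error in metres.

    The exponential form and the 0.1 m scale are assumptions, not the paper's.
    """
    error_m = math.dist(pred, ref)
    return math.exp(-error_m / scale_m)


def process_reward(
    pred_steps: Sequence[Point3D],
    ref_steps: Sequence[Point3D],
    scale_m: float = 0.1,
) -> float:
    """Average the per-step rewards over all intermediate predictions.

    Returns 0.0 when the step counts differ or no reference steps are given.
    """
    if not ref_steps or len(pred_steps) != len(ref_steps):
        return 0.0
    total = sum(point_reward(p, r, scale_m) for p, r in zip(pred_steps, ref_steps))
    return total / len(ref_steps)
```

Because the reward depends on the absolute distance in metres rather than on a binary hit-or-miss test, a prediction that is off by 5 cm scores higher than one that is off by 50 cm, which is one plausible reading of "metric-sensitive".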