
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

December 3, 2025
Authors: Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay
cs.AI

Abstract

Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.
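The abstract describes DIRL only at a high level: a teaching phase that fine-tunes on pooled demonstrations (from a single-tool specialist trained with interactive RL, plus all-tool traces from a frontier model), followed by an exploration phase of continued RL. The sketch below is a minimal, hypothetical Python outline of that two-phase structure under those stated assumptions; every name here (DIRLTrainer, Trace, teaching_phase, exploration_phase, the rollout callables) is an illustrative placeholder, not the authors' implementation.

```python
# Minimal sketch of the two-phase DIRL recipe as summarized in the abstract.
# All class/function names are hypothetical placeholders; the paper's actual
# training stack (models, reward functions, optimizers) is not specified here.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """One tool-use trajectory: prompt, tool calls made, final answer, reward."""
    prompt: str
    tool_calls: List[str]
    answer: str
    reward: float = 0.0


@dataclass
class DIRLTrainer:
    """Hypothetical driver for Double Interactive RL (teaching + exploration)."""
    tools: List[str]  # e.g. depth estimator, segmentation, pose estimator, robot
    demos: List[Trace] = field(default_factory=list)

    def teaching_phase(
        self,
        specialist_rollout: Callable[[str], Trace],  # single-tool expert (interactive RL)
        frontier_rollout: Callable[[str], Trace],    # frontier model using all tools
        prompts: List[str],
    ) -> None:
        # Phase 1: pool demonstrations from both sources, then fine-tune
        # the VLM on the combined traces (supervised teaching).
        for p in prompts:
            self.demos.append(specialist_rollout(p))
            self.demos.append(frontier_rollout(p))
        self._finetune_on(self.demos)

    def exploration_phase(
        self,
        policy_rollout: Callable[[str], Trace],  # current VLM policy with tool access
        reward_fn: Callable[[Trace], float],     # e.g. spatial-accuracy reward
        prompts: List[str],
        steps: int = 3,
    ) -> None:
        # Phase 2: continued RL -- the model explores multi-tool coordination
        # interactively and is updated from environment feedback.
        for _ in range(steps):
            batch = [policy_rollout(p) for p in prompts]
            for t in batch:
                t.reward = reward_fn(t)
            self._policy_update(batch)

    def _finetune_on(self, traces: List[Trace]) -> None:
        print(f"[teaching] SFT on {len(traces)} pooled demonstration traces")

    def _policy_update(self, batch: List[Trace]) -> None:
        avg = sum(t.reward for t in batch) / max(len(batch), 1)
        print(f"[exploration] RL update, mean reward {avg:.2f}")
```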