SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
December 3, 2025
Authors: Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay
cs.AI
Abstract
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with the metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can augment these capabilities with a wide variety of tools, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge to realize this vision without relying solely on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit a VLM's ability to discover optimal tool-use patterns. Reinforcement Learning (RL) could close this gap, but has so far been limited to reasoning with a single visual tool due to the large search space of multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework in which a VLM learns to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single-tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, equipped with tool-augmented spatial reasoning, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.
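The abstract only names DIRL's two phases, so the following is a minimal sketch of how the recipe could be wired together. All names here (TOOLS, Trace, train_single_tool_specialist, collect_frontier_traces, supervised_finetune, interactive_rl) are hypothetical stand-ins inferred from the abstract, and every training step is stubbed; this is not the authors' implementation.

```python
# Hypothetical sketch of the DIRL two-phase recipe (Python 3.9+).
# Everything below is inferred from the abstract alone; names and
# signatures are illustrative, and all "training" is stubbed out.
from dataclasses import dataclass
from typing import Callable

# Visual tools the abstract mentions. Real versions would return metric
# outputs (depth maps, segmentation masks, object poses); here, stubs.
TOOLS: dict[str, Callable[[str], str]] = {
    "depth": lambda image: "depth_map",
    "segment": lambda image: "masks",
    "pose": lambda image: "object_pose",
}

@dataclass
class Trace:
    """One tool-use trajectory: (tool_name, tool_output) steps plus a reward."""
    steps: list[tuple[str, str]]
    reward: float

def train_single_tool_specialist(vlm: str, tool: str) -> list[Trace]:
    """Teaching-phase ingredient (a): interactive RL with a single tool keeps
    the search space tractable; successful rollouts become demonstrations."""
    return [Trace(steps=[(tool, TOOLS[tool]("img.png"))], reward=1.0)]  # stub

def collect_frontier_traces(frontier: str) -> list[Trace]:
    """Teaching-phase ingredient (b): trajectories from a frontier model
    prompted with access to *all* tools."""
    return [Trace(steps=[(t, TOOLS[t]("img.png")) for t in TOOLS], reward=1.0)]  # stub

def supervised_finetune(vlm: str, demos: list[Trace]) -> str:
    """Teaching phase: SFT the base VLM on the combined demonstrations."""
    return f"{vlm}+sft[{len(demos)} demos]"  # stub

def interactive_rl(vlm: str, tasks: list[str]) -> str:
    """Exploration phase: continued RL with all tools available, refining
    multi-tool coordination from task reward."""
    return f"{vlm}+rl[{len(tasks)} tasks]"  # stub

def dirl(base_vlm: str, frontier: str, tasks: list[str]) -> str:
    # Teaching phase: specialist demos + frontier multi-tool traces -> SFT.
    demos = train_single_tool_specialist(base_vlm, "depth")  # tool choice illustrative
    demos += collect_frontier_traces(frontier)
    taught = supervised_finetune(base_vlm, demos)
    # Exploration phase: the SFT'd model continues learning via interactive RL.
    return interactive_rl(taught, tasks)

if __name__ == "__main__":
    print(dirl("base-vlm", "frontier-model", ["how far is the mug from the table edge?"]))
```

The structural point the abstract describes is that single-tool RL rollouts and frontier multi-tool traces jointly seed the SFT stage, so the exploration-phase RL starts from a policy that already emits well-formed multi-tool calls rather than searching the full multi-tool space from scratch.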