Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

June 16, 2026
Authors: Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu
cs.AI

Abstract

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
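
The reward design described in the abstract (accuracy and format rewards for both paths, plus a discrete center-based detection reward for DTR) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `<think>`/`<answer>` output layout, exact-match answer scoring, the distance thresholds, the metre unit, and the equal weighting of the three terms.

```python
import math
import re

# Hypothetical output layout: the abstract does not specify the response tags.
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed think/answer layout."""
    return 1.0 if THINK_ANSWER.match(response.strip()) else 0.0


def accuracy_reward(response: str, target: str) -> float:
    """Return 1.0 on a case-insensitive exact match of the final answer."""
    match = THINK_ANSWER.match(response.strip())
    if match is None:
        return 0.0
    answer = match.group(2).strip().lower()
    return 1.0 if answer == target.strip().lower() else 0.0


def center_detection_reward(pred, gt, thresholds=(0.25, 0.5, 1.0)) -> float:
    """Discrete reward from the distance between predicted and true 3D centers.

    The thresholds (assumed to be in metres) are illustrative; the reward is
    the fraction of thresholds that the Euclidean distance falls within.
    """
    dist = math.dist(pred, gt)
    hits = sum(dist <= t for t in thresholds)
    return hits / len(thresholds)


def total_reward(response, target, pred_center=None, gt_center=None) -> float:
    """Sum accuracy and format rewards; add the detection term for DTR only."""
    reward = accuracy_reward(response, target) + format_reward(response)
    if pred_center is not None and gt_center is not None:
        # Only DTR rollouts carry a predicted 3D center in this sketch.
        reward += center_detection_reward(pred_center, gt_center)
    return reward
```

In this sketch, an LOR rollout is scored by calling `total_reward(response, target)` without centers, while a DTR rollout also passes `pred_center` and `gt_center`, so a single policy can be optimized on both paths with one reward function.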