Dual-Path Reasoning in Reinforced Spatial Vision-Language Models

Abstract

Spatial vision-language models (VLMs) have made substantial progress in geometric perception, yet complex spatial reasoning that requires multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by reinforcement learning (RL) that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines, and the experiments support four findings: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
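To make the reward design concrete, the following is a minimal sketch of how accuracy, format, and a discrete center-based detection reward might be combined for the two paths. The abstract does not specify any of the details used here: the `<think>`/`<answer>` template, exact-match scoring, the distance thresholds, the equal weighting, and all function names are assumptions chosen purely for illustration, not the authors' implementation.

```python
import math
import re
from typing import Optional, Sequence

# Hypothetical output template (an assumption, not taken from the paper):
# <think> reasoning </think><answer> final answer </answer>
TEMPLATE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.fullmatch(response) else 0.0


def accuracy_reward(response: str, gold_answer: str) -> float:
    """Return 1.0 on a case-insensitive exact match of the answer, else 0.0."""
    match = TEMPLATE.fullmatch(response)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_answer.strip().lower() else 0.0


def center_reward(
    pred_center: Optional[Sequence[float]],
    gold_center: Sequence[float],
    thresholds: Sequence[float] = (0.25, 0.5, 1.0),  # assumed, in metres
) -> float:
    """Discrete reward: fraction of distance thresholds the prediction meets."""
    if pred_center is None or len(pred_center) != 3:
        return 0.0
    error = math.dist(pred_center, gold_center)
    return sum(error <= t for t in thresholds) / len(thresholds)


def total_reward(
    response: str,
    gold_answer: str,
    pred_center: Optional[Sequence[float]] = None,
    gold_center: Optional[Sequence[float]] = None,
    path: str = "LOR",
) -> float:
    """Sum the rewards; the detection term applies only to the DTR path."""
    reward = accuracy_reward(response, gold_answer) + format_reward(response)
    if path == "DTR" and gold_center is not None:
        reward += center_reward(pred_center, gold_center)
    return reward
```

In this sketch an LOR rollout is scored only on answer correctness and format, while a DTR rollout earns an additional stepwise bonus that grows as its predicted 3D center gets closer to the reference center. Using a few coarse thresholds instead of a continuous distance is one possible reading of "discrete"; the paper may define it differently.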