Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
June 16, 2026
Authors: Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu
cs.AI
Abstract
Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
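To make the reward design in the abstract concrete, the following is a minimal sketch of how accuracy, format, and discrete center-based detection rewards might be combined for the two paths. The abstract does not give the actual formulation, so everything here is an assumption for illustration: the `<think>`/`<answer>` output template, exact-match scoring, the distance thresholds (taken to be in metres), the equal weighting of terms, and all function names.

```python
import math
import re

# Hypothetical output template; the abstract does not specify the real format.
TEMPLATE = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed think/answer template, else 0.0."""
    return 1.0 if TEMPLATE.fullmatch(response.strip()) else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer matches the gold answer (case-insensitive)."""
    match = TEMPLATE.fullmatch(response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip().lower() == gold.strip().lower() else 0.0


def center_reward(pred, gt, thresholds=(0.25, 0.5, 1.0)) -> float:
    """Discrete reward: fraction of (assumed) distance thresholds that the
    predicted 3D center satisfies relative to the ground-truth center."""
    dist = math.dist(pred, gt)
    return sum(dist <= t for t in thresholds) / len(thresholds)


def total_reward(response, gold, path, pred_center=None, gt_center=None) -> float:
    """Sum accuracy and format rewards; add the center reward only for DTR."""
    reward = accuracy_reward(response, gold) + format_reward(response)
    if path == "DTR" and pred_center is not None and gt_center is not None:
        reward += center_reward(pred_center, gt_center)
    return reward


if __name__ == "__main__":
    response = "<think>The chair is nearer to the camera.</think><answer>chair</answer>"
    print(total_reward(response, "chair", "LOR"))
    print(total_reward(response, "chair", "DTR", (1.0, 2.0, 3.0), (1.0, 2.0, 3.1)))
```

A stepwise (thresholded) score is one plausible reading of "discrete center-based"; a real system might instead use a single threshold or per-axis bins, and would likely weight the terms differently.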