Reinforcing Dual-Path Reasoning in Spatial Vision-Language Models

Abstract

Spatial vision-language models (VLMs) have made substantial progress in geometric perception, yet complex spatial reasoning that requires multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which first detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens and then performs explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs chain-of-thought supervision for both LOR and DTR and exposes a region-to-3D interface. A reinforcement learning (RL) stage then optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines, and our experiments show that: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
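
To make the reward design described above more concrete, the following is a minimal illustrative sketch of how an accuracy reward, a format reward, and a discrete center-based detection reward might be combined. It is not the authors' implementation. The `<think>`/`<answer>` output format, the exact-match accuracy check, the distance thresholds, the equal weighting of the three terms, and all function names are assumptions introduced here for illustration only.

```python
import math
import re

# ASSUMPTION: responses wrap reasoning in <think>...</think> and the final
# answer in <answer>...</answer>. The abstract does not specify the format.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(response.strip()) else 0.0


def accuracy_reward(response: str, target: str) -> float:
    """Return 1.0 if the extracted answer matches the target (case-insensitive)."""
    match = FORMAT_RE.match(response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == target.strip().lower() else 0.0


def center_detection_reward(pred_center, gt_center, thresholds=(0.25, 0.5, 1.0)) -> float:
    """Discrete reward: fraction of distance thresholds the prediction satisfies.

    ASSUMPTION: centers are 3D points in a shared metric frame and the
    thresholds are hypothetical values chosen for this sketch.
    """
    dist = math.dist(pred_center, gt_center)
    return sum(dist <= t for t in thresholds) / len(thresholds)


def total_reward(response, target, pred_center=None, gt_center=None) -> float:
    """Sum accuracy and format rewards; add the detection term only for DTR samples."""
    reward = accuracy_reward(response, target) + format_reward(response)
    if pred_center is not None and gt_center is not None:
        # DTR path: the response included a predicted 3D center to score.
        reward += center_detection_reward(pred_center, gt_center)
    return reward
```

In this sketch an LOR sample is scored by calling `total_reward(response, target)` alone, while a DTR sample also passes the predicted and ground-truth centers. Using stepped thresholds rather than a continuous distance keeps the detection term discrete, which is one plausible reading of "discrete center-based detection reward".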