SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

May 18, 2025
Authors: Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang
cs.AI

Abstract

Despite impressive advancements in Vision-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either require specialized sensors or fail to effectively exploit depth information for higher-order reasoning. To this end, we propose SSR, a novel Spatial Sense and Reasoning framework that transforms raw depth data into structured, interpretable textual rationales. These textual rationales serve as meaningful intermediate representations that significantly enhance spatial reasoning capabilities. Additionally, we leverage knowledge distillation to compress the generated rationales into compact latent embeddings, which facilitate resource-efficient and plug-and-play integration into existing VLMs without retraining. To enable comprehensive evaluation, we introduce a new dataset named SSR-CoT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBench, a comprehensive multi-task benchmark. Extensive experiments on multiple benchmarks demonstrate that SSR substantially improves depth utilization and enhances spatial reasoning, thereby advancing VLMs toward more human-like multi-modal understanding. Our project page is at https://yliu-cs.github.io/SSR.
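
As a rough illustration of the plug-and-play idea described in the abstract, the sketch below shows one way textual-rationale features could be compressed into a few latent embeddings and projected into a frozen VLM's token space. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the module name RationaleCompressor, the dimensions, the learnable-query cross-attention, and the MSE distillation loss are all illustrative choices.

```python
# Illustrative sketch only (not the SSR authors' code): compress variable-length
# rationale token features into k compact latent embeddings that could be
# prepended to a frozen VLM's input tokens. All names and dimensions are assumed.
import torch
import torch.nn as nn


class RationaleCompressor(nn.Module):
    """Distill rationale token features into a fixed number of latent embeddings."""

    def __init__(self, rationale_dim: int = 768, vlm_dim: int = 4096, num_latents: int = 8):
        super().__init__()
        # Learnable query vectors that attend over the rationale tokens.
        self.latents = nn.Parameter(torch.randn(num_latents, rationale_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(rationale_dim, num_heads=8, batch_first=True)
        # Project the pooled latents into the VLM's token embedding space.
        self.proj = nn.Linear(rationale_dim, vlm_dim)

    def forward(self, rationale_feats: torch.Tensor) -> torch.Tensor:
        # rationale_feats: (batch, seq_len, rationale_dim) from a frozen text encoder.
        batch = rationale_feats.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        latents, _ = self.cross_attn(queries, rationale_feats, rationale_feats)
        return self.proj(latents)  # (batch, num_latents, vlm_dim)


def distillation_loss(student_latents: torch.Tensor, teacher_summary: torch.Tensor) -> torch.Tensor:
    # Assumed objective: match compact latents to a teacher encoding of the full rationale.
    return nn.functional.mse_loss(student_latents, teacher_summary)


if __name__ == "__main__":
    compressor = RationaleCompressor()
    feats = torch.randn(2, 64, 768)   # dummy rationale token features
    latents = compressor(feats)
    print(latents.shape)              # torch.Size([2, 8, 4096])
```

Under these assumptions, only the compressor would be trained (against a teacher that sees the full textual rationale), while the base VLM stays frozen; the resulting latents are simply inserted alongside the visual tokens, which is what makes the integration plug-and-play.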
