SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
May 18, 2025
Authors: Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang
cs.AI
Abstract
Despite impressive advancements in Visual-Language Models (VLMs) for
multi-modal tasks, their reliance on RGB inputs limits precise spatial
understanding. Existing methods for integrating spatial cues, such as point
clouds or depth, either require specialized sensors or fail to effectively
exploit depth information for higher-order reasoning. To this end, we propose
SSR (Spatial Sense and Reasoning), a novel framework that
transforms raw depth data into structured, interpretable textual rationales.
These textual rationales serve as meaningful intermediate representations to
significantly enhance spatial reasoning capabilities. Additionally, we leverage
knowledge distillation to compress the generated rationales into compact latent
embeddings, which facilitate resource-efficient and plug-and-play integration
into existing VLMs without retraining. To enable comprehensive evaluation, we
introduce a new dataset named SSR-CoT, a million-scale visual-language
reasoning dataset enriched with intermediate spatial reasoning annotations, and
present SSRBench, a comprehensive multi-task benchmark. Extensive experiments
on multiple benchmarks demonstrate that SSR substantially improves depth utilization
and enhances spatial reasoning, thereby advancing VLMs toward more human-like
multi-modal understanding. Our project page is at
https://yliu-cs.github.io/SSR.