SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

Abstract

Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either require specialized sensors or fail to effectively exploit depth information for higher-order reasoning. To this end, we propose SSR (Spatial Sense and Reasoning), a novel framework that transforms raw depth data into structured, interpretable textual rationales. These textual rationales serve as meaningful intermediate representations that significantly enhance spatial reasoning capabilities. Additionally, we leverage knowledge distillation to compress the generated rationales into compact latent embeddings, which enables resource-efficient, plug-and-play integration into existing VLMs without retraining. To enable comprehensive evaluation, we introduce SSR-CoT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBench, a comprehensive multi-task benchmark. Extensive experiments on multiple benchmarks demonstrate that SSR substantially improves depth utilization and enhances spatial reasoning, thereby advancing VLMs toward more human-like multi-modal understanding. Our project page is at https://yliu-cs.github.io/SSR.
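
To make the idea of turning depth into a textual rationale concrete, the following is a minimal toy sketch in Python. It is not the SSR pipeline, which relies on learned models; it only illustrates how per-object depth values could be verbalized as an intermediate reasoning step. The function names (region_depth, depth_rationale), the box format, and the toy depth map are all assumptions made for illustration.

```python
from statistics import median


def region_depth(depth, box):
    """Median depth (in metres) inside box = (row0, col0, row1, col1), end-exclusive."""
    r0, c0, r1, c1 = box
    values = [depth[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return median(values)


def depth_rationale(depth, objects):
    """Turn a depth map plus labelled boxes into a short textual rationale.

    Hypothetical helper: a hand-written rule, not a learned rationale generator.
    """
    # Rank objects from nearest to farthest by their median depth.
    ranked = sorted(
        ((name, region_depth(depth, box)) for name, box in objects.items()),
        key=lambda item: item[1],
    )
    steps = [f"The {name} is about {d:.1f} m from the camera." for name, d in ranked]
    if len(ranked) > 1:
        nearest, farthest = ranked[0][0], ranked[-1][0]
        steps.append(f"Therefore the {nearest} is closer than the {farthest}.")
    return " ".join(steps)


# Toy 4x4 depth map in metres (assumed values, not real sensor data).
depth = [
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.2, 3.0, 3.2],
    [2.0, 2.0, 3.0, 3.0],
    [2.0, 2.0, 3.0, 3.0],
]
# Assumed object boxes as (row0, col0, row1, col1).
objects = {"chair": (0, 2, 2, 4), "cup": (0, 0, 2, 2)}

rationale = depth_rationale(depth, objects)
print(rationale)
```

In this sketch the rationale is plain text that a language model could condition on; per the abstract, SSR additionally compresses such rationales into compact latent embeddings via knowledge distillation, which this example does not attempt to reproduce.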
