SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

Abstract

Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across a variety of vision tasks, these models are trained predominantly on 2D image data and therefore often fail to capture the 3D spatial relationships between objects and backgrounds in the real world, which limits their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed as linguistic descriptions. The core idea is to convert dense 3D spatial information derived from 2D images into linguistic expressions, which are then used to inject that spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate its effectiveness, we apply SpatialBoost to state-of-the-art vision encoders such as DINOv3 and evaluate the performance gains on a wide range of benchmarks that require both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 on ADE20K from 55.9 to 59.7 mIoU, a gain of 3.8 mIoU points over the pre-trained DINOv3, achieving state-of-the-art performance.
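To make the reported ADE20K figures concrete, the following is a minimal sketch of how a mean intersection-over-union (mIoU) score can be computed and how the change from 55.9 to 59.7 mIoU reads as an absolute versus a relative gain. The helper `mean_iou` and the toy label lists are hypothetical illustrations, not the authors' evaluation code. Benchmark implementations typically accumulate intersections and unions over the whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    `pred` and `target` are flat sequences of integer class labels, one per pixel.
    This is a simplified, single-image version used only for illustration.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy six-pixel label maps (hypothetical values, not taken from the paper).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 2]
toy_miou = mean_iou(pred, target, num_classes=3)

# Figures reported in the abstract, with mIoU expressed in percent.
baseline, boosted = 55.9, 59.7
absolute_gain = boosted - baseline               # difference in mIoU points
relative_gain = absolute_gain / baseline * 100   # percent of the baseline score

print(f"toy mIoU: {toy_miou:.4f}")
print(f"absolute gain: {absolute_gain:.1f} points, relative gain: {relative_gain:.1f}%")
```

Read this way, the stated 3.8 gain is the absolute difference in mIoU points, which corresponds to roughly a 6.8% improvement relative to the 55.9 baseline.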

