

SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

March 23, 2026
Authors: Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin
cs.AI

Abstract

Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture the 3D spatial relationships between objects and backgrounds in the real world, which constrains their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed as linguistic descriptions. The core idea is to convert dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject this spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate its effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3 and evaluate the resulting gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 from 55.9 to 59.7 mIoU on ADE20K, a 3.8-point gain over the pre-trained DINOv3 that achieves state-of-the-art performance.
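To make the described pipeline concrete, the sketch below illustrates the general idea of verbalizing dense 3D spatial cues and folding them into a multi-turn CoT dialogue. It is a minimal illustrative assumption of how such a pipeline could be organized, not the authors' implementation; all function names, data fields, and example facts (e.g., verbalize_spatial_layout, SpatialFact) are hypothetical placeholders.

```python
# Hypothetical sketch of a SpatialBoost-style pipeline (not the authors' API):
# dense 3D spatial cues from a 2D image are verbalized as language, then
# incorporated progressively into a multi-turn Chain-of-Thought dialogue
# that an LLM could use to inject spatial knowledge into a vision encoder.

from dataclasses import dataclass
from typing import List


@dataclass
class SpatialFact:
    """One verbalized piece of dense 3D spatial information (illustrative)."""
    subject: str
    relation: str  # e.g. "is closer to the camera than", "is left of"
    obj: str


def verbalize_spatial_layout(image_id: str) -> List[SpatialFact]:
    """Stand-in for a depth/layout estimator whose output is turned into
    language; a real system would run monocular depth or 3D detection here."""
    return [
        SpatialFact("the chair", "is closer to the camera than", "the table"),
        SpatialFact("the lamp", "is above and behind", "the sofa"),
    ]


def build_multiturn_cot(facts: List[SpatialFact]) -> List[str]:
    """Assemble a multi-turn CoT dialogue that adds spatial facts one turn at
    a time, building from coarse scene layout to finer object relations."""
    turns = ["Turn 1: Describe the overall scene layout in one sentence."]
    for i, f in enumerate(facts, start=2):
        turns.append(
            f"Turn {i}: Given that {f.subject} {f.relation} {f.obj}, "
            "refine your description of the scene's 3D structure."
        )
    turns.append(
        f"Turn {len(facts) + 2}: Summarize the hierarchical spatial "
        "relations among all mentioned objects."
    )
    return turns


if __name__ == "__main__":
    for turn in build_multiturn_cot(verbalize_spatial_layout("demo.jpg")):
        print(turn)
```

In this sketch the progressive, turn-by-turn accumulation of facts mirrors the abstract's claim that spatial knowledge is incorporated hierarchically rather than in a single prompt.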