SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

March 23, 2026
Authors: Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin
cs.AI

Abstract

Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, which constrains their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea is to convert dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate its effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 from 55.9 to 59.7 mIoU on ADE20K, a gain of 3.8 mIoU points over the pre-trained DINOv3, achieving state-of-the-art performance.
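As a quick check on the ADE20K figures quoted in the abstract, the minimal sketch below computes the gain two ways. The 3.8 figure is the absolute difference in mIoU points (59.7 minus 55.9); the relative improvement over the baseline is a different number. The function name `miou_gain` is a hypothetical helper introduced only for this illustration and is not part of SpatialBoost.

```python
def miou_gain(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in mIoU points, relative gain in percent).

    Hypothetical helper for illustration; not from the SpatialBoost codebase.
    """
    absolute = improved - baseline           # difference in mIoU points
    relative = 100.0 * absolute / baseline   # improvement relative to the baseline
    return absolute, relative


# ADE20K numbers quoted in the abstract: DINOv3 baseline vs. SpatialBoost.
absolute, relative = miou_gain(55.9, 59.7)
print(f"absolute gain: {absolute:.1f} mIoU points")  # absolute gain: 3.8 mIoU points
print(f"relative gain: {relative:.1f}%")             # relative gain: 6.8%
```

Reporting the gain in mIoU points rather than as a percentage avoids confusing the absolute difference with the relative improvement.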