SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

Abstract

Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, which limits their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea is to convert dense 3D spatial information extracted from 2D images into linguistic expressions, which are then used to inject this spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate its effectiveness, we apply SpatialBoost to state-of-the-art vision encoders such as DINOv3 and evaluate the performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with a gain of 3.8 mIoU over the pre-trained DINOv3.
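
To make the core idea concrete, the following minimal Python sketch turns a toy depth map into a sentence about two pixels. The function name, the `tolerance` parameter, and the depth values are all illustrative assumptions; this is not the SpatialBoost pipeline, only a picture of what expressing 3D spatial information in language could look like at the smallest scale.

```python
# Hypothetical sketch: NOT the SpatialBoost pipeline. It only illustrates
# turning per-pixel depth into a sentence that a language model could consume.

def describe_depth_relation(depth, a, b, tolerance=0.05):
    """Return a sentence comparing the depth at pixels a and b.

    depth: 2D list of distances from the camera (larger = farther).
    a, b: (row, col) pixel coordinates.
    tolerance: differences at or below this count as 'about the same'.
    """
    da = depth[a[0]][a[1]]
    db = depth[b[0]][b[1]]
    if abs(da - db) <= tolerance:
        relation = "at about the same distance as"
    elif da < db:
        relation = "closer to the camera than"
    else:
        relation = "farther from the camera than"
    return f"Point {a} is {relation} point {b}."


# Toy depth map (assumed values): a near surface on the left, a far one on the right.
depth_map = [
    [1.0, 1.0, 4.0],
    [1.0, 2.5, 4.0],
]

print(describe_depth_relation(depth_map, (0, 0), (1, 2)))
# prints: Point (0, 0) is closer to the camera than point (1, 2).
```

A multi-turn setup, as the abstract describes, would presumably ask a sequence of such questions that grow denser and more detailed; the single comparison here stands in for one turn only.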
