
Visual Spatial Tuning

November 7, 2025
Authors: Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao
cs.AI

Abstract

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which introduces extra overhead and usually harms general capabilities. To enhance spatial ability within general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework that cultivates human-like visuospatial abilities in VLMs, from spatial perception to reasoning. We first enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. We then present VST-R, a curated dataset of 135K samples that instructs models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.
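
For readers who want a concrete picture of the progressive training recipe described in the abstract, the following minimal Python sketch illustrates the two-stage flow: supervised fine-tuning on VST-P, then reinforcement learning on VST-R. The function names, hook signatures, and reward shape here are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Sketch of a two-stage "spatial tuning" pipeline (SFT on VST-P, then RL on VST-R).
# All names and signatures below are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Sample:
    prompt: str   # serialized visual + text input
    target: str   # reference answer (supervision for SFT, reward signal for RL)


def sft_stage(model, vst_p: Iterable[Sample], sft_step: Callable):
    """Stage 1: build foundational spatial knowledge via supervised fine-tuning."""
    for sample in vst_p:
        # Standard next-token cross-entropy update on (prompt, target) pairs.
        sft_step(model, sample.prompt, sample.target)
    return model


def rl_stage(model, vst_r: Iterable[Sample], rollout: Callable,
             reward_fn: Callable, rl_step: Callable):
    """Stage 2: further improve spatial reasoning via reinforcement learning."""
    for sample in vst_r:
        response = rollout(model, sample.prompt)         # sample a reasoning trace
        reward = reward_fn(response, sample.target)      # e.g. answer correctness
        rl_step(model, sample.prompt, response, reward)  # policy-gradient-style update
    return model


def visual_spatial_tuning(model, vst_p, vst_r, hooks):
    """Progressive pipeline: SFT first, RL second, as outlined in the abstract."""
    model = sft_stage(model, vst_p, hooks["sft_step"])
    model = rl_stage(model, vst_r, hooks["rollout"], hooks["reward"], hooks["rl_step"])
    return model
```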