Visual Spatial Tuning
November 7, 2025
Authors: Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao
cs.AI
Abstract
Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which introduces additional overhead and usually harms general capabilities. To enhance spatial abilities within general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework that cultivates human-like visuospatial abilities in VLMs, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset, VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. We then present VST-R, a curated dataset of 135K samples that instructs models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without degrading general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.
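To make the two-stage recipe concrete, below is a minimal, self-contained sketch of a progressive pipeline of this kind: a supervised fine-tuning stage followed by a reinforcement-learning stage driven by a verifiable correctness reward. The tiny linear "model", the synthetic data, and the REINFORCE-style policy-gradient update are illustrative assumptions standing in for VST's actual VLM, the VST-P/VST-R datasets, and its RL algorithm, none of which are specified in the abstract.

```python
# Hedged sketch: progressive training = supervised fine-tuning, then RL on the SFT checkpoint.
# Everything concrete here (toy classifier, synthetic data, REINFORCE with a 0/1 reward)
# is an assumption for illustration, not the VST authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for a VLM: a linear head over dummy "visual" features.
model = nn.Linear(16, 4)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Stage 1: supervised fine-tuning on labeled spatial-perception data (stand-in for VST-P).
x_sft = torch.randn(256, 16)
y_sft = torch.randint(0, 4, (256,))
for _ in range(100):
    loss = F.cross_entropy(model(x_sft), y_sft)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: reinforcement learning on spatial-reasoning queries (stand-in for VST-R),
# starting from the SFT checkpoint rather than from scratch.
x_rl = torch.randn(256, 16)
y_rl = torch.randint(0, 4, (256,))
for _ in range(100):
    dist = torch.distributions.Categorical(logits=model(x_rl))
    actions = dist.sample()
    reward = (actions == y_rl).float()        # verifiable reward: 1 if the answer is correct
    advantage = reward - reward.mean()        # simple baseline to reduce gradient variance
    loss = -(dist.log_prob(actions) * advantage).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("post-RL accuracy:", (model(x_rl).argmax(-1) == y_rl).float().mean().item())
```

The property this sketch mirrors is the ordering of the two stages: the policy-gradient phase refines a model that already carries supervised spatial knowledge, rather than optimizing a reward from a random initialization.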