Scaling Spatial Intelligence with Multimodal Foundation Models
November 17, 2025
Authors: Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
cs.AI
Abstract
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risks of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate potential downstream applications. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.