
Scaling Spatial Intelligence with Multimodal Foundation Models

November 17, 2025
Authors: Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
cs.AI

Abstract

Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
PDF (422) · December 1, 2025