Cambrian-S: Towards Spatial Supersensing in Video
November 6, 2025
Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie
cs.AI
Abstract
We argue that progress in true multimodal intelligence calls for a shift from
reactive, task-driven systems and brute-force long context towards a broader
paradigm of supersensing. We frame spatial supersensing as four stages beyond
linguistic-only understanding: semantic perception (naming what is seen),
streaming event cognition (maintaining memory across continuous experiences),
implicit 3D spatial cognition (inferring the world behind pixels), and
predictive world modeling (creating internal models that filter and organize
information). Current benchmarks largely test only the early stages, offering
narrow coverage of spatial cognition and rarely challenging models in ways that
require true world modeling. To drive progress in spatial supersensing, we
present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial
recall) and VSC (continual visual spatial counting). These tasks require
arbitrarily long video inputs yet are resistant to brute-force context
expansion. We then test data scaling limits by curating VSI-590K and training
Cambrian-S, achieving +30% absolute improvement on VSI-Bench without
sacrificing general capabilities. Yet performance on VSI-SUPER remains limited,
indicating that scale alone is insufficient for spatial supersensing. We
propose predictive sensing as a path forward, presenting a proof-of-concept in
which a self-supervised next-latent-frame predictor leverages surprise
(prediction error) to drive memory and event segmentation. On VSI-SUPER, this
approach substantially outperforms leading proprietary baselines, showing that
spatial supersensing requires models that not only see but also anticipate,
select, and organize experience.
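To make the predictive-sensing idea concrete, the following is a minimal Python/PyTorch sketch of surprise-driven memory and event segmentation. It illustrates only the general mechanism the abstract describes, not the paper's actual architecture: `LatentFramePredictor`, `surprise_driven_stream`, and the fixed surprise threshold are hypothetical stand-ins chosen for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentFramePredictor(nn.Module):
    """Hypothetical stand-in for a self-supervised next-latent-frame
    predictor: maps the current frame's latent to a prediction of the
    next frame's latent."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


def surprise_driven_stream(latents, predictor, threshold=1.0):
    """Consume a stream of per-frame latents; use prediction error
    ("surprise") to decide which frames to keep in memory and where
    to place event boundaries. The fixed threshold is illustrative."""
    memory, boundaries = [latents[0]], []  # always keep the first frame
    prev = latents[0]
    for t in range(1, len(latents)):
        z = latents[t]
        with torch.no_grad():
            pred = predictor(prev)
        surprise = F.mse_loss(pred, z).item()  # prediction error
        if surprise > threshold:
            boundaries.append(t)  # high surprise -> event boundary
            memory.append(z)      # consolidate the surprising frame
        prev = z
    return memory, boundaries


# Toy usage with random latents as stand-ins for frame encodings.
torch.manual_seed(0)
predictor = LatentFramePredictor(dim=256)
frames = [torch.randn(256) for _ in range(100)]
mem, cuts = surprise_driven_stream(frames, predictor, threshold=1.9)
print(f"kept {len(mem)} of {len(frames)} frames; {len(cuts)} event boundaries")
```

The design intuition, per the abstract, is that high prediction error signals a violated expectation of the internal model: those frames are worth consolidating into memory and treating as event boundaries, while well-predicted frames can be compressed or discarded, which is what lets such a system handle arbitrarily long video without brute-force context expansion.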