Cambrian-S: Towards Spatial Supersensing in Video
November 6, 2025
Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie
cs.AI
Abstract
We argue that progress in true multimodal intelligence calls for a shift from
reactive, task-driven systems and brute-force long-context processing towards a broader
paradigm of supersensing. We frame spatial supersensing as four stages beyond
linguistic-only understanding: semantic perception (naming what is seen),
streaming event cognition (maintaining memory across continuous experiences),
implicit 3D spatial cognition (inferring the world behind pixels), and
predictive world modeling (creating internal models that filter and organize
information). Current benchmarks largely test only the early stages, offering
narrow coverage of spatial cognition and rarely challenging models in ways that
require true world modeling. To drive progress in spatial supersensing, we
present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial
recall) and VSC (continual visual spatial counting). These tasks require
arbitrarily long video inputs yet are resistant to brute-force context
expansion. We then test data scaling limits by curating VSI-590K and training
Cambrian-S, achieving +30% absolute improvement on VSI-Bench without
sacrificing general capabilities. Yet performance on VSI-SUPER remains limited,
indicating that scale alone is insufficient for spatial supersensing. We
propose predictive sensing as a path forward, presenting a proof-of-concept in
which a self-supervised next-latent-frame predictor leverages surprise
(prediction error) to drive memory and event segmentation. On VSI-SUPER, this
approach substantially outperforms leading proprietary baselines, showing that
spatial supersensing requires models that not only see but also anticipate,
select, and organize experience.
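
To make the surprise mechanism concrete, below is a minimal sketch of surprise-driven event segmentation over a stream of frame latents. It is illustrative only: it assumes per-frame features from a frozen encoder and substitutes a naive copy-last-latent predictor for the paper's learned self-supervised next-latent-frame predictor; the function name `surprise_driven_segmentation`, the window size, and the mean-plus-k-sigma threshold are hypothetical choices, not details from the paper.

```python
import numpy as np

def surprise_driven_segmentation(latents, window=30, k=4.0):
    """Flag event boundaries where prediction error ("surprise") spikes.

    latents: (T, D) array of per-frame features (stand-in for a frozen
    video encoder). The predictor here is a copy-last-latent baseline;
    a learned next-latent-frame predictor would take its place.
    """
    errors, boundaries, memory = [], [], []
    for t in range(1, len(latents)):
        pred = latents[t - 1]                            # predicted next latent
        err = float(np.linalg.norm(latents[t] - pred))   # surprise = prediction error
        if len(errors) >= window:
            recent = errors[-window:]
            thresh = np.mean(recent) + k * np.std(recent)  # adaptive threshold
            if err > thresh:
                boundaries.append(t)        # high surprise -> new event boundary
                memory.append(latents[t])   # keep the surprising frame in memory
        errors.append(err)
    return boundaries, memory

# Usage: a synthetic stream whose latent distribution shifts at t = 100,
# mimicking an abrupt scene change in a long video.
rng = np.random.default_rng(0)
stream = np.concatenate([
    rng.normal(0.0, 0.1, size=(100, 64)),  # scene A
    rng.normal(3.0, 0.1, size=(100, 64)),  # scene B (shifted latents)
])
boundaries, memory = surprise_driven_segmentation(stream)
print("event boundaries detected at:", boundaries)  # a boundary at the scene change (t = 100)
```

In this framing, low-surprise frames could be compressed or discarded, so memory grows with the number of events rather than with raw video length, which is one way a model could resist brute-force context expansion on arbitrarily long inputs.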