SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

May 26, 2026
Authors: Haosong Peng, Hao Li, Jiaqi Chen, Yuhao Pan, Runmao Yao, Yalun Dai, Fushuo Huo, Fangzhou Hong, Zhaoxi Chen, Haozhao Wang, Dingwen Zhang, Ziwei Liu, Wenchao Xu
cs.AI

Abstract

Spatial foundation models have demonstrated impressive performance on standard datasets, but a critical question remains: are they truly all-round players that generalize robustly across diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and specific hardware constraints? Answering this question requires a holistic assessment, yet current models are mainly evaluated on the domains for which they were designed or trained. Such evaluations are limited by narrow paradigm coverage, few scene domains, and arbitrary frame sampling, which makes it difficult to assess true generalization. To address this gap, we present SpatialBench, a cross-paradigm, domain-diverse benchmark for spatial foundation models with deterministic sampling. SpatialBench features unprecedented scale and a rigorous deterministic design, comprising 19 datasets and 546 scenes across 5 spatial domains. It evaluates 41 models from 6 paradigms on 5 task suites under 4 input density settings. Our evaluation reveals that current models are not yet all-round players and yields several insights for future work. Specifically, we demonstrate that full-context attention maximizes accuracy, while bounded-memory strategies unlock long-sequence scalability. Moreover, our empirical evaluations on challenging embodied and egocentric tasks show that strict domain alignment and high data quality matter far more to performance than simply scaling up the dataset. Finally, to address the largest data gap identified in our analysis, we go beyond evaluation by introducing a large-scale dataset, DA-Next-5M, and a strong baseline model, DA-Next, pushing the boundaries of spatial representation learning.
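
The abstract contrasts arbitrary frame sampling with deterministic sampling but does not say which rule SpatialBench uses. As a minimal sketch of what deterministic sampling at several input densities could look like, the Python snippet below picks evenly spaced frame indices using integer arithmetic only. The function name `sample_frame_indices`, the even-spacing rule, and the view counts in `HYPOTHETICAL_DENSITIES` are illustrative assumptions, not the benchmark's actual protocol.

```python
def sample_frame_indices(num_frames: int, num_views: int) -> list[int]:
    """Return up to num_views evenly spaced frame indices, with no randomness.

    Hypothetical helper: the even-spacing rule is an assumption for
    illustration, not the sampling rule used by SpatialBench.
    """
    if num_frames <= 0 or num_views <= 0:
        raise ValueError("num_frames and num_views must be positive")
    if num_views >= num_frames:
        # Scene is shorter than the requested density: use every frame.
        return list(range(num_frames))
    if num_views == 1:
        return [0]
    # Integer arithmetic keeps results identical across platforms
    # (no floating-point rounding), always including first and last frame.
    return [(i * (num_frames - 1)) // (num_views - 1) for i in range(num_views)]


# Hypothetical density settings (views per scene); the abstract mentions
# four settings but does not state their values.
HYPOTHETICAL_DENSITIES = (2, 4, 8, 16)

# The same scene length always maps to the same indices at each density.
indices_per_density = {
    k: sample_frame_indices(num_frames=100, num_views=k)
    for k in HYPOTHETICAL_DENSITIES
}
```

Because the selection depends only on the scene length and the requested view count, every model evaluated under a given density sees exactly the same frames, which is the property that makes cross-model comparisons repeatable.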