SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

Abstract

While spatial foundation models have demonstrated impressive performance on standard datasets, a critical question remains: are they truly all-round players that generalize robustly across diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and specific hardware constraints? Answering this question requires a holistic assessment, yet current models are mainly evaluated on the particular domains for which they were designed or trained. Such evaluations are constrained by narrow paradigm coverage, limited scene domains, and arbitrary frame sampling, which makes it difficult to assess true generalization capability. To address this gap, we present SpatialBench, a cross-paradigm, domain-diverse benchmark for spatial foundation models with deterministic sampling. SpatialBench combines unprecedented scale with a rigorous deterministic design, comprising 19 datasets and 546 scenes across 5 diverse spatial domains. It evaluates 41 models spanning 6 paradigms on 5 task suites under 4 input density settings. Our evaluation reveals that current models are not yet all-round players and yields insights for future work. Specifically, we demonstrate that full-context attention maximizes accuracy, while bounded-memory strategies unlock long-sequence scalability. Moreover, our evaluations on challenging embodied and egocentric tasks show that strict domain alignment and high data quality matter far more to performance than simple dataset scaling. Finally, to address the largest data gap identified in our analysis, we go beyond evaluation by introducing a large-scale dataset, DA-Next-5M, and a strong baseline model, DA-Next, pushing the boundaries of spatial representation learning.