SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

Abstract

Spatial foundation models have demonstrated impressive performance on standard datasets, but a critical question remains: are they truly all-round players that generalize robustly across diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and specific hardware constraints? Answering this question requires a holistic assessment, yet current models are mainly evaluated on the domains for which they were designed or trained. Such evaluations are constrained by narrow paradigm coverage, few scene domains, and arbitrary frame sampling, which makes it difficult to assess true generalization. To address this gap, we present SpatialBench, a cross-paradigm, domain-diverse benchmark for spatial foundation models with deterministic sampling. SpatialBench combines unprecedented scale with a rigorous deterministic design: it comprises 19 datasets and 546 scenes across 5 spatial domains, and evaluates 41 models from 6 paradigms on 5 task suites under 4 input density settings. Our extensive evaluation shows that current models are not yet all-round players and uncovers crucial insights for future work. Specifically, full-context attention maximizes accuracy, while bounded-memory strategies unlock long-sequence scalability. Moreover, our empirical evaluations on challenging embodied and egocentric tasks show that strict domain alignment and high data quality matter far more to performance than simply scaling dataset size. Finally, to address the largest data gap identified in our analysis, we go beyond evaluation and introduce a large-scale dataset, DA-Next-5M, and a strong baseline model, DA-Next, pushing the boundaries of spatial representation learning.