Spatial Competence Benchmark

Abstract

Spatial competence is the quality of maintaining a consistent internal representation of an environment and using it to infer discrete structure and plan actions under constraints. Prevailing spatial evaluations for large models are limited to probing isolated primitives through 3D transformations or visual question answering. We introduce the Spatial Competence Benchmark (SCBench), spanning three hierarchical capability buckets whose tasks require executable outputs verified by deterministic checkers or simulator-based evaluators. On SCBench, three frontier models exhibit monotonically decreasing accuracy up the capability ladder. Sweeping output-token caps shows that accuracy gains concentrate at low budgets and saturate quickly, and failures are dominated by locally plausible geometry that breaks global constraints. We release the task generators, verifiers, and visualisation tooling.
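
To make the notion of "locally plausible geometry that breaks global constraints" concrete, the following is a minimal sketch of what a deterministic checker could look like. It is not an SCBench task or verifier: the grid-path task, the function names `locally_valid` and `globally_valid`, and the specific constraints (fixed endpoints, grid bounds, no revisited cells) are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical toy task (not from SCBench): a path over grid cells.
Cell = Tuple[int, int]


def locally_valid(path: List[Cell]) -> bool:
    """Local check only: every consecutive pair of cells is 4-adjacent."""
    return all(
        abs(r1 - r2) + abs(c1 - c2) == 1
        for (r1, c1), (r2, c2) in zip(path, path[1:])
    )


def globally_valid(path: List[Cell], rows: int, cols: int,
                   start: Cell, goal: Cell) -> bool:
    """Local adjacency plus assumed global constraints:
    fixed endpoints, all cells inside the grid, no cell visited twice."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    in_bounds = all(0 <= r < rows and 0 <= c < cols for r, c in path)
    no_revisit = len(set(path)) == len(path)
    return in_bounds and no_revisit and locally_valid(path)


# Each step is a legal move, but the path returns to (0, 0) and (0, 1).
looping = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0), (0, 1), (0, 2)]
direct = [(0, 0), (0, 1), (0, 2)]

looping_local = locally_valid(looping)
looping_global = globally_valid(looping, 2, 3, (0, 0), (0, 2))
direct_global = globally_valid(direct, 2, 3, (0, 0), (0, 2))
```

In this toy case the looping path passes the step-by-step check yet fails the global one, which is the kind of failure the abstract describes; because the checker is a pure function of the submitted output, the same output always receives the same verdict.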