Spatial Competence Benchmark

Abstract

Spatial competence is the ability to maintain a consistent internal representation of an environment and to use it to infer discrete structure and plan actions under constraints. Prevailing spatial evaluations of large models are limited to probing isolated primitives through 3D transformations or visual question answering. We introduce the Spatial Competence Benchmark (SCBench), which spans three hierarchical capability buckets; its tasks require executable outputs that are verified by deterministic checkers or simulator-based evaluators. On SCBench, the accuracy of three frontier models decreases monotonically up the capability ladder. Sweeping the output-token cap shows that accuracy gains concentrate at low budgets and saturate quickly, and that failures are dominated by locally plausible geometry that breaks global constraints. We release the task generators, verifiers, and visualisation tooling.