Spatial Competence Benchmark

Abstract

Spatial competence is the ability to maintain a consistent internal representation of an environment and to use it to infer discrete structure and plan actions under constraints. Prevailing spatial evaluations of large models probe only isolated primitives, through 3D transformations or visual question answering. We introduce the Spatial Competence Benchmark (SCBench), which spans three hierarchical capability buckets; its tasks require executable outputs that are verified by deterministic checkers or simulator-based evaluators. On SCBench, the accuracy of three frontier models decreases monotonically up the capability ladder. Sweeping the output-token cap shows that accuracy gains concentrate at low budgets and saturate quickly, and that failures are dominated by locally plausible geometry that violates global constraints. We release the task generators, verifiers, and visualisation tooling.