Spatial Competence Benchmark

Abstract

Spatial competence is the ability to maintain a consistent internal representation of an environment and to use it to infer discrete structure and plan actions under constraints. Existing spatial evaluations of large models mostly probe isolated primitives through 3D transformations or visual question answering. We introduce the Spatial Competence Benchmark (SCBench), which spans three hierarchical capability buckets whose tasks require executable outputs verified by deterministic checkers or simulator-based evaluators. On SCBench, the accuracy of three frontier models decreases monotonically up the capability ladder. Sweeping the output-token cap shows that accuracy gains concentrate at low budgets and saturate quickly, and failures are dominated by locally plausible geometry that violates global constraints. We release the task generators, verifiers, and visualisation tooling.