ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

Abstract

Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16 to 64 frames), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model's actual inputs. To this end, we use professional 3D annotation tools to re-annotate objects and geometry across 381 scenes from 5 datasets, improving data quality, and we regenerate all QA pairs with rigorous bias mitigation and human verification. We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all frames) and fine-grained object visibility metadata, enabling controlled diagnostic analyses. Evaluations of general and domain-specific VLMs on ReVSI reveal systematic failure modes that are obscured by prior benchmarks, yielding a more reliable and diagnostic assessment of spatial intelligence.
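
To make the answerability requirement concrete, the following is a minimal Python sketch of how a frame budget and per-object visibility metadata could be combined to decide whether a QA pair is answerable from the frames a model actually receives. It is an illustration under stated assumptions, not the ReVSI protocol itself: the uniform sampling scheme, the `QAPair` fields, the visibility format (object name mapped to the set of frame indices in which it is visible), and the rule that every referenced object must appear in at least one sampled frame are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


def sample_frame_indices(num_frames: int, budget: Optional[int]) -> List[int]:
    """Pick `budget` evenly spaced frame indices (assumed sampling scheme).

    A budget of None stands for the "all frames" variant.
    """
    if budget is None or budget >= num_frames:
        return list(range(num_frames))
    # Centre of each of `budget` equal-width bins, using integer arithmetic.
    return [((2 * i + 1) * num_frames) // (2 * budget) for i in range(budget)]


@dataclass
class QAPair:
    # Hypothetical schema: field names are illustrative only.
    question: str
    answer: str
    required_objects: List[str]


def is_answerable(
    qa: QAPair,
    visibility: Dict[str, Set[int]],
    sampled: List[int],
) -> bool:
    """Assumed rule: every object the question refers to must be visible
    in at least one of the sampled frames."""
    sampled_set = set(sampled)
    return all(
        visibility.get(obj, set()) & sampled_set for obj in qa.required_objects
    )


if __name__ == "__main__":
    # Toy scene: a 640-frame video where the lamp is visible only briefly.
    visibility = {"sofa": set(range(0, 300)), "lamp": set(range(101, 110))}
    qa = QAPair("Is the lamp left of the sofa?", "yes", ["sofa", "lamp"])
    for budget in (16, 32, 64, None):
        frames = sample_frame_indices(640, budget)
        print(budget, is_answerable(qa, visibility, frames))
```

In this toy scene the lamp falls between the sampled frames at the 16- and 32-frame budgets, so the question would only be answerable at 64 frames or with all frames. This is the kind of input-dependent validity that frame-budget variants and visibility metadata are meant to expose.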