
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

May 27, 2025
Authors: Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang
cs.AI

Abstract

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity's spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs' corresponding spatial comprehension capabilities.
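
To make the egocentric-versus-allocentric distinction concrete, the sketch below shows one simple way an automated 3D annotation step could derive a directional label (front/back/left/right) for a target object relative to either the camera or another person in the scene. This is an illustrative assumption, not the paper's actual pipeline; the function name `directional_label` and the example coordinates are hypothetical.

```python
import numpy as np

def directional_label(observer_pos, observer_facing, target_pos):
    """Classify the target as front/back/left/right in the observer's frame.

    observer_facing is a direction vector in the horizontal (x, y) plane
    giving the direction the observer (camera or person) is looking.
    """
    to_target = np.asarray(target_pos, float)[:2] - np.asarray(observer_pos, float)[:2]
    facing = np.asarray(observer_facing, float)[:2]
    facing = facing / np.linalg.norm(facing)

    # Component of the offset along the facing direction (positive => in front).
    forward = np.dot(to_target, facing)
    # Component along the observer's left-hand perpendicular (positive => to the left).
    left = np.dot(to_target, np.array([-facing[1], facing[0]]))

    if abs(forward) >= abs(left):
        return "front" if forward >= 0 else "back"
    return "left" if left >= 0 else "right"

# The same target gets different labels in the camera's (egocentric) frame and
# in another person's (allocentric) frame when the two observers face each other.
camera_pos, camera_facing = [0.0, 0.0, 1.5], [1.0, 0.0]
person_pos, person_facing = [3.0, 0.0, 1.7], [-1.0, 0.0]
target = [1.5, 2.0, 0.5]

print(directional_label(camera_pos, camera_facing, target))  # -> "left"
print(directional_label(person_pos, person_facing, target))  # -> "right"
```

A benchmark question posed "from the human's viewpoint" requires the model to implicitly perform the second computation rather than the first, which is where the reported accuracy drop appears.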
