SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding

May 22, 2025
Authors: Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie
cs.AI

Abstract

Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding remain underexplored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions: (i) we introduce VGBench, a benchmark specifically designed to assess the visual geometry perception of MLLMs, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from 11 other existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding and supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations that reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.
