Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
April 21, 2025
Authors: Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Rouyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, Yi Ma
cs.AI
Abstract
Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges MLLMs face in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test a model's geometric correspondence and its capacity to align information consistently across views. Our extensive experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs particularly underperform in two respects: (1) cross-view correspondence for partially occluded views and (2) establishing coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe All-Angles Bench offers valuable insights and contributes to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/.
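To make the evaluation setup concrete, below is a minimal sketch of how a multi-view QA benchmark of this kind might be scored per task. The item schema (fields such as "scene", "task", "views", "choices", "answer"), the JSONL layout, and the ask_model callable are illustrative assumptions for this sketch, not the released format of All-Angles Bench.

```python
import json
from pathlib import Path

# Assumed (hypothetical) item schema, one JSON object per line:
# {"scene": "market_03", "task": "relative_direction",
#  "views": ["view_a.jpg", "view_b.jpg"],
#  "question": "...", "choices": ["A", "B", "C"], "answer": "B"}

def evaluate(bench_path: str, ask_model) -> dict:
    """Score a model on multi-view QA items, broken down by task."""
    correct, total = {}, {}
    for line in Path(bench_path).read_text().splitlines():
        item = json.loads(line)
        task = item["task"]
        # The model receives all views together so it can, in principle,
        # establish cross-view correspondence before answering.
        prediction = ask_model(item["views"], item["question"], item["choices"])
        correct[task] = correct.get(task, 0) + (prediction == item["answer"])
        total[task] = total.get(task, 0) + 1
    return {task: correct[task] / total[task] for task in total}
```

A per-task breakdown like this is what would surface the weaknesses the paper reports, e.g. lower accuracy on camera pose estimation and on questions involving partially occluded views.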