MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Abstract

Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and with maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources are available at: https://yunlong10.github.io/MMPerspective/
