MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Abstract

Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective, through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. The benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, including vanishing point perception and counting, perspective type reasoning, understanding of line relationships in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and with maintaining spatial consistency under perturbations. Our analysis further reveals intriguing relationships among model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources are available at: https://yunlong10.github.io/MMPerspective/

