MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Abstract

Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, including vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and with maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources are available at: https://yunlong10.github.io/MMPerspective/