Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Abstract

Multimodal large language models (MLLMs) have made impressive progress on vision-language benchmarks, yet their capacity for visual cognition and visuospatial reasoning remains poorly understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with that of human participants. Humans achieve 80% accuracy, while the top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
