Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Abstract


Multimodal large language models (MLLMs) have made impressive progress on vision-language benchmarks, yet their capacity for visual cognition and visuospatial reasoning remains poorly understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with that of human participants. Humans achieve 80% accuracy, while the top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
