Pixels, Patterns, but No Poetry: To See The World like Humans

Abstract

Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing the reasoning capabilities of MLLMs, a fundamental question persists: can MLLMs truly perceive the world as humans do? This paper shifts the focus from reasoning to perception. Rather than constructing a benchmark specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs fail catastrophically on these perceptual tasks, which are trivial for humans. Both in-context learning and training the language backbone, which were effective on previous benchmarks, fail to improve performance on our tasks, whereas fine-tuning the vision tower enables rapid adaptation. This suggests that our benchmark challenges the generalization of the vision tower rather than the knowledge and reasoning capabilities of the language backbone, and that this is a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.
