Pixels, Patterns, but No Poetry: To See The World like Humans
July 21, 2025
Authors: Hongcheng Gao, Zihao Huang, Lin Xu, Jingyi Tang, Xinhao Li, Yue Liu, Haoyang Li, Taihang Hu, Minhua Lin, Xinlong Yang, Ge Wu, Balong Bi, Hongyu Chen, Wentao Zhang
cs.AI
Abstract
Achieving human-like perception and reasoning in Multimodal Large Language
Models (MLLMs) remains a central challenge in artificial intelligence. While
recent research has primarily focused on enhancing reasoning capabilities in
MLLMs, a fundamental question persists: Can Multimodal Large Language Models
truly perceive the world as humans do? This paper shifts focus from reasoning
to perception. Rather than constructing benchmarks specifically for reasoning,
we introduce the Turing Eye Test (TET), a challenging perception-oriented
benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on
synthetic images that humans process intuitively. Our findings reveal that
state-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that
humans find trivial. Both in-context learning and training on the language
backbone, which are effective for previous benchmarks, fail to improve
performance on our tasks, whereas fine-tuning the vision tower enables rapid
adaptation. This suggests that our benchmark challenges the generalization of
the vision tower rather than the knowledge and reasoning capabilities of the
language backbone, exposing a key gap between current MLLMs and human
perception. We release a representative subset of TET tasks in this version,
and will introduce more diverse tasks and methods to enhance visual
generalization in future work.