

Pixels, Patterns, but No Poetry: To See The World like Humans

July 21, 2025
作者: Hongcheng Gao, Zihao Huang, Lin Xu, Jingyi Tang, Xinhao Li, Yue Liu, Haoyang Li, Taihang Hu, Minhua Lin, Xinlong Yang, Ge Wu, Balong Bi, Hongyu Chen, Wentao Zhang
cs.AI

Abstract

Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing yet another reasoning benchmark, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that are trivial for humans. Both in-context learning and training on the language backbone, which are effective for previous benchmarks, fail to improve performance on our tasks, whereas fine-tuning the vision tower enables rapid adaptation. This suggests that our benchmark challenges the generalization of the vision tower rather than the knowledge and reasoning capabilities of the language backbone, a key gap between current MLLMs and human perception. We release a representative subset of the TET tasks in this version and will introduce more diverse tasks and methods to enhance visual generalization in future work.
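As a concrete illustration of the fine-tuning contrast described in the abstract, the sketch below freezes every parameter of an MLLM except those of its vision tower, so that only the visual encoder is updated during adaptation. This is a minimal, hypothetical example: the `vision_tower` attribute name and the overall setup are assumptions about a typical LLaVA-style architecture, not the authors' released code.

```python
import torch

def freeze_all_but_vision_tower(model: torch.nn.Module):
    """Freeze the language backbone; leave only the vision tower trainable.

    Assumes the model exposes its visual encoder as `model.vision_tower`
    (a hypothetical attribute name used here for illustration).
    """
    for p in model.parameters():
        p.requires_grad = False    # freeze everything, including the language backbone
    for p in model.vision_tower.parameters():
        p.requires_grad = True     # unfreeze only the vision encoder
    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch: optimize only the vision-tower parameters during fine-tuning.
# trainable_params = freeze_all_but_vision_tower(mllm)
# optimizer = torch.optim.AdamW(trainable_params, lr=1e-5)
```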