BLINK: Multimodal Large Language Models Can See but Not Perceive
April 18, 2024
Authors: Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna
cs.AI
Abstract
We introduce Blink, a new benchmark for multimodal language models (LLMs)
that focuses on core visual perception abilities not found in other
evaluations. Most of the Blink tasks can be solved by humans "within a blink"
(e.g., relative depth estimation, visual correspondence, forensics detection,
and multi-view reasoning). However, we find that these perception-demanding tasks
pose significant challenges for current multimodal LLMs because they resist
mediation through natural language. Blink reformats 14 classic computer vision
tasks into 3,807 multiple-choice questions, paired with single or multiple
images and visual prompting. While humans get 95.70% accuracy on average, Blink
is surprisingly challenging for existing multimodal LLMs: even the
best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only
13.17% and 7.63% higher than random guessing, indicating that such perception
abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also
highlights that specialist CV models could solve these problems much better,
suggesting potential pathways for future improvements. We believe Blink will
stimulate the community to help multimodal LLMs catch up with human-level
visual perception.
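
The abstract reports model accuracy as a margin over random guessing (e.g., GPT-4V's 51.26% is 13.17% above chance, implying a random baseline near 38%). Below is a minimal, illustrative sketch of how such a margin could be computed for a multiple-choice benchmark; the option counts and the random-guess predictions are hypothetical and are not the actual BLINK data or evaluation code.

```python
import random

def random_baseline(num_options_per_question):
    """Expected accuracy of uniform random guessing over multiple-choice items."""
    return sum(1.0 / n for n in num_options_per_question) / len(num_options_per_question)

def accuracy(predictions, answers):
    """Fraction of multiple-choice questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Toy example: 3,807 questions with a mix of 2-way and 4-way choices
# (the option-count split here is assumed for illustration only).
random.seed(0)
option_counts = [random.choice([2, 4]) for _ in range(3807)]
answers = [random.randrange(n) for n in option_counts]
guesses = [random.randrange(n) for n in option_counts]  # stand-in for model predictions

baseline = random_baseline(option_counts)
acc = accuracy(guesses, answers)
print(f"random-guess baseline: {baseline:.2%}")
print(f"accuracy: {acc:.2%}")
print(f"margin over baseline: {acc - baseline:+.2%}")
```

Replacing `guesses` with a model's predicted option indices yields the kind of "accuracy minus chance" margin quoted in the abstract.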