BLINK: Multimodal Large Language Models Can See but Not Perceive
April 18, 2024
Authors: Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna
cs.AI
Abstract
We introduce Blink, a new benchmark for multimodal language models (LLMs)
that focuses on core visual perception abilities not found in other
evaluations. Most of the Blink tasks can be solved by humans "within a blink"
(e.g., relative depth estimation, visual correspondence, forensics detection,
and multi-view reasoning). However, we find these perception-demanding tasks
pose significant challenges for current multimodal LLMs because they resist
mediation through natural language. Blink reformats 14 classic computer vision
tasks into 3,807 multiple-choice questions, paired with single or multiple
images and visual prompting. While humans get 95.70% accuracy on average, Blink
is surprisingly challenging for existing multimodal LLMs: even the
best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only
13.17% and 7.63% higher than random guessing, indicating that such perception
abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also
highlights that specialist CV models could solve these problems much better,
suggesting potential pathways for future improvements. We believe Blink will
stimulate the community to help multimodal LLMs catch up with human-level
visual perception.
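The abstract reports accuracy over 3,807 multiple-choice questions, each pairing one or more images with lettered options, and compares model accuracy against random guessing. The sketch below (Python) shows how such a benchmark might be scored; the BlinkItem structure, the example items, and the random_baseline helper are illustrative assumptions, not the authors' released evaluation code.

    # Minimal sketch of multiple-choice scoring for a BLINK-style benchmark.
    # A prediction counts as correct only when it matches the gold option letter,
    # and chance level is estimated by guessing uniformly among each item's options.
    import random
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class BlinkItem:                      # illustrative item layout (assumption)
        question: str
        choices: List[str]                # option letters, e.g. ["(A)", "(B)"]
        answer: str                       # gold letter, e.g. "(A)"
        image_paths: List[str] = field(default_factory=list)  # single or multiple images

    def accuracy(items: List[BlinkItem], predictions: List[str]) -> float:
        """Fraction of items whose predicted letter matches the gold letter."""
        correct = sum(pred.strip() == item.answer
                      for item, pred in zip(items, predictions))
        return correct / len(items)

    def random_baseline(items: List[BlinkItem], seed: int = 0) -> float:
        """Chance-level accuracy: guess uniformly among each item's choices."""
        rng = random.Random(seed)
        return accuracy(items, [rng.choice(item.choices) for item in items])

    if __name__ == "__main__":
        demo = [
            BlinkItem("Which marked point is closer to the camera?",
                      ["(A)", "(B)"], "(A)", ["depth_pair.png"]),
            BlinkItem("Which candidate matches the reference point?",
                      ["(A)", "(B)", "(C)", "(D)"], "(C)", ["view1.png", "view2.png"]),
        ]
        preds = ["(A)", "(B)"]            # stand-in model outputs
        print(f"accuracy: {accuracy(demo, preds):.2%}")
        print(f"chance (seeded): {random_baseline(demo):.2%}")

Reporting the seeded chance level alongside model accuracy mirrors the abstract's framing, where 51.26% and 45.72% are only 13.17% and 7.63% above random guessing.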