How Far Are We from Intelligent Visual Deductive Reasoning?

Abstract

Vision-Language Models (VLMs) such as GPT-4V have recently made remarkable progress on diverse vision-language tasks. We examine vision-based deductive reasoning, a more sophisticated but less explored area, and find previously unexposed blind spots in current state-of-the-art VLMs. Specifically, we use Raven's Progressive Matrices (RPMs) to assess VLMs' ability to perform multi-hop relational and deductive reasoning relying solely on visual clues. We comprehensively evaluate several popular VLMs with standard strategies such as in-context learning, self-consistency, and Chain-of-Thought (CoT) on three diverse datasets, including the Mensa IQ test, IntelligenceTest, and RAVEN. The results reveal that, despite the impressive capabilities of LLMs in text-based reasoning, we are still far from achieving comparable proficiency in visual deductive reasoning. We found that certain standard strategies that are effective for LLMs do not translate seamlessly to the challenges presented by visual reasoning tasks. Moreover, a detailed analysis reveals that VLMs struggle to solve these tasks mainly because they are unable to perceive and comprehend multiple, confounding abstract patterns in RPM examples.
