How Far Are We from Intelligent Visual Deductive Reasoning?

Abstract

Vision-Language Models (VLMs) such as GPT-4V have recently demonstrated incredible strides on diverse vision-language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blind spots in the current SOTA VLMs. Specifically, we leverage Raven's Progressive Matrices (RPMs) to assess VLMs' abilities to perform multi-hop relational and deductive reasoning relying solely on visual clues. We perform comprehensive evaluations of several popular VLMs employing standard strategies such as in-context learning, self-consistency, and Chain-of-Thought (CoT) on three diverse datasets, including the Mensa IQ test, IntelligenceTest, and RAVEN. The results reveal that despite the impressive capabilities of LLMs in text-based reasoning, we are still far from achieving comparable proficiency in visual deductive reasoning. We find that certain standard strategies that are effective when applied to LLMs do not seamlessly translate to the challenges presented by visual reasoning tasks. Moreover, a detailed analysis reveals that VLMs struggle to solve these tasks mainly because they are unable to perceive and comprehend multiple, confounding abstract patterns in RPM examples.
