How Far Are We from Intelligent Visual Deductive Reasoning?
March 7, 2024
Authors: Yizhe Zhang, He Bai, Ruixiang Zhang, Jiatao Gu, Shuangfei Zhai, Josh Susskind, Navdeep Jaitly
cs.AI
Abstract
Vision-Language Models (VLMs) such as GPT-4V have recently demonstrated
incredible strides on diverse vision language tasks. We dig into vision-based
deductive reasoning, a more sophisticated but less explored realm, and find
previously unexposed blind spots in the current SOTA VLMs. Specifically, we
leverage Raven's Progressive Matrices (RPMs) to assess VLMs' abilities to
perform multi-hop relational and deductive reasoning relying solely on visual
clues. We perform comprehensive evaluations of several popular VLMs employing
standard strategies such as in-context learning, self-consistency, and
chain-of-thought (CoT) on three diverse datasets: the Mensa IQ test,
IntelligenceTest, and RAVEN. The results reveal that despite the impressive
capabilities of LLMs in text-based reasoning, we are still far from achieving
comparable proficiency in visual deductive reasoning. We find that certain
standard strategies that are effective when applied to LLMs do not seamlessly
translate to the challenges presented by visual reasoning tasks. Moreover, a
detailed analysis reveals that VLMs struggle to solve these tasks mainly
because they are unable to perceive and comprehend multiple, confounding
abstract patterns in RPM examples.