How Far Are We from Intelligent Visual Deductive Reasoning?

Abstract

Vision-Language Models (VLMs) such as GPT-4V have recently made remarkable progress on diverse vision-language tasks. We examine vision-based deductive reasoning, a more sophisticated but less explored area, and find previously unexposed blind spots in current state-of-the-art VLMs. Specifically, we use Raven's Progressive Matrices (RPMs) to assess VLMs' ability to perform multi-hop relational and deductive reasoning relying solely on visual clues. We comprehensively evaluate several popular VLMs using standard strategies such as in-context learning, self-consistency, and chain-of-thought (CoT) prompting on three diverse datasets: the Mensa IQ test, IntelligenceTest, and RAVEN. The results show that, despite the impressive capabilities of LLMs in text-based reasoning, we are still far from achieving comparable proficiency in visual deductive reasoning. We find that certain standard strategies that are effective for LLMs do not transfer seamlessly to visual reasoning tasks. Moreover, a detailed analysis shows that VLMs struggle with these tasks mainly because they are unable to perceive and comprehend the multiple, confounding abstract patterns in RPM examples.
