Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Abstract

Vision Language Models (VLMs) demonstrate remarkable proficiency on a wide array of visual questions, a task that requires strong perception and reasoning. Assessing these two competencies independently is crucial for model refinement, yet it is difficult because seeing and reasoning are intertwined in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage, in which a VLM extracts visual information and articulates it in textual form, and a reasoning stage, in which a Large Language Model (LLM) formulates a response based on the extracted visual information. This modular design enables systematic comparison and assessment of both proprietary and open-source VLMs for their perception and reasoning strengths. Our analytical framework yields several valuable insights and underscores Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results on general vision-language tasks while substantially reducing training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on par with VLMs 10 times larger on the rigorous multimodal benchmark MMStar. The project is released at: https://github.com/SparksJoe/Prism.
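The two-stage design described in the abstract can be illustrated with a minimal sketch. This is not the code from the Prism repository: the class name `TwoStagePipeline`, the callable signatures, the default instruction, and the prompt template are all assumptions made for illustration, and the two stub functions stand in for a real VLM and a real LLM.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces: any callables with these shapes can be plugged in.
Perceiver = Callable[[bytes, str], str]  # (image bytes, instruction) -> text description
Reasoner = Callable[[str], str]          # text prompt -> answer


@dataclass
class TwoStagePipeline:
    """Illustrative decoupling of perception (VLM) from reasoning (LLM)."""

    perceive: Perceiver
    reason: Reasoner
    instruction: str = "Describe the image in detail."  # assumed default

    def answer(self, image: bytes, question: str) -> str:
        # Stage 1: the VLM turns the image into text and produces no answer.
        description = self.perceive(image, self.instruction)
        # Stage 2: a text-only LLM answers from the description alone.
        prompt = (
            f"Image description:\n{description}\n\n"
            f"Question: {question}\nAnswer:"
        )
        return self.reason(prompt)


# Stand-in models so the sketch runs without any external service.
def stub_vlm(image: bytes, instruction: str) -> str:
    return "A red circle next to two blue squares."


def stub_llm(prompt: str) -> str:
    return "2" if "two blue squares" in prompt else "unknown"


pipeline = TwoStagePipeline(perceive=stub_vlm, reason=stub_llm)
print(pipeline.answer(b"", "How many blue squares are there?"))
```

Because the two stages communicate only through text, either one can be swapped while the other is held fixed: fixing the reasoner and varying the perceiver compares perception, and the reverse compares reasoning.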
