
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

June 20, 2024
作者: Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
cs.AI

Abstract

Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLMs for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting down on training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on par with VLMs 10 times larger on the rigorous multimodal benchmark MMStar. The project is released at: https://github.com/SparksJoe/Prism.
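
The abstract describes a two-stage design in which a VLM first articulates the image content as text, and a text-only LLM then reasons over that description to produce the final answer. The sketch below is a minimal illustration of that decoupling, assuming hypothetical names (`PrismStylePipeline`, `perceive`, `reason`, and the dummy models); these are not the actual interfaces of the released Prism codebase.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases: a "captioner" turns an image path into a textual
# description (perception stage), and a "reasoner" answers a question given
# only that description (reasoning stage).
Captioner = Callable[[str, str], str]   # (image_path, instruction) -> description
Reasoner = Callable[[str, str], str]    # (description, question) -> answer


@dataclass
class PrismStylePipeline:
    """Decoupled pipeline: VLM perception first, then LLM reasoning."""
    perceive: Captioner
    reason: Reasoner

    def answer(self, image_path: str, question: str) -> str:
        # Stage 1 (perception): the VLM extracts and articulates the visual
        # information relevant to the question as free-form text.
        instruction = (
            "Describe the image in detail, covering everything relevant to: "
            + question
        )
        description = self.perceive(image_path, instruction)

        # Stage 2 (reasoning): a text-only LLM formulates the response from
        # the textual description alone, never seeing the raw pixels.
        return self.reason(description, question)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without model weights or API keys.
    def dummy_vlm(image_path: str, instruction: str) -> str:
        return f"A placeholder description of {image_path}."

    def dummy_llm(description: str, question: str) -> str:
        return f"Answer derived from {description!r} for {question!r}."

    pipeline = PrismStylePipeline(perceive=dummy_vlm, reason=dummy_llm)
    print(pipeline.answer("example.jpg", "How many chairs are visible?"))
```

Because the two stages exchange nothing but plain text, either side can be swapped independently; this is what allows the framework to evaluate perception and reasoning separately, and to pair a small perception-focused VLM with a strong reasoning LLM for cost savings.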
