Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
June 20, 2024
Authors: Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
cs.AI
Abstract
Vision Language Models (VLMs) demonstrate remarkable proficiency in
addressing a wide array of visual questions, which requires strong perception
and reasoning faculties. Assessing these two competencies independently is
crucial for model refinement, despite the inherent difficulty due to the
intertwined nature of seeing and reasoning in existing VLMs. To tackle this
issue, we present Prism, an innovative framework designed to disentangle the
perception and reasoning processes involved in visual question solving. Prism
comprises two distinct stages: a perception stage that utilizes a VLM to
extract and articulate visual information in textual form, and a reasoning
stage that formulates responses based on the extracted visual information using
a Large Language Model (LLM). This modular design enables the systematic
comparison and assessment of both proprietary and open-source VLMs for their
perception and reasoning strengths. Our analytical framework provides several
valuable insights, underscoring Prism's potential as a cost-effective solution
for vision-language tasks. By combining a streamlined VLM focused on perception
with a powerful LLM tailored for reasoning, Prism achieves superior results in
general vision-language tasks while substantially cutting down on training and
operational expenses. Quantitative evaluations show that Prism, when configured
with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on
par with VLMs 10 times larger on the rigorous multimodal benchmark MMStar.
The project is released at: https://github.com/SparksJoe/Prism.
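For readers skimming the abstract, the decoupling idea can be summarized in a short Python sketch. The vlm and llm objects and their generate methods below are hypothetical stand-ins, not the API of the released code; the sketch only illustrates that perception (image to question-relevant text) and reasoning (text to answer) run as separate, independently swappable stages.

def caption_image(vlm, image, question):
    """Perception stage: a VLM turns the image into a question-relevant
    textual description; no answering happens here."""
    prompt = (
        "Describe the image in detail, focusing on the information needed "
        f"to answer this question: {question}"
    )
    return vlm.generate(image=image, prompt=prompt)

def answer_from_text(llm, description, question):
    """Reasoning stage: a text-only LLM answers using only the extracted
    description, never the raw image."""
    prompt = (
        f"Image description:\n{description}\n\n"
        f"Question: {question}\nAnswer concisely."
    )
    return llm.generate(prompt=prompt)

def prism_answer(vlm, llm, image, question):
    """Decoupled visual question answering: either stage can be replaced
    (e.g., a small captioning VLM plus a strong LLM) or evaluated on its own."""
    description = caption_image(vlm, image, question)
    return answer_from_text(llm, description, question)

Because the two stages communicate only through text, the same harness can pair, for example, a 2B-parameter VLM for perception with GPT-3.5 for reasoning, which is the cost-effective configuration the abstract reports on MMStar.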