Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Abstract

Vision Language Models (VLMs) demonstrate remarkable proficiency on a wide array of visual questions, a task that requires strong perception and reasoning abilities. Assessing these two competencies independently is crucial for model refinement, yet it is inherently difficult because seeing and reasoning are intertwined in existing VLMs. To tackle this issue, we present Prism, a framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that uses a VLM to extract visual information and articulate it in textual form, and a reasoning stage that uses a Large Language Model (LLM) to formulate responses based on the extracted visual information. This modular design enables the systematic comparison and assessment of both proprietary and open-source VLMs for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and the freely accessible GPT-3.5, delivers performance on par with VLMs 10 times larger on the rigorous multimodal benchmark MMStar. The project is released at: https://github.com/SparksJoe/Prism.
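The two-stage design described in the abstract can be illustrated with a minimal Python sketch. This is not the project's actual code: the class `TwoStagePipeline`, the type aliases, the prompt wording, and the stub models are all hypothetical names introduced here for illustration. The sketch only assumes that the perception stage maps an image and an instruction to text, and that the reasoning stage maps a text prompt to an answer.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces (assumptions, not the Prism repository's API).
Perceiver = Callable[[bytes, str], str]  # (image bytes, instruction) -> textual description
Reasoner = Callable[[str], str]          # text prompt -> answer


@dataclass
class TwoStagePipeline:
    """Decoupled pipeline: a VLM-like perceiver feeds text to an LLM-like reasoner."""

    perceive: Perceiver
    reason: Reasoner
    instruction: str = "Describe the image in detail."  # assumed generic instruction

    def build_prompt(self, description: str, question: str) -> str:
        # The reasoner never sees the image, only the extracted text.
        return (
            f"Image description: {description}\n"
            f"Question: {question}\n"
            "Answer using only the description above."
        )

    def answer(self, image: bytes, question: str) -> str:
        description = self.perceive(image, self.instruction)       # perception stage
        return self.reason(self.build_prompt(description, question))  # reasoning stage


# Stub models so the sketch runs without any external service.
def stub_vlm(image: bytes, instruction: str) -> str:
    return f"an image of {len(image)} bytes showing two red apples"


def stub_llm(prompt: str) -> str:
    return "two" if "two red apples" in prompt else "unknown"


pipeline = TwoStagePipeline(perceive=stub_vlm, reason=stub_llm)
result = pipeline.answer(b"\x00\x01", "How many apples are there?")
```

Under this structure, holding `reason` fixed while swapping `perceive` would compare perception ability across VLMs, and holding `perceive` fixed while swapping `reason` would compare reasoning ability, which is the kind of controlled comparison the modular design is meant to allow.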
