Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Abstract

Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning, and this adoption has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization remain under-explored, making it difficult to understand which factors account for model performance. The lack of objective, consistent evaluations complicates this further. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization from language, and targeted challenge sets that probe properties such as hallucination; these evaluations provide calibrated, fine-grained insight into a VLM's capabilities. Second, we rigorously investigate VLMs along key design axes, including the choice of pretrained visual representations and the tradeoffs of using base versus instruct-tuned language models. We couple our analysis with three resource contributions: (1) a unified framework for evaluating VLMs, (2) optimized, flexible code for VLM training, and (3) checkpoints for all models, including a family of VLMs at the 7-13B scale that strictly outperform InstructBLIP and LLaVa v1.5, the state of the art in open-source VLMs.
