Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
February 12, 2024
Authors: Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh
cs.AI
Abstract
Visually-conditioned language models (VLMs) have seen growing adoption in
applications such as visual dialogue, scene understanding, and robotic task
planning; adoption that has fueled a wealth of new models such as LLaVa,
InstructBLIP, and PaLI-3. Despite the volume of new releases, key design
decisions around image preprocessing, architecture, and optimization are
under-explored, making it challenging to understand what factors account for
model performance - a challenge further complicated by the lack of objective,
consistent evaluations. To address these gaps, we first compile a suite of
standardized evaluations spanning visual question answering, object
localization from language, and targeted challenge sets that probe properties
such as hallucination; evaluations that provide calibrated, fine-grained
insight into a VLM's capabilities. Second, we rigorously investigate VLMs along
key design axes, including pretrained visual representations and quantifying
the tradeoffs of using base vs. instruct-tuned language models, amongst others.
We couple our analysis with three resource contributions: (1) a unified
framework for evaluating VLMs, (2) optimized, flexible code for VLM training,
and (3) checkpoints for all models, including a family of VLMs at the 7-13B
scale that strictly outperform InstructBLIP and LLaVa v1.5, the
state-of-the-art in open-source VLMs.
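
Since the contributions include released training/evaluation code and model checkpoints, a minimal usage sketch may help illustrate the intended workflow of loading a checkpoint and running visual question answering. The package name prismatic, the load() helper, the model identifier "prism-dinosiglip+7b", and the prompt-builder/generate interface below are assumptions about the released codebase, not details confirmed by this abstract.

    # Hypothetical sketch: load a released Prismatic VLM checkpoint and answer a
    # visual question. All names and identifiers here are assumptions, not
    # details taken from the abstract.
    import torch
    from PIL import Image

    from prismatic import load  # assumed entry point exposed by the released code

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Assumed identifier for one of the released 7B-scale checkpoints.
    vlm = load("prism-dinosiglip+7b")
    vlm.to(device, dtype=torch.bfloat16)

    # Any RGB image plus a visual question; the file path is a placeholder.
    image = Image.open("example.jpg").convert("RGB")
    question = "What is going on in this image?"

    # Assumed chat-style prompt construction and image-conditioned generation API.
    prompt_builder = vlm.get_prompt_builder()
    prompt_builder.add_turn(role="human", message=question)
    prompt_text = prompt_builder.get_prompt()

    answer = vlm.generate(image, prompt_text, do_sample=False, max_new_tokens=128)
    print(answer)

Whatever the released API actually looks like, the overall flow should be the same: load a checkpoint, build a chat-style prompt around the user's question, and generate text conditioned on the image.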