Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
February 12, 2024
Authors: Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh
cs.AI
Abstract
Visually-conditioned language models (VLMs) have seen growing adoption in
applications such as visual dialogue, scene understanding, and robotic task
planning; adoption that has fueled a wealth of new models such as LLaVa,
InstructBLIP, and PaLI-3. Despite the volume of new releases, key design
decisions around image preprocessing, architecture, and optimization are
under-explored, making it challenging to understand what factors account for
model performance - a challenge further complicated by the lack of objective,
consistent evaluations. To address these gaps, we first compile a suite of
standardized evaluations spanning visual question answering, object
localization from language, and targeted challenge sets that probe properties
such as hallucination; evaluations that provide calibrated, fine-grained
insight into a VLM's capabilities. Second, we rigorously investigate VLMs along
key design axes, including pretrained visual representations and quantifying
the tradeoffs of using base vs. instruct-tuned language models, amongst others.
We couple our analysis with three resource contributions: (1) a unified
framework for evaluating VLMs, (2) optimized, flexible code for VLM training,
and (3) checkpoints for all models, including a family of VLMs at the 7-13B
scale that strictly outperform InstructBLIP and LLaVa v1.5, the
state-of-the-art in open-source VLMs.
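
To make the design axes above concrete, here is a minimal PyTorch sketch of the generic recipe such models follow: a pretrained visual backbone produces patch features, a small projector maps them into the language model's embedding space, and the (base or instruct-tuned) language model decodes the projected patches together with the text. The `ToyVLM` and `DummyPatchEncoder` classes, their dimensions, and the stand-in backbones are illustrative assumptions, not the released Prismatic training code or its API.

```python
import torch
import torch.nn as nn


class DummyPatchEncoder(nn.Module):
    """Stand-in for a pretrained visual backbone (e.g., the visual representations
    the paper compares): patchify an image with a strided convolution."""

    def __init__(self, patch_size: int = 16, dim: int = 256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        patches = self.patchify(pixel_values)       # (B, dim, H/ps, W/ps)
        return patches.flatten(2).transpose(1, 2)   # (B, num_patches, dim)


class ToyVLM(nn.Module):
    """Projected patch features are prepended to text embeddings and decoded
    jointly by the (base or instruct-tuned) language model."""

    def __init__(self, vision_backbone: nn.Module, language_model: nn.Module,
                 vision_dim: int, lm_dim: int):
        super().__init__()
        self.vision_backbone = vision_backbone
        self.projector = nn.Linear(vision_dim, lm_dim)  # lightweight vision-to-LM adapter
        self.language_model = language_model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.projector(self.vision_backbone(pixel_values))
        return self.language_model(torch.cat([visual_tokens, text_embeds], dim=1))


# Shape check with toy stand-ins; a real VLM swaps in pretrained components.
vlm = ToyVLM(
    vision_backbone=DummyPatchEncoder(dim=256),
    language_model=nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    vision_dim=256,
    lm_dim=512,
)
images = torch.randn(2, 3, 224, 224)      # batch of preprocessed images
prompt_embeds = torch.randn(2, 8, 512)    # already-embedded prompt tokens
print(vlm(images, prompt_embeds).shape)   # torch.Size([2, 204, 512]): 196 patches + 8 text tokens
```

Each design axis studied in the paper corresponds to a swap in this sketch: the image preprocessing feeding `pixel_values`, the choice of `vision_backbone`, and whether `language_model` is a base or instruct-tuned LM.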