Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Abstract

Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning, adoption that has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored, making it challenging to understand what factors account for model performance, a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization from language, and targeted challenge sets that probe properties such as hallucination; these evaluations provide calibrated, fine-grained insight into a VLM's capabilities. Second, we rigorously investigate VLMs along key design axes, including the choice of pretrained visual representations and the tradeoffs of using base vs. instruct-tuned language models, among others. We couple our analysis with three resource contributions: (1) a unified framework for evaluating VLMs, (2) optimized, flexible code for VLM training, and (3) checkpoints for all models, including a family of VLMs at the 7-13B scale that strictly outperform InstructBLIP and LLaVa v1.5, the state-of-the-art in open-source VLMs.
