BRAVE: Broadening the visual encoding of vision-language models

Abstract

Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a broader and more contextualized visual understanding of VLMs.
