Unveiling Encoder-Free Vision-Language Models

Abstract

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features, followed by large language models (LLMs) for vision-language tasks. However, vision encoders impose a strong inductive bias when abstracting visual representations, e.g., in resolution, aspect ratio, and semantic priors, which can limit the flexibility and efficiency of VLMs. Training pure VLMs that accept vision and language inputs seamlessly, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations show that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models and present a simple yet effective training recipe for pure VLMs. Specifically, through thorough experiments we identify two key aspects of training encoder-free VLMs efficiently: (1) bridging vision-language representations inside one unified decoder; (2) enhancing visual recognition capability via extra supervision. With these strategies, we introduce EVE, an encoder-free vision-language model that is efficient in both training and inference. Notably, using only 35M publicly accessible data samples, EVE rivals encoder-based VLMs of similar capacity across multiple vision-language benchmarks. It significantly outperforms its counterpart Fuyu-8B, whose training procedures and training data are undisclosed. We believe that EVE provides a transparent and efficient route toward developing pure decoder-only architectures across modalities. Our code and models are publicly available at https://github.com/baaivision/EVE.
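
To make the encoder-free idea concrete, the following is a minimal sketch of how raw image patches and text tokens might share one decoder input sequence without a vision encoder fixing the resolution or aspect ratio. It is an illustration only and not EVE's actual implementation: the function names `patchify` and `build_sequence`, the zero-padding of edge tiles, the grayscale nested-list image format, and the default patch size are all assumptions made for this example.

```python
from typing import List, Sequence, Union

Patch = List[float]        # one flattened tile of pixel values
Token = Union[int, Patch]  # a text token id or an image patch


def patchify(image: List[List[float]], patch: int) -> List[Patch]:
    """Split an H x W grayscale image into flattened patch x patch tiles.

    Edge tiles are zero-padded (an assumption of this sketch), so any
    resolution or aspect ratio is accepted.
    """
    height, width = len(image), len(image[0])
    patches: List[Patch] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile: Patch = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    inside = r < height and c < width
                    tile.append(image[r][c] if inside else 0.0)
            patches.append(tile)
    return patches


def build_sequence(
    image: List[List[float]], text_ids: Sequence[int], patch: int = 2
) -> List[Token]:
    """Place image patches and text token ids in one decoder input sequence."""
    return patchify(image, patch) + list(text_ids)


# A 3 x 5 image: neither square nor a multiple of the patch size.
image = [[float(r * 5 + c) for c in range(5)] for r in range(3)]
sequence = build_sequence(image, text_ids=[101, 102, 103])
print(len(sequence))  # 6 image patches followed by 3 text tokens
```

In a real model each patch would additionally be projected to the decoder's embedding width; that step is omitted here to keep the sketch free of dependencies.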
