MouSi: Poly-Visual-Expert Vision-Language Models

Abstract

Current large vision-language models (VLMs) often face two challenges: a single visual component with insufficient capabilities, and excessively long sequences of visual tokens. These issues can limit a model's ability to accurately interpret complex visual information and overly long contextual information. Addressing them is crucial for improving the performance and applicability of VLMs. This paper proposes an ensemble-of-experts technique that combines the capabilities of individual visual encoders, including those skilled in image-text matching, OCR, and image segmentation. The technique introduces a fusion network that unifies the processing of outputs from different visual experts while bridging the gap between the image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to reduce the waste of positional encodings caused by lengthy image feature sequences, effectively addressing position overflow and length limitations. For instance, in our implementation, this technique reduces the positional occupancy of models like SAM from 4096 to a more manageable 64, or even down to 1. Experimental results demonstrate that VLMs with multiple experts consistently outperform isolated visual encoders and show a significant performance boost as more experts are integrated. We have open-sourced the training code used in this report. All of these resources can be found on our project website.
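
The reduction in positional occupancy from 4096 to 64 or 1 can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the 4096 image tokens come from a 64 x 64 patch grid and that positions are saved by letting several patches share one position id. The function names and the scheme names ("original", "share_by_row", "share_all") are hypothetical labels chosen for this illustration.

```python
# Illustrative sketch only: assumes a 64 x 64 patch grid (4096 image tokens)
# and hypothetical scheme names; not taken from the MouSi code base.

def image_position_ids(start, grid_h, grid_w, scheme):
    """Assign a position id to every image patch, beginning at `start`."""
    if scheme == "original":
        # Every patch gets its own position id.
        return [start + i for i in range(grid_h * grid_w)]
    if scheme == "share_by_row":
        # All patches in the same row share one position id.
        return [start + row for row in range(grid_h) for _ in range(grid_w)]
    if scheme == "share_all":
        # Every patch shares a single position id.
        return [start] * (grid_h * grid_w)
    raise ValueError(f"unknown scheme: {scheme}")


def positions_occupied(position_ids):
    """Count how many distinct position ids the image tokens consume."""
    return len(set(position_ids))


if __name__ == "__main__":
    for scheme in ("original", "share_by_row", "share_all"):
        ids = image_position_ids(start=10, grid_h=64, grid_w=64, scheme=scheme)
        print(scheme, len(ids), positions_occupied(ids))
```

In each case the number of image tokens stays at 4096. Only the number of distinct position ids changes, which is what leaves more of the LLM's position range available for text.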
