Parrot: Multilingual Visual Instruction Tuning

Abstract

The rapid development of Multimodal Large Language Models (MLLMs) such as GPT-4V marks a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with large language models (LLMs) through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities. As a side effect, the MLLM's inherent ability to respond in multiple languages progressively deteriorates as training proceeds. We empirically find that imbalanced SFT datasets, composed primarily of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is because the vision encoder and the LLM fail to be aligned with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that uses textual guidance to drive visual token alignment at the language level. Parrot conditions the visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance visual token alignment for non-English languages, we compute cross-attention between the initial visual features and the textual embeddings, and feed the result into the MoE router to select the most relevant experts. The selected experts then convert the initial visual tokens into language-specific visual tokens. Moreover, given the current lack of benchmarks for evaluating multilingual capabilities in this field, we collect and release a Massive Multilingual Multimodal Benchmark, named MMMB, which includes 6 languages, 15 categories, and 12,000 questions. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available.
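The routing step described in the abstract (cross-attention between visual features and textual embeddings, a router that picks experts, and experts that transform the visual tokens) can be sketched roughly as follows. This is a minimal pure-Python illustration and not the authors' implementation: the mean-pooled visual query, the linear experts, the top-k selection with renormalized weights, and all names (`cross_attention`, `route`, `language_aware_tokens`, `router_W`, `random_matrix`) are assumptions introduced here for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in W]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def route(guidance, router_W, top_k):
    """Return (expert_index, weight) pairs for the top-k experts.

    Assumption: the router is one linear layer followed by softmax, and the
    weights of the selected experts are renormalized to sum to 1.
    """
    probs = softmax(matvec(router_W, guidance))
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def language_aware_tokens(visual_tokens, text_embeddings, router_W, experts, top_k=2):
    """Convert initial visual tokens into text-conditioned visual tokens."""
    dim = len(visual_tokens[0])
    # Assumption: the visual query is the mean of the visual tokens.
    query = [sum(t[j] for t in visual_tokens) / len(visual_tokens) for j in range(dim)]
    # Text embeddings act as keys and values, so the result carries
    # information about the language of the prompt.
    guidance = cross_attention(query, text_embeddings, text_embeddings)
    selected = route(guidance, router_W, top_k)
    out = []
    for tok in visual_tokens:
        mixed = [0.0] * dim
        for idx, weight in selected:
            # Assumption: each expert is a single linear map (a dim x dim matrix).
            y = matvec(experts[idx], tok)
            mixed = [m + weight * yj for m, yj in zip(mixed, y)]
        out.append(mixed)
    return out, selected


def random_matrix(rng, rows, cols):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
```

To try it, build inputs with `random_matrix(random.Random(0), n, dim)` for the visual tokens, text embeddings, router weights, and each expert, then call `language_aware_tokens(...)`. The function returns one transformed token per input visual token, together with the selected expert indices and their weights. Because the router input depends on the text embeddings, prompts in different languages can select different experts for the same image.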
