

Parrot: Multilingual Visual Instruction Tuning

June 4, 2024
Authors: Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
cs.AI

Abstract

The rapid development of Multimodal Large Language Models (MLLMs) such as GPT-4V marks a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities; as a side effect, MLLMs' inherent ability to respond in multiple languages progressively deteriorates as training proceeds. We empirically find that imbalanced SFT datasets, composed primarily of English-centric image-text pairs, lead to significantly reduced performance in non-English languages, because the vision encoder and the LLM are never aligned with multilingual tokens during SFT. In this paper, we introduce Parrot, a novel method that uses textual guidance to drive visual token alignment at the language level. Parrot conditions the visual tokens on diverse language inputs and uses a Mixture-of-Experts (MoE) module to promote the alignment of multilingual tokens. Specifically, to improve the alignment of non-English visual tokens, we compute cross-attention between the initial visual features and the textual embeddings, and feed the result into the MoE router to select the most relevant experts. The selected experts then convert the initial visual tokens into language-specific visual tokens. Moreover, given the current lack of benchmarks for evaluating multilingual capabilities in this field, we collect and release the Massive Multilingual Multimodal Benchmark (MMMB), which covers 6 languages, 15 categories, and 12,000 questions. Our method not only achieves state-of-the-art performance on the multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available.
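To make the mechanism in the abstract concrete, below is a minimal sketch (not the authors' released code) of language-guided MoE visual token alignment: visual tokens cross-attend to text embeddings, a router derives expert weights from the result, and the weighted experts re-project the visual tokens into language-specific visual tokens. All module names, dimensions, and the number of experts are illustrative assumptions, and the soft gating here simplifies the paper's selection of the most relevant experts.

```python
# Hypothetical sketch of Parrot-style language-guided MoE alignment.
# Dimensions, expert count, and soft gating are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageGuidedMoE(nn.Module):
    def __init__(self, dim: int = 1024, num_experts: int = 6, num_heads: int = 8):
        super().__init__()
        # Cross-attention: visual tokens (queries) attend to text embeddings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Router scores each expert from the pooled cross-attention output.
        self.router = nn.Linear(dim, num_experts)
        # Each expert is a small MLP that re-projects the visual tokens.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, dim) initial visual tokens from the vision encoder
        # text:   (B, Nt, dim) text embeddings of the (multilingual) prompt
        attn_out, _ = self.cross_attn(query=visual, key=text, value=text)
        # Pool over tokens, then compute per-sample routing weights.
        gate = F.softmax(self.router(attn_out.mean(dim=1)), dim=-1)  # (B, E)
        # Run every expert; shape (B, E, Nv, dim).
        expert_out = torch.stack([e(visual) for e in self.experts], dim=1)
        # Weighted mixture yields language-specific visual tokens.
        mixed = (gate[:, :, None, None] * expert_out).sum(dim=1)
        # Residual connection preserves the original visual information.
        return visual + mixed


if __name__ == "__main__":
    moe = LanguageGuidedMoE()
    visual = torch.randn(2, 256, 1024)  # e.g., ViT patch tokens
    text = torch.randn(2, 32, 1024)     # prompt embeddings in any language
    print(moe(visual, text).shape)      # torch.Size([2, 256, 1024])
```

In this reading, the text embeddings carry the language signal, so the same image routed with prompts in different languages activates different experts and yields differently aligned visual tokens.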
