Parrot: Multilingual Visual Instruction Tuning
June 4, 2024
Authors: Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
cs.AI
Abstract
The rapid development of Multimodal Large Language Models (MLLMs) such as GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities; as a result, MLLMs' inherent ability to respond in multiple languages progressively deteriorates as training proceeds. We empirically find that imbalanced SFT datasets, composed primarily of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This stems from the failure to align the vision encoder and the LLM with multilingual tokens during SFT. In this paper, we introduce Parrot, a novel method that uses textual guidance to drive visual token alignment at the language level. Parrot conditions the visual tokens on diverse language inputs and uses a Mixture-of-Experts (MoE) module to promote the alignment of multilingual tokens. Specifically, to enhance the alignment of non-English visual tokens, we compute cross-attention between the initial visual features and the textual embeddings, and feed the result into the MoE router to select the most relevant experts. The selected experts then convert the initial visual tokens into language-specific visual tokens. Moreover, given the current lack of benchmarks for evaluating multilingual capabilities in this field, we collect and release MMMB, a Massive Multilingual Multimodal Benchmark that covers 6 languages, 15 categories, and 12,000 questions. Our method not only achieves state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available.
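To make the alignment mechanism described in the abstract concrete, the sketch below shows one plausible PyTorch wiring of the cross-attention plus MoE routing step. The module name `LanguageGuidedMoE`, the hidden dimension, the mean-pooling used to form the router input, and the soft weighting over all experts (rather than top-k selection) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of language-guided MoE alignment: visual tokens attend to the
# multilingual instruction embeddings, a router picks experts from the attended
# features, and the experts re-project the initial visual tokens.
# All names, shapes, and gating details here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageGuidedMoE(nn.Module):
    """Convert initial visual tokens into language-specific visual tokens."""

    def __init__(self, dim: int = 1024, num_experts: int = 6, num_heads: int = 8):
        super().__init__()
        # Cross-attention: visual tokens (queries) attend to text embeddings (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Router scores each expert from the pooled attended representation.
        self.router = nn.Linear(dim, num_experts)
        # Each expert is a small MLP that re-projects the visual tokens.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, visual_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, Nv, D) from the vision encoder.
        # text_embeds:   (B, Nt, D) embeddings of the (possibly non-English) instruction.
        attended, _ = self.cross_attn(visual_tokens, text_embeds, text_embeds)
        # Pool the attended features into one vector per sample for routing.
        gate_logits = self.router(attended.mean(dim=1))           # (B, E)
        gate = F.softmax(gate_logits, dim=-1)                     # expert weights
        # Each expert transforms the *initial* visual tokens; combine by gate weight.
        expert_outs = torch.stack([e(visual_tokens) for e in self.experts], dim=1)  # (B, E, Nv, D)
        out = (gate[:, :, None, None] * expert_outs).sum(dim=1)   # (B, Nv, D)
        return out  # language-specific visual tokens passed on to the LLM


if __name__ == "__main__":
    moe = LanguageGuidedMoE()
    v = torch.randn(2, 576, 1024)   # e.g. patch tokens from a ViT-style vision encoder
    t = torch.randn(2, 32, 1024)    # instruction embeddings in any language
    print(moe(v, t).shape)          # torch.Size([2, 576, 1024])
```

The soft combination over all experts keeps the sketch simple and differentiable end to end; a sparse top-k router would be a drop-in alternative if only a few experts should fire per language.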