

Parrot: Multilingual Visual Instruction Tuning

June 4, 2024
Authors: Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
cs.AI

Abstract

The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, which causes MLLMs' inherent ability to respond in multiple languages to deteriorate progressively as training proceeds. We empirically find that imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure to align the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that uses textual guidance to drive visual token alignment at the language level. Parrot conditions the visual tokens on diverse language inputs and uses a Mixture-of-Experts (MoE) module to promote the alignment of multilingual tokens. Specifically, to enhance the alignment of non-English visual tokens, we compute cross-attention between the initial visual features and the textual embeddings, and feed the result into the MoE router to select the most relevant experts. The selected experts then convert the initial visual tokens into language-specific visual tokens. Moreover, given the current lack of benchmarks for evaluating multilingual capabilities in this field, we collect and release a Massive Multilingual Multimodal Benchmark, named MMMB, covering 6 languages, 15 categories, and 12,000 questions. Our method not only achieves state-of-the-art performance on the multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available.
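The routing mechanism the abstract describes can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustration, not Parrot's published implementation: visual tokens cross-attend to the instruction's text embeddings, a router scores experts from the pooled result, and the selected experts transform the visual tokens into language-specific ones. All concrete choices here (dimensions, head count, MLP experts, top-k routing, the residual connection, and every identifier) are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageGuidedMoE(nn.Module):
    """Hypothetical sketch of text-guided expert routing over visual tokens."""

    def __init__(self, dim: int = 1024, num_experts: int = 6, top_k: int = 2):
        super().__init__()
        # Visual tokens attend to the text embeddings, so the routing signal
        # reflects the language of the instruction rather than image content.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Router scores each expert from the pooled cross-attention output.
        self.router = nn.Linear(dim, num_experts)
        # One lightweight expert per supported language (MLPs, an assumption).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, visual_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, Nv, dim); text_emb: (B, Nt, dim)
        attn_out, _ = self.cross_attn(visual_tokens, text_emb, text_emb)
        gate = F.softmax(self.router(attn_out.mean(dim=1)), dim=-1)  # (B, num_experts)
        weights, idx = torch.topk(gate, self.top_k)                  # (B, top_k)
        # Blend the selected experts' transformations of the visual tokens.
        out = torch.zeros_like(visual_tokens)
        for b in range(visual_tokens.size(0)):
            for k in range(self.top_k):
                expert = self.experts[int(idx[b, k])]
                out[b] += weights[b, k] * expert(visual_tokens[b])
        # Residual keeps the original visual tokens available downstream.
        return visual_tokens + out


# Example: route 576 visual tokens conditioned on a 32-token instruction.
moe = LanguageGuidedMoE()
vis = torch.randn(2, 576, 1024)
txt = torch.randn(2, 32, 1024)
print(moe(vis, txt).shape)  # torch.Size([2, 576, 1024])
```

Note the key design point the abstract implies: the router is driven by the text side of the cross-attention, so the same image can be mapped to different language-specific visual tokens depending on the language of the query.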
