MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs

August 7, 2025
Authors: Yufei Gao, Jiaying Fei, Nuo Chen, Ruirui Chen, Guohang Yan, Yunshi Lan, Botian Shi
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in low-resource language contexts. Current multilingual enhancement methods are often limited to the text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce "thin descriptions", they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal: native web alt-text for culture, and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experimental results show that after fine-tuning on MELLA, various MLLM backbones achieve a general performance improvement across the eight languages, with models producing "thick descriptions". We verify that the performance gains stem from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.
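
To make the dual-source strategy concrete, below is a minimal sketch of how the two data streams could be merged into a single fine-tuning pool: native web alt-text supplies cultural grounding, while MLLM-generated captions supply linguistic richness. All names, the record schema, and the placeholder strings here are illustrative assumptions; the paper's actual collection pipeline and data format are not specified on this page.

```python
# Minimal sketch of the dual-source data strategy from the abstract.
# Everything here (class names, function names, placeholder strings) is a
# hypothetical illustration, not the paper's released pipeline or schema.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    image_url: str
    text: str
    language: str
    source: str  # "alt_text" (cultural grounding) or "mllm_caption" (linguistic)


def from_native_alt_text(image_url: str, alt_text: str, lang: str) -> Sample:
    # Cultural source: alt-text written by native speakers on local web pages,
    # which tends to carry culture-specific names, entities, and context.
    return Sample(image_url, alt_text.strip(), lang, source="alt_text")


def from_mllm_caption(image_url: str, caption: str, lang: str) -> Sample:
    # Linguistic source: a fluent, detailed caption generated by a strong MLLM
    # in (or translated into) the target language.
    return Sample(image_url, caption.strip(), lang, source="mllm_caption")


def build_dual_source_pool(
    alt_pairs: List[Tuple[str, str]],
    caption_pairs: List[Tuple[str, str]],
    lang: str,
) -> List[Sample]:
    # Merge both sources so fine-tuning sees cultural grounding and rich
    # linguistic supervision for the same language.
    pool = [from_native_alt_text(url, text, lang) for url, text in alt_pairs]
    pool += [from_mllm_caption(url, text, lang) for url, text in caption_pairs]
    return pool


if __name__ == "__main__":
    alt = [("https://example.org/img1.jpg", "<native alt-text in target language>")]
    cap = [("https://example.org/img1.jpg", "<detailed MLLM-generated caption>")]
    for s in build_dual_source_pool(alt, cap, lang="sw"):
        print(s.source, s.language, s.text)
```

Keeping a per-record source tag also makes it straightforward to ablate the two streams separately, in the spirit of the paper's verification that gains come from both cultural knowledge and linguistic capability.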