

MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs

August 7, 2025
作者: Yufei Gao, Jiaying Fei, Nuo Chen, Ruirui Chen, Guohang Yan, Yunshi Lan, Botian Shi
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in low-resource language contexts. Current multilingual enhancement methods are often limited to the text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce "thin descriptions", they neglect multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, we identify two objectives for a truly effective MLLM in low-resource language settings: 1) linguistic capability and 2) cultural groundedness, with special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experimental results show that fine-tuning on MELLA yields general performance improvements across eight languages on various MLLM backbones, with models producing "thick descriptions". We verify that the gains stem from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.
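The dual-source strategy described above can be sketched as follows. This is a minimal, hypothetical illustration of pairing the two data sources per image: native web alt-text (cultural grounding) and an MLLM-generated caption (linguistic richness). All names here (`MellaRecord`, `build_records`, the `captioner` callable) are illustrative assumptions, not MELLA's actual API or pipeline.

```python
# Hypothetical sketch of the dual-source data strategy from the abstract:
# each image is paired with (1) native web alt-text for cultural grounding
# and (2) an MLLM-generated caption for linguistic capability.
from dataclasses import dataclass

@dataclass
class MellaRecord:
    image_url: str
    language: str       # e.g. an ISO 639-1 code for a low-resource language
    alt_text: str       # cultural source: alt-text crawled from native web pages
    mllm_caption: str   # linguistic source: caption generated by an MLLM

def build_records(crawled, captioner):
    """Pair each crawled (url, language, alt_text) triple with a caption.

    `captioner` stands in for an MLLM captioning call that takes an image
    URL and a target language and returns a caption in that language.
    """
    records = []
    for url, lang, alt in crawled:
        caption = captioner(url, lang)  # assumed MLLM captioning call
        records.append(MellaRecord(url, lang, alt, caption))
    return records
```

A record built this way carries both training signals: fine-tuning on the alt-text side teaches culture-specific vocabulary and referents, while the caption side teaches fluent, informative description in the target language.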
PDF · August 11, 2025