M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
June 7, 2023
Authors: Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, Qi Liu
cs.AI
Abstract
Instruction tuning has significantly advanced large language models (LLMs)
such as ChatGPT, enabling them to align with human instructions across diverse
tasks. However, progress in open vision-language models (VLMs) has been limited
due to the scarcity of high-quality instruction datasets. To tackle this
challenge and promote research in the vision-language field, we introduce the
Multi-Modal, Multilingual Instruction Tuning (M^3IT) dataset, designed to
optimize VLM alignment with human instructions. Our M^3IT dataset comprises
40 carefully curated datasets, including 2.4 million instances and 400 manually
written task instructions, reformatted into a vision-to-text structure. Key
tasks are translated into 80 languages with an advanced translation system,
ensuring broader accessibility. M^3IT surpasses previous datasets in task
coverage, number of instructions, and instance scale. Moreover, we develop
Ying-VLM, a VLM trained on our M^3IT dataset, showcasing its potential
to answer complex questions requiring world knowledge, generalize to unseen
video tasks, and comprehend unseen instructions in Chinese. To encourage
further research, we have open-sourced both the dataset and trained models.
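
For readers who want a concrete picture of the "vision-to-text structure" the abstract describes, below is a minimal Python sketch that loads one reformatted instance with the Hugging Face datasets library. The repository id "MMInstruction/M3IT", the config name "coco", and the field names are assumptions for illustration only; the abstract states the dataset is open-sourced but does not specify where it is hosted or how records are keyed.

```python
# Minimal sketch, assuming the open-sourced M^3IT data is on the Hugging
# Face Hub. The repo id, config name, and field names below are
# assumptions, not confirmed by the abstract.
from datasets import load_dataset

# Each config is assumed to correspond to one of the 40 curated datasets.
ds = load_dataset("MMInstruction/M3IT", "coco", split="train")

example = ds[0]
# A reformatted instance pairs a manually written task instruction with
# visual input and a textual target, following the vision-to-text
# structure described above. Field names are illustrative.
print(example.get("instruction"))  # natural-language task instruction
print(example.get("inputs"))       # task-specific textual input, if any
print(example.get("outputs"))      # target text the VLM should produce
```

Under this layout, multilingual coverage would simply be additional configs or fields carrying the translated instructions, so the same loading code applies across the 80 translated languages.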