
M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

June 7, 2023
作者: Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, Qi Liu
cs.AI

Abstract

Instruction tuning has significantly advanced large language models (LLMs) such as ChatGPT, enabling them to align with human instructions across diverse tasks. However, progress in open vision-language models (VLMs) has been limited by the scarcity of high-quality instruction datasets. To tackle this challenge and promote research in the vision-language field, we introduce the Multi-Modal, Multilingual Instruction Tuning (M^3IT) dataset, designed to optimize VLM alignment with human instructions. Our M^3IT dataset comprises 40 carefully curated datasets, including 2.4 million instances and 400 manually written task instructions, reformatted into a vision-to-text structure. Key tasks are translated into 80 languages with an advanced translation system, ensuring broader accessibility. M^3IT surpasses previous datasets in task coverage, number of instructions, and instance scale. Moreover, we develop Ying-VLM, a VLM trained on our M^3IT dataset, showcasing its potential to answer complex questions requiring world knowledge, generalize to unseen video tasks, and comprehend unseen instructions in Chinese. To encourage further research, we have open-sourced both the dataset and the trained models.
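The abstract describes reformatting 40 source datasets into a uniform vision-to-text instruction structure (a task instruction, an optional text input, a visual input, and a target response). The sketch below illustrates what rendering one such record into a training prompt/target pair might look like; the field names and the `<image>` placeholder token are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical sketch of a vision-to-text instruction-tuning record.
# Field names ("instruction", "inputs", "image_base64_str", "targets")
# are assumptions, not necessarily the dataset's real schema.

def format_example(record):
    """Render one record into a (prompt, target) training pair."""
    prompt = (
        f"Instruction: {record['instruction']}\n"
        f"Input: {record['inputs']}\n"
        "Image: <image>\n"  # placeholder for the encoded visual features
        "Response:"
    )
    return prompt, record["targets"]

example = {
    "instruction": "Answer the question based on the image.",
    "inputs": "What sport is being played?",
    "image_base64_str": "...",  # image bytes would go here
    "targets": "Baseball.",
}

prompt, target = format_example(example)
print(prompt)
print(target)
```

In this layout the task instruction is kept separate from the instance-specific input, which is what lets 400 manually written instructions be reused across 2.4 million instances.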