M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

Abstract

Instruction tuning has significantly advanced large language models (LLMs) such as ChatGPT, enabling them to align with human instructions across diverse tasks. However, progress in open vision-language models (VLMs) has been limited by the scarcity of high-quality instruction datasets. To tackle this challenge and promote research in the vision-language field, we introduce the Multi-Modal, Multilingual Instruction Tuning (M^3IT) dataset, designed to optimize VLM alignment with human instructions. M^3IT comprises 40 carefully curated datasets, with 2.4 million instances and 400 manually written task instructions, reformatted into a vision-to-text structure. Key tasks are translated into 80 languages with an advanced translation system, ensuring broader accessibility. M^3IT surpasses previous datasets in task coverage, number of instructions, and instance scale. Moreover, we develop Ying-VLM, a VLM trained on M^3IT, and show its potential to answer complex questions requiring world knowledge, generalize to unseen video tasks, and comprehend unseen instructions in Chinese. To encourage further research, we have open-sourced both the dataset and the trained models.
