MIMIC-IT: Multi-Modal In-Context Instruction Tuning

Abstract

High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and creative instruction-response pairs is imperative for tuning vision-language models (VLMs). Nevertheless, the current availability of vision-language instruction-response pairs remains limited in quantity, diversity, and creativity, which poses challenges to the generalization of interactive VLMs. Here we present MultI-Modal In-Context Instruction Tuning (MIMIC-IT), a dataset comprising 2.8 million multi-modal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Each pair is accompanied by multi-modal in-context information, forming conversational contexts aimed at empowering VLMs in perception, reasoning, and planning. The instruction-response collection process, dubbed Syphus, is scaled using an automatic annotation pipeline that combines human expertise with GPT's capabilities. Using the MIMIC-IT dataset, we train a large VLM named Otter. Extensive evaluations on vision-language benchmarks show that Otter demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning. Human evaluation indicates that the model aligns effectively with users' intentions. We release the MIMIC-IT dataset, the instruction-response collection pipeline, benchmarks, and the Otter model.
