

MIMIC-IT: Multi-Modal In-Context Instruction Tuning

June 8, 2023
Authors: Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu
cs.AI

Abstract

High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and creative instruction-response pairs is imperative for tuning vision-language models (VLMs). Nevertheless, the current availability of vision-language instruction-response pairs remains limited in quantity, diversity, and creativity, posing challenges to the generalization of interactive VLMs. Here we present MultI-Modal In-Context Instruction Tuning (MIMIC-IT), a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Each pair is accompanied by multi-modal in-context information, forming conversational contexts aimed at empowering VLMs in perception, reasoning, and planning. The instruction-response collection process, dubbed Syphus, is scaled using an automatic annotation pipeline that combines human expertise with GPT's capabilities. Using the MIMIC-IT dataset, we train a large VLM named Otter. Extensive evaluations on vision-language benchmarks show that Otter demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning. Human evaluation reveals that it aligns effectively with the user's intentions. We release the MIMIC-IT dataset, the instruction-response collection pipeline, benchmarks, and the Otter model.
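To make the described data structure concrete, below is a minimal Python sketch of what one multimodal instruction-response record with in-context links might look like. This is a hypothetical illustration, not the released MIMIC-IT schema: the field names (pair_id, instruction, response, image_ids, in_context_ids) are assumptions introduced here for clarity.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class InstructionResponsePair:
        """One multimodal instruction-response record (hypothetical schema).

        Field names are illustrative assumptions, not the released
        MIMIC-IT format.
        """
        pair_id: str
        instruction: str   # e.g. a question or task about the visual input
        response: str      # target answer used to tune the VLM
        # Images or video frames this pair refers to.
        image_ids: List[str] = field(default_factory=list)
        # Related pairs whose instruction/response/images supply the
        # multi-modal in-context (conversational) context.
        in_context_ids: List[str] = field(default_factory=list)

    # A toy record: the in-context ids point at other pairs that together
    # form the conversational context described in the abstract.
    example = InstructionResponsePair(
        pair_id="pair_000123",
        instruction="What should the person do next to cross the street safely?",
        response="Wait for the pedestrian signal, then check both directions before crossing.",
        image_ids=["frame_0457", "frame_0458"],
        in_context_ids=["pair_000121", "pair_000122"],
    )

Under this sketch, a training example for in-context instruction tuning would be assembled by resolving in_context_ids to their records and prepending those instruction-response turns (with their images) to the current pair.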