

MIMIC-IT: Multi-Modal In-Context Instruction Tuning

June 8, 2023
作者: Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu
cs.AI

Abstract

High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and creative instruction-response pairs is imperative for tuning vision-language models (VLMs). However, the vision-language instruction-response pairs currently available remain limited in quantity, diversity, and creativity, which hampers the generalization of interactive VLMs. Here we present MultI-Modal In-Context Instruction Tuning (MIMIC-IT), a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Each pair is accompanied by multi-modal in-context information, forming conversational contexts aimed at empowering VLMs in perception, reasoning, and planning. The instruction-response collection process, dubbed Syphus, is scaled using an automatic annotation pipeline that combines human expertise with GPT's capabilities. Using the MIMIC-IT dataset, we train a large VLM named Otter. Extensive evaluations on vision-language benchmarks show that Otter demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning. Human evaluation further shows that it aligns effectively with user intentions. We release the MIMIC-IT dataset, the instruction-response collection pipeline, benchmarks, and the Otter model.
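The abstract does not specify the released file format, but its description of each training example, a target instruction-response pair over images or video frames bundled with prior multimodal turns as conversational context, maps naturally onto a record schema like the minimal Python sketch below. The class names (`MimicItRecord`, `InContextExample`), field names, and sample content are illustrative assumptions, not the dataset's actual layout.

```python
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class InContextExample:
    """A prior (instruction, response) turn plus its visual inputs,
    supplied as conversational context for the target pair."""
    images: List[str]          # paths or URLs to images / video frames
    instruction: str
    response: str

@dataclass
class MimicItRecord:
    """One hypothetical MIMIC-IT record: a target instruction-response
    pair accompanied by multimodal in-context turns."""
    record_id: str
    images: List[str]
    instruction: str
    response: str
    in_context: List[InContextExample] = field(default_factory=list)

# A toy record illustrating the structure (all content is invented).
record = MimicItRecord(
    record_id="demo_0001",
    images=["frames/kitchen_01.jpg"],
    instruction="What should the person do next to finish making tea?",
    response="Pour the boiling water into the cup with the tea bag.",
    in_context=[
        InContextExample(
            images=["frames/kitchen_00.jpg"],
            instruction="What is the person doing?",
            response="Filling a kettle with water at the sink.",
        )
    ],
)

# Serialize the nested dataclasses to JSON for inspection.
print(json.dumps(record, default=lambda o: o.__dict__, indent=2))
```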
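The abstract likewise gives no detail on Syphus beyond "an automatic annotation pipeline that combines human expertise with GPT's capabilities". As a rough sketch of how such a pipeline could work, the snippet below pairs a human-written system prompt with per-image annotations and asks a chat model to emit an instruction-response pair. It assumes the OpenAI Python client (v1+); the model name, prompt, and output parsing are placeholders, not the paper's actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Human expertise enters through a hand-written system prompt
# (and, in a fuller pipeline, curated in-context demonstrations).
SYSTEM_PROMPT = (
    "Given the user's image annotations, write one instruction and one "
    "grounded response. Reply exactly as:\n"
    "Instruction: ...\nResponse: ..."
)

def generate_pair(image_annotation: str) -> dict:
    """Ask the LLM for one instruction-response pair for an annotated image."""
    completion = client.chat.completions.create(
        model="gpt-4",  # placeholder; the paper's model choice is not stated here
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": image_annotation},
        ],
    )
    text = completion.choices[0].message.content
    # Naive parse of the "Instruction: ... Response: ..." template.
    instruction, _, response = text.partition("Response:")
    return {
        "instruction": instruction.replace("Instruction:", "").strip(),
        "response": response.strip(),
    }

print(generate_pair("A person fills a kettle with water at a kitchen sink."))
```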