

MIMIC-IT: Multi-Modal In-Context Instruction Tuning

June 8, 2023
作者: Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu
cs.AI

Abstract

High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and creative instruction-response pairs is imperative for tuning vision-language models (VLMs). However, the vision-language instruction-response pairs currently available remain limited in quantity, diversity, and creativity, which hampers the generalization of interactive VLMs. Here we present MultI-Modal In-Context Instruction Tuning (MIMIC-IT), a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Each pair is accompanied by multi-modal in-context information, forming conversational contexts aimed at empowering VLMs in perception, reasoning, and planning. The instruction-response collection process, dubbed Syphus, is scaled using an automatic annotation pipeline that combines human expertise with GPT's capabilities. Using the MIMIC-IT dataset, we train a large VLM named Otter. Extensive evaluations on vision-language benchmarks show that Otter demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning. Human evaluation further shows that it aligns effectively with user intentions. We release the MIMIC-IT dataset, the instruction-response collection pipeline, benchmarks, and the Otter model.
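The abstract does not specify the released file format, but its description of each training example, a target instruction-response pair over images or video frames bundled with prior multimodal turns as conversational context, maps naturally onto a record schema like the minimal Python sketch below. The class names (`MimicItRecord`, `InContextExample`), field names, and sample content are illustrative assumptions, not the dataset's actual layout.

```python
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class InContextExample:
    """A prior (instruction, response) turn plus its visual inputs,
    supplied as conversational context for the target pair."""
    images: List[str]          # paths or URLs to images / video frames
    instruction: str
    response: str

@dataclass
class MimicItRecord:
    """One hypothetical MIMIC-IT record: a target instruction-response
    pair accompanied by multimodal in-context turns."""
    record_id: str
    images: List[str]
    instruction: str
    response: str
    in_context: List[InContextExample] = field(default_factory=list)

# A toy record illustrating the structure (all content is invented).
record = MimicItRecord(
    record_id="demo_0001",
    images=["frames/kitchen_01.jpg"],
    instruction="What should the person do next to finish making tea?",
    response="Pour the boiling water into the cup with the tea bag.",
    in_context=[
        InContextExample(
            images=["frames/kitchen_00.jpg"],
            instruction="What is the person doing?",
            response="Filling a kettle with water at the sink.",
        )
    ],
)

# Serialize the nested dataclasses to JSON for inspection.
print(json.dumps(record, default=lambda o: o.__dict__, indent=2))
```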
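The abstract likewise gives no detail on Syphus beyond "an automatic annotation pipeline that combines human expertise with GPT's capabilities". As a rough sketch of how such a pipeline could work, the snippet below pairs a human-written system prompt with per-image annotations and asks a chat model to emit an instruction-response pair. It assumes the OpenAI Python client (v1+); the model name, prompt, and output parsing are placeholders, not the paper's actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Human expertise enters through a hand-written system prompt
# (and, in a fuller pipeline, curated in-context demonstrations).
SYSTEM_PROMPT = (
    "Given the user's image annotations, write one instruction and one "
    "grounded response. Reply exactly as:\n"
    "Instruction: ...\nResponse: ..."
)

def generate_pair(image_annotation: str) -> dict:
    """Ask the LLM for one instruction-response pair for an annotated image."""
    completion = client.chat.completions.create(
        model="gpt-4",  # placeholder; the paper's model choice is not stated here
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": image_annotation},
        ],
    )
    text = completion.choices[0].message.content
    # Naive parse of the "Instruction: ... Response: ..." template.
    instruction, _, response = text.partition("Response:")
    return {
        "instruction": instruction.replace("Instruction:", "").strip(),
        "response": response.strip(),
    }

print(generate_pair("A person fills a kettle with water at a kitchen sink."))
```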