

SVIT: Scaling up Visual Instruction Tuning

July 9, 2023
Authors: Bo Zhao, Boya Wu, Tiejun Huang
cs.AI

Abstract

Thanks to the emergence of foundation models, large language and vision models are integrated to acquire multimodal abilities such as visual captioning, dialogue, and question answering. Although existing multimodal models exhibit impressive visual understanding and reasoning, their limits remain largely under-explored due to the scarcity of high-quality instruction tuning data. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual instruction tuning examples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions. Beyond its volume, the proposed dataset also features high quality and rich diversity, as it is generated by prompting GPT-4 with the abundant manual annotations of images. We empirically verify that training multimodal models on SVIT significantly improves multimodal performance in terms of visual perception, reasoning, and planning.
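
The data-generation recipe described in the abstract (prompting GPT-4 with an image's existing manual annotations to produce instruction-tuning QA pairs) can be illustrated with a minimal sketch. The annotation fields, prompt wording, helper names (`build_prompt`, `generate_qa`), and the use of the `openai` Python client are assumptions for illustration, not the authors' released pipeline.

```python
# Minimal sketch: turn an image's manual annotations (captions + object boxes)
# into conversation-style QA pairs by prompting GPT-4 with text only.
# Illustrative assumptions: annotation schema, prompt wording, JSON output format.
import json
from openai import OpenAI  # requires openai>=1.0 and OPENAI_API_KEY in the environment

client = OpenAI()

def build_prompt(captions, objects):
    """Serialize the manual annotations into a text-only prompt."""
    lines = ["Image captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (name, bounding box [x, y, w, h]):")
    lines += [f"- {o['name']}: {o['bbox']}" for o in objects]
    lines.append(
        "Based only on the annotations above, write 3 question-answer pairs "
        "forming a natural conversation about the image. Return a JSON list: "
        '[{"question": "...", "answer": "..."}]'
    )
    return "\n".join(lines)

def generate_qa(captions, objects):
    """Ask GPT-4 for QA pairs grounded in the annotations."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(captions, objects)}],
        temperature=0.7,
    )
    # A production pipeline would validate/repair the output; here we assume valid JSON.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    captions = ["A man rides a bicycle down a city street."]
    objects = [
        {"name": "person", "bbox": [120, 40, 80, 200]},
        {"name": "bicycle", "bbox": [110, 150, 120, 110]},
    ]
    print(generate_qa(captions, objects))
```

The sketch covers only single-image conversation QA; the SVIT dataset additionally contains complex reasoning QA pairs and detailed image descriptions, generated at a scale of millions of examples.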