

SVIT: Scaling up Visual Instruction Tuning

July 9, 2023
Authors: Bo Zhao, Boya Wu, Tiejun Huang
cs.AI

Abstract

Thanks to the emergence of foundation models, large language and vision models have been integrated to acquire multimodal abilities such as visual captioning, dialogue, and question answering. Although existing multimodal models show impressive visual understanding and reasoning performance, their limits remain largely under-explored due to the scarcity of high-quality instruction tuning data. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual instruction tuning examples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions. Beyond its volume, the proposed dataset is also characterized by high quality and rich diversity, as it is generated by prompting GPT-4 with the abundant manual annotations of images. We empirically verify that training multimodal models on SVIT significantly improves multimodal performance in terms of visual perception, reasoning, and planning.
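
The abstract states only that the instruction data are generated by prompting GPT-4 with manual image annotations. The sketch below illustrates that general idea; the prompt wording, annotation format, function names, and use of the OpenAI Python SDK are assumptions for illustration, not the authors' actual pipeline.

```python
# Minimal sketch (not the SVIT pipeline): generate conversation-style QA pairs
# by prompting GPT-4 with manual image annotations, as described in the abstract.
# The prompt text, annotation format, and helper names here are hypothetical.
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_conversation_qa(captions: list[str], object_boxes: list[str]) -> str:
    """Ask GPT-4 to write multi-turn QA about an image, given only its text annotations."""
    annotation_text = (
        "Captions:\n" + "\n".join(captions)
        + "\nObjects (name, bounding box):\n" + "\n".join(object_boxes)
    )
    prompt = (
        "You are shown manual annotations of an image (you cannot see the image itself).\n"
        f"{annotation_text}\n"
        "Write a natural multi-turn conversation of question-answer pairs about the image, "
        "asking only about content that the annotations support."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    qa = generate_conversation_qa(
        captions=["A man rides a bicycle down a rainy street."],
        object_boxes=["person [120, 40, 260, 300]", "bicycle [110, 180, 280, 360]"],
    )
    print(qa)
```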