
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

February 18, 2024
Authors: Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, Lifu Huang
cs.AI

Abstract

Despite vision-language models' (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist in existing VLM frameworks: (1) a lack of task diversity in pretraining and visual instruction tuning, and (2) annotation errors and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, with each task accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework in which VLMs are first finetuned on Vision-Flan and then further tuned on GPT-4 synthesized data. We find that this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning, and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs' capabilities but rather modulates the model's responses toward human-preferred formats; (2) a minimal quantity (e.g., 1,000 examples) of GPT-4 synthesized data can effectively align VLM responses with human preferences; (3) visual instruction tuning mainly helps large language models (LLMs) understand visual features.
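
As a reading aid, below is a minimal Python sketch of the two-stage tuning recipe the abstract describes: the model is first finetuned on the full human-labeled Vision-Flan collection, then further tuned on a small GPT-4 synthesized set. The `Stage` dataclass, the `train_step` callback, the dataset variables, and all hyperparameter values are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-stage visual instruction tuning pipeline from the
# abstract. Hyperparameters, dataset objects, and the train_step callback are
# illustrative assumptions, not the paper's actual code.

from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class Stage:
    name: str
    dataset: Iterable[dict]   # examples of the form {"instruction", "image", "answer"}
    epochs: int
    learning_rate: float


def finetune(model: Any, stage: Stage,
             train_step: Callable[[Any, dict, float], None]) -> None:
    """Run a single finetuning stage over its dataset."""
    for _ in range(stage.epochs):
        for example in stage.dataset:
            train_step(model, example, stage.learning_rate)


def two_stage_tuning(model: Any,
                     vision_flan: Iterable[dict],
                     gpt4_synth: Iterable[dict],
                     train_step: Callable[[Any, dict, float], None]) -> Any:
    # Stage 1: large-scale, human-labeled Vision-Flan tasks build up diverse
    # visual capabilities (187 tasks, ~1.66M instances per the abstract).
    finetune(model, Stage("vision-flan", vision_flan,
                          epochs=1, learning_rate=2e-5), train_step)

    # Stage 2: a small amount of GPT-4 synthesized data (the paper reports that
    # ~1,000 examples suffice) aligns responses with human-preferred formats.
    finetune(model, Stage("gpt4-synthesized", gpt4_synth,
                          epochs=1, learning_rate=1e-5), train_step)
    return model
```

Keeping the two stages separate mirrors the paper's finding that the GPT-4 synthesized data mainly modulates response format rather than adding new capability, so it is applied last and in small quantity.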
