
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

February 18, 2024
Authors: Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, Lifu Huang
cs.AI

Abstract

Despite vision-language models' (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist in existing VLM frameworks: (1) a lack of task diversity in pretraining and visual instruction tuning, and (2) annotation errors and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, with each task accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework in which VLMs are first finetuned on Vision-Flan and then further tuned on GPT-4 synthesized data. We find that this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning, and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs' capabilities but rather modulates the model's responses toward human-preferred formats; (2) a minimal quantity (e.g., 1,000 examples) of GPT-4 synthesized data can effectively align VLM responses with human preferences; (3) visual instruction tuning mainly helps large language models (LLMs) understand visual features.
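
As a reading aid, below is a minimal Python sketch of the two-stage tuning recipe the abstract describes: the model is first finetuned on the full human-labeled Vision-Flan collection, then further tuned on a small GPT-4 synthesized set. The `Stage` dataclass, the `train_step` callback, the dataset variables, and all hyperparameter values are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-stage visual instruction tuning pipeline from the
# abstract. Hyperparameters, dataset objects, and the train_step callback are
# illustrative assumptions, not the paper's actual code.

from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class Stage:
    name: str
    dataset: Iterable[dict]   # examples of the form {"instruction", "image", "answer"}
    epochs: int
    learning_rate: float


def finetune(model: Any, stage: Stage,
             train_step: Callable[[Any, dict, float], None]) -> None:
    """Run a single finetuning stage over its dataset."""
    for _ in range(stage.epochs):
        for example in stage.dataset:
            train_step(model, example, stage.learning_rate)


def two_stage_tuning(model: Any,
                     vision_flan: Iterable[dict],
                     gpt4_synth: Iterable[dict],
                     train_step: Callable[[Any, dict, float], None]) -> Any:
    # Stage 1: large-scale, human-labeled Vision-Flan tasks build up diverse
    # visual capabilities (187 tasks, ~1.66M instances per the abstract).
    finetune(model, Stage("vision-flan", vision_flan,
                          epochs=1, learning_rate=2e-5), train_step)

    # Stage 2: a small amount of GPT-4 synthesized data (the paper reports that
    # ~1,000 examples suffice) aligns responses with human-preferred formats.
    finetune(model, Stage("gpt4-synthesized", gpt4_synth,
                          epochs=1, learning_rate=1e-5), train_step)
    return model
```

Keeping the two stages separate mirrors the paper's finding that the GPT-4 synthesized data mainly modulates response format rather than adding new capability, so it is applied last and in small quantity.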
