Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
February 18, 2024
Authors: Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, Lifu Huang
cs.AI
Abstract
Despite vision-language models' (VLMs) remarkable capabilities as versatile
visual assistants, two substantial challenges persist within the existing VLM
frameworks: (1) lacking task diversity in pretraining and visual instruction
tuning, and (2) annotation error and bias in GPT-4 synthesized instruction
tuning data. Both challenges lead to issues such as poor generalizability,
hallucination, and catastrophic forgetting. To address these challenges, we
construct Vision-Flan, the most diverse publicly available visual instruction
tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances
sourced from academic datasets, and each task is accompanied by an
expert-written instruction. In addition, we propose a two-stage instruction
tuning framework, in which VLMs are first finetuned on Vision-Flan and
further tuned on GPT-4 synthesized data. We find this two-stage tuning
framework significantly outperforms the traditional single-stage visual
instruction tuning framework and achieves state-of-the-art performance
across a wide range of multi-modal evaluation benchmarks. Finally, we conduct
in-depth analyses to understand visual instruction tuning and our findings
reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs'
capabilities but rather modulates the model's responses to human-preferred
formats; (2) A minimal quantity (e.g., 1,000) of GPT-4 synthesized data can
effectively align VLM responses with human preferences; (3) Visual instruction
tuning mainly helps large language models (LLMs) understand visual features.
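To make the two-stage recipe concrete, below is a minimal sketch in plain PyTorch. The model, datasets, data loaders, and hyperparameters (vlm, vision_flan_dataset, gpt4_synth_dataset, learning rate) are hypothetical placeholders, not the authors' released implementation; the sketch only illustrates the ordering the abstract describes: large-scale finetuning on human-labeled Vision-Flan first, then a brief pass over a small amount of GPT-4 synthesized data to align response format with human preferences.

```python
# Minimal sketch of the two-stage visual instruction tuning described above.
# Names marked "hypothetical" are placeholders, not the authors' code.
import torch
from torch.utils.data import DataLoader


def finetune(model, dataloader, lr, epochs=1, device="cuda"):
    """One finetuning stage: standard language-modeling loss on
    (image, instruction, response) batches."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss  # assumes the VLM returns a next-token cross-entropy loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model


# Hypothetical usage:
# vlm = ...                     # a LLaVA-style VLM (vision encoder + projector + LLM)
# stage1_loader = DataLoader(vision_flan_dataset, batch_size=16, shuffle=True)   # ~1.66M instances, 187 tasks
# stage2_loader = DataLoader(gpt4_synth_dataset, batch_size=16, shuffle=True)    # small, e.g. ~1,000 examples
# vlm = finetune(vlm, stage1_loader, lr=2e-5)   # Stage 1: Vision-Flan
# vlm = finetune(vlm, stage2_loader, lr=2e-5)   # Stage 2: GPT-4 synthesized data
```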