

FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

July 16, 2024
作者: Pengxiang Li, Zhi Gao, Bofei Zhang, Tao Yuan, Yuwei Wu, Mehrtash Harandi, Yunde Jia, Song-Chun Zhu, Qing Li
cs.AI

Abstract

Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset consisting of 1.1M multi-turn conversations derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up the data collection, FIRE is collected in two components: FIRE-100K and FIRE-1M, where FIRE-100K is generated by GPT-4V, and FIRE-1M is freely generated via models trained on FIRE-100K. We then build FIRE-Bench, a benchmark to comprehensively evaluate the feedback-refining capability of VLMs, which contains 11K feedback-refinement conversations as test data, two evaluation settings, and a model that provides feedback to VLMs. We develop the FIRE-LLaVA model by fine-tuning LLaVA on FIRE-100K and FIRE-1M; it shows remarkable feedback-refining capability on FIRE-Bench and outperforms untrained VLMs by 50%, making user-agent interactions more efficient and underscoring the significance of the FIRE dataset.
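The multi-turn respond-feedback-refine loop described in the abstract can be sketched as below. This is a minimal illustration, not code from the FIRE release: `model_respond` and `give_feedback` are hypothetical stand-ins for the VLM under evaluation and the feedback model, respectively.

```python
# Sketch of a feedback-refinement dialogue: the model answers, a
# feedback model critiques, and the loop continues until the answer
# is accepted or a turn budget is exhausted. Both callables are
# hypothetical stand-ins, not the actual FIRE models.

def refine_dialogue(question, model_respond, give_feedback, max_turns=3):
    """Run up to max_turns respond -> feedback rounds; return the transcript."""
    transcript = [("user", question)]
    feedback = None
    for _ in range(max_turns):
        answer = model_respond(question, transcript, feedback)
        transcript.append(("assistant", answer))
        accepted, feedback = give_feedback(question, answer)
        if accepted:  # feedback model judges the answer correct
            break
        transcript.append(("feedback", feedback))
    return transcript

# Toy usage with stub functions standing in for the real models.
def toy_model(question, transcript, feedback):
    return "revised answer" if feedback else "first answer"

def toy_feedback(question, answer):
    return (answer == "revised answer", "please be more specific")

dialogue = refine_dialogue("What is in the image?", toy_model, toy_feedback)
```

A training example in this framing is simply such a transcript; fewer rounds before acceptance corresponds to the more efficient user-agent interaction the paper reports.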

