Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
August 8, 2024
Authors: Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen
cs.AI
Abstract
High-performance Multimodal Large Language Models (MLLMs) rely heavily on
data quality. This study introduces a novel dataset named Img-Diff, designed to
enhance fine-grained image recognition in MLLMs by leveraging insights from
contrastive learning and image difference captioning. By analyzing object
differences between similar images, we challenge models to identify both
matching and distinct components. We utilize the Stable-Diffusion-XL model and
advanced image editing techniques to create pairs of similar images that
highlight object replacements. Our methodology includes a Difference Area
Generator for identifying object differences, followed by a Difference Captions
Generator for detailed difference descriptions. The result is a relatively
small but high-quality dataset of "object replacement" samples. We use the
proposed dataset to fine-tune state-of-the-art (SOTA) MLLMs such as MGM-7B,
yielding comprehensive improvements in performance scores over SOTA models
trained with larger-scale datasets, on numerous image difference and Visual
Question Answering tasks. For instance, our trained models notably surpass the
SOTA models GPT-4V and Gemini on the MMVP benchmark. In addition, we investigate
alternative methods for generating image difference data through "object
removal" and conduct a thorough evaluation to confirm the dataset's diversity,
quality, and robustness, presenting several insights on the synthesis of such
contrastive datasets. To encourage further research and advance the field of
multimodal data synthesis and enhancement of MLLMs' fundamental capabilities
for image understanding, we release our codes and dataset at
https://github.com/modelscope/data-juicer/tree/ImgDiff.
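The pipeline the abstract describes — a pair of similar images differing by one object replacement, a localized difference area, and a generated difference caption — can be sketched as a minimal sample-assembly step. This is an illustrative assumption of what one Img-Diff record might look like; the function and field names below are hypothetical and are not taken from the paper's released code.

```python
# Hypothetical sketch: pack an edited image pair, its difference region, and a
# difference caption into one question-answer record for MLLM fine-tuning.
# All names here are illustrative, not the authors' actual schema.

def build_difference_sample(image_a, image_b, bbox, old_object, new_object):
    """Assemble one 'object replacement' training sample from a similar image
    pair, the bounding box of the changed region, and the two object labels."""
    question = (
        "What is the difference between the two images "
        f"in the region {bbox}?"
    )
    answer = (
        f"In the first image the region contains a {old_object}, "
        f"while in the second image it has been replaced by a {new_object}."
    )
    return {
        "images": [image_a, image_b],  # paths to the similar image pair
        "bbox": bbox,                  # difference area as (x1, y1, x2, y2)
        "question": question,
        "answer": answer,
    }

sample = build_difference_sample(
    "pair_001_a.png", "pair_001_b.png",
    (120, 80, 260, 210), "bicycle", "motorcycle",
)
```

In this sketch the Difference Area Generator would supply `bbox` and the Difference Captions Generator would supply the answer text; here both are stubbed with simple placeholders to show only the record structure.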