

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

August 8, 2024
Authors: Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen
cs.AI

Abstract

High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components. We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements. Our methodology includes a Difference Area Generator for identifying object differences, followed by a Difference Captions Generator for detailed difference descriptions. The result is a relatively small but high-quality dataset of "object replacement" samples. We use the proposed dataset to fine-tune state-of-the-art (SOTA) MLLMs such as MGM-7B, yielding comprehensive improvements in performance scores over SOTA models trained with larger-scale datasets across numerous image difference and Visual Question Answering tasks. For instance, our trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. In addition, we investigate alternative methods for generating image difference data through "object removal" and conduct a thorough evaluation to confirm the dataset's diversity, quality, and robustness, presenting several insights on the synthesis of such contrastive datasets. To encourage further research and advance the field of multimodal data synthesis and the enhancement of MLLMs' fundamental capabilities for image understanding, we release our code and dataset at https://github.com/modelscope/data-juicer/tree/ImgDiff.
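
The abstract describes generating "object replacement" image pairs by editing a source image with Stable-Diffusion-XL so that only one object differs between the two images. Below is a minimal sketch of that editing step, assuming the diffusers library's StableDiffusionXLInpaintPipeline and a precomputed object mask; the file names, prompt, and checkpoint are illustrative, and the paper's actual pipeline adds the Difference Area Generator and Difference Captions Generator stages, which are not reproduced here.

```python
# Sketch: produce an "object replacement" pair with SDXL inpainting.
# Assumptions: diffusers + the public SDXL inpainting checkpoint below,
# plus a mask covering the object to be replaced (paths are placeholders).
import torch
from diffusers import StableDiffusionXLInpaintPipeline
from PIL import Image

pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

source = Image.open("source.png").convert("RGB").resize((1024, 1024))
mask = Image.open("object_mask.png").convert("L").resize((1024, 1024))  # region to replace

# Repaint only the masked region (e.g., swap a dog for a cat) so the two
# images differ in a single, describable object.
edited = pipe(
    prompt="a cat sitting on the grass",
    image=source,
    mask_image=mask,
    strength=0.99,
    num_inference_steps=30,
).images[0]

edited.save("edited.png")
# Downstream, a difference-region detector and a captioner (the paper's
# Difference Area Generator and Difference Captions Generator) would turn
# (source, edited) into a contrastive question-answer training sample.
```
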
