

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

August 8, 2024
作者: Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen
cs.AI

Abstract

High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components. We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements. Our methodology includes a Difference Area Generator for identifying object differences, followed by a Difference Captions Generator for producing detailed difference descriptions. The result is a relatively small but high-quality dataset of "object replacement" samples. We use the proposed dataset to fine-tune state-of-the-art (SOTA) MLLMs such as MGM-7B, yielding comprehensive improvements in performance scores over SOTA models trained on larger-scale datasets, across numerous image difference and Visual Question Answering tasks. For instance, our trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. In addition, we investigate alternative methods for generating image difference data through "object removal" and conduct thorough evaluations to confirm the dataset's diversity, quality, and robustness, presenting several insights on synthesizing such contrastive datasets. To encourage further research and advance the field of multimodal data synthesis and the enhancement of MLLMs' fundamental capabilities for image understanding, we release our code and dataset at https://github.com/modelscope/data-juicer/tree/ImgDiff.
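The abstract's two-stage pipeline begins with a Difference Area Generator that localizes where a pair of near-identical images diverge. The paper's component uses learned models; as a minimal, hypothetical sketch of the underlying idea, a pixel-level version can be written with NumPy (the function name `difference_area` and the threshold parameter are illustrative assumptions, not from the paper):

```python
import numpy as np

def difference_area(img_a: np.ndarray, img_b: np.ndarray, threshold: int = 30):
    """Return the bounding box (top, left, bottom, right) of the region
    where two same-sized RGB images differ, or None if they match.

    Toy stand-in for the paper's Difference Area Generator: the real
    pipeline relies on learned components, not raw pixel deltas.
    """
    # Signed difference in a wider dtype to avoid uint8 wraparound.
    diff = np.abs(img_a.astype(np.int16) - img_b.astype(np.int16))
    # A pixel counts as "changed" if any channel exceeds the threshold.
    mask = diff.max(axis=-1) > threshold
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1

# Two 8x8 RGB "images" identical except for a 3x3 patch,
# mimicking an "object replacement" edit.
a = np.zeros((8, 8, 3), dtype=np.uint8)
b = a.copy()
b[2:5, 3:6] = 255
print(difference_area(a, b))  # -> (2, 3, 5, 6)
```

A second stage (the Difference Captions Generator) would then describe the objects inside such a box in natural language; that step requires a captioning model and is not reproduced here.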

