Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Abstract

High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components. We use the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements. Our methodology includes a Difference Area Generator for identifying object differences, followed by a Difference Captions Generator for producing detailed difference descriptions. The result is a relatively small but high-quality dataset of "object replacement" samples. We use the proposed dataset to fine-tune state-of-the-art (SOTA) MLLMs such as MGM-7B, yielding comprehensive improvements in performance scores over SOTA models that were trained with larger-scale datasets, across numerous image difference and Visual Question Answering tasks. For instance, our trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. In addition, we investigate alternative methods for generating image difference data through "object removal" and conduct a thorough evaluation to confirm the dataset's diversity, quality, and robustness, presenting several insights into the synthesis of such contrastive datasets. To encourage further research and advance the field of multimodal data synthesis and the enhancement of MLLMs' fundamental capabilities for image understanding, we release our code and dataset at https://github.com/modelscope/data-juicer/tree/ImgDiff.
