MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

Abstract

The development of multimodal interactive systems is hindered by the lack of rich multimodal (text and image) conversational data, which LLMs need in large quantities. Previous approaches augment textual dialogues with retrieved images, which raises privacy, diversity, and quality constraints. In this work, we introduce Multimodal Augmented Generative Images Dialogues (MAGID), a framework that augments text-only dialogues with diverse, high-quality images. The utterances that would benefit from an image are first identified; a diffusion model is then applied to craft corresponding images, ensuring alignment with the identified text. Finally, MAGID incorporates an innovative feedback loop between an image description generation module (a textual LLM) and image quality modules (addressing aesthetics, image-text matching, and safety), which work in tandem to generate high-quality multimodal dialogues. We compare MAGID to other SOTA baselines on three dialogue datasets, using automated and human evaluation. Our results show that MAGID is comparable to or better than the baselines, with significant improvements in human evaluation, especially against retrieval baselines where the image database is small.
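The feedback loop described in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `QualityScores` and `augment_utterance`, the three callables, the thresholds, and the retry limit are all hypothetical stand-ins for the textual LLM, the diffusion model, and the quality modules.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class QualityScores:
    """Hypothetical container for the three checks named in the abstract."""
    aesthetics: float  # assumed: higher is better
    alignment: float   # assumed: image-text matching score, higher is better
    safe: bool


def augment_utterance(
    utterance: str,
    describe: Callable[[str, Optional[str]], str],      # stand-in for the textual LLM
    generate: Callable[[str], Any],                     # stand-in for the diffusion model
    assess: Callable[[Any, str], QualityScores],        # stand-in for the quality modules
    max_attempts: int = 3,        # assumed retry budget
    min_aesthetics: float = 0.5,  # assumed threshold
    min_alignment: float = 0.5,   # assumed threshold
) -> Optional[Any]:
    """Return an accepted image for the utterance, or None if every attempt fails."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        description = describe(utterance, feedback)
        image = generate(description)
        scores = assess(image, description)
        if (scores.safe
                and scores.aesthetics >= min_aesthetics
                and scores.alignment >= min_alignment):
            return image
        # Tell the description module why the last attempt was rejected.
        feedback = f"rejected {description!r}: {scores}"
    return None
```

Passing the three components in as callables keeps the sketch independent of any particular LLM or diffusion library; a caller would supply real model wrappers in place of the stand-ins.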
