MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
March 5, 2024
Authors: Hossein Aboutalebi, Hwanjun Song, Yusheng Xie, Arshit Gupta, Justin Sun, Hang Su, Igor Shalyminov, Nikolaos Pappas, Siffi Singh, Saab Mansour
cs.AI
Abstract
Development of multimodal interactive systems is hindered by the lack of rich, multimodal (text, images) conversational data, which is needed in large quantities for LLMs. Previous approaches augment textual dialogues with retrieved images, posing privacy, diversity, and quality constraints. In this work, we introduce Multimodal Augmented Generative Images Dialogues (MAGID), a framework to augment text-only dialogues with diverse and high-quality images. MAGID first uses an LLM to identify the utterances best suited to image augmentation; subsequently, a diffusion model is applied to craft corresponding images, ensuring alignment with the identified text. Finally, MAGID incorporates an innovative feedback loop between an image description generation module (a textual LLM) and image quality modules (addressing aesthetics, image-text matching, and safety) that work in tandem to generate high-quality, multi-modal dialogues. We compare MAGID to other SOTA baselines on three dialogue datasets, using automated and human evaluation. Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation, especially against retrieval baselines where the image database is small.
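
The abstract describes a pipeline whose core is a feedback loop: a textual LLM proposes an image description, a diffusion model renders it, and quality modules (aesthetics, image-text matching, safety) either accept the image or send failure feedback back to the LLM for another attempt. The Python sketch below illustrates one plausible shape of that loop; it is not the paper's implementation, and every name in it (llm, diffusion, scorer, QualityReport, the retry count, and the score thresholds) is a hypothetical stand-in.

from dataclasses import dataclass

@dataclass
class QualityReport:
    """Hypothetical container for the three quality checks named in the abstract."""
    aesthetic: float   # e.g. a learned aesthetic score in [0, 1]
    clip_match: float  # image-text similarity (e.g. a CLIP-style score)
    is_safe: bool      # verdict from a safety/NSFW classifier

def augment_utterance(utterance, llm, diffusion, scorer,
                      max_rounds=3, min_aesthetic=0.5, min_match=0.25):
    """Generate an image for one dialogue turn, retrying on low quality.

    Assumed callables (illustrative signatures, not from the paper):
      llm(prompt) -> str            # image description generator
      diffusion(description) -> img # text-to-image model
      scorer(img, description) -> QualityReport
    """
    feedback = ""
    for _ in range(max_rounds):
        # 1. Textual LLM proposes an image description, conditioned on
        #    the utterance and any feedback from the previous round.
        description = llm(f"Write an image description for: {utterance}\n{feedback}")

        # 2. Diffusion model renders the description.
        image = diffusion(description)

        # 3. Quality modules score aesthetics, image-text match, and safety.
        report = scorer(image, description)
        if (report.is_safe and report.aesthetic >= min_aesthetic
                and report.clip_match >= min_match):
            return image, description  # accept this image

        # 4. Feed the failure back to the LLM and try again.
        feedback = (f"Previous attempt scored aesthetic={report.aesthetic:.2f}, "
                    f"match={report.clip_match:.2f}; revise the description.")

    return None, None  # give up: leave the turn text-only

The loop structure makes the abstract's claim concrete: because generation replaces retrieval, a rejected image costs only another diffusion call rather than being limited by the size of an image database.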