MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
March 5, 2024
Authors: Hossein Aboutalebi, Hwanjun Song, Yusheng Xie, Arshit Gupta, Justin Sun, Hang Su, Igor Shalyminov, Nikolaos Pappas, Siffi Singh, Saab Mansour
cs.AI
Abstract
Development of multimodal interactive systems is hindered by the lack of rich, multimodal (text, images) conversational data, which is needed in large quantities for LLMs. Previous approaches augment textual dialogues with retrieved images, posing privacy, diversity, and quality constraints. In this work, we introduce Multimodal Augmented Generative Images Dialogues (MAGID), a framework to augment text-only dialogues with diverse and high-quality images. MAGID first uses an LLM to identify the utterances best suited to image augmentation; subsequently, a diffusion model is applied to craft corresponding images, ensuring alignment with the identified text. Finally, MAGID incorporates an innovative feedback loop between an image description generation module (a textual LLM) and image quality modules (addressing aesthetics, image-text matching, and safety) that work in tandem to generate high-quality, multi-modal dialogues. We compare MAGID to other SOTA baselines on three dialogue datasets, using automated and human evaluation. Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation, especially against retrieval baselines where the image database is small.
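
The abstract describes a pipeline whose core is a feedback loop: a textual LLM proposes an image description, a diffusion model renders it, and quality modules (aesthetics, image-text matching, safety) either accept the image or send failure feedback back to the LLM for another attempt. The Python sketch below illustrates one plausible shape of that loop; it is not the paper's implementation, and every name in it (llm, diffusion, scorer, QualityReport, the retry count, and the score thresholds) is a hypothetical stand-in.

from dataclasses import dataclass

@dataclass
class QualityReport:
    """Hypothetical container for the three quality checks named in the abstract."""
    aesthetic: float   # e.g. a learned aesthetic score in [0, 1]
    clip_match: float  # image-text similarity (e.g. a CLIP-style score)
    is_safe: bool      # verdict from a safety/NSFW classifier

def augment_utterance(utterance, llm, diffusion, scorer,
                      max_rounds=3, min_aesthetic=0.5, min_match=0.25):
    """Generate an image for one dialogue turn, retrying on low quality.

    Assumed callables (illustrative signatures, not from the paper):
      llm(prompt) -> str            # image description generator
      diffusion(description) -> img # text-to-image model
      scorer(img, description) -> QualityReport
    """
    feedback = ""
    for _ in range(max_rounds):
        # 1. Textual LLM proposes an image description, conditioned on
        #    the utterance and any feedback from the previous round.
        description = llm(f"Write an image description for: {utterance}\n{feedback}")

        # 2. Diffusion model renders the description.
        image = diffusion(description)

        # 3. Quality modules score aesthetics, image-text match, and safety.
        report = scorer(image, description)
        if (report.is_safe and report.aesthetic >= min_aesthetic
                and report.clip_match >= min_match):
            return image, description  # accept this image

        # 4. Feed the failure back to the LLM and try again.
        feedback = (f"Previous attempt scored aesthetic={report.aesthetic:.2f}, "
                    f"match={report.clip_match:.2f}; revise the description.")

    return None, None  # give up: leave the turn text-only

The loop structure makes the abstract's claim concrete: because generation replaces retrieval, a rejected image costs only another diffusion call rather than being limited by the size of an image database.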