MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
March 5, 2024
Authors: Hossein Aboutalebi, Hwanjun Song, Yusheng Xie, Arshit Gupta, Justin Sun, Hang Su, Igor Shalyminov, Nikolaos Pappas, Siffi Singh, Saab Mansour
cs.AI
Abstract
Development of multimodal interactive systems is hindered by the lack of
rich, multimodal (text, images) conversational data, which is needed in large
quantities for LLMs. Previous approaches augment textual dialogues with
retrieved images, posing privacy, diversity, and quality constraints. In this
work, we introduce Multimodal Augmented Generative
Images Dialogues (MAGID), a framework to augment text-only
dialogues with diverse and high-quality images. Subsequently, a diffusion model
is applied to craft corresponding images, ensuring alignment with the
identified text. Finally, MAGID incorporates an innovative feedback loop
between an image description generation module (textual LLM) and image quality
modules (addressing aesthetics, image-text matching, and safety) that work in
tandem to generate high-quality and multi-modal dialogues. We compare MAGID to
other SOTA baselines on three dialogue datasets, using automated and human
evaluation. Our results show that MAGID is comparable to or better than
baselines, with significant improvements in human evaluation, especially
against retrieval baselines where the image database is small.
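To make the described pipeline concrete, below is a minimal sketch of the kind of feedback loop the abstract outlines: a textual LLM proposes an image description for a dialogue turn, a diffusion model renders it, and quality modules (aesthetics, image-text matching, safety) either accept the image or send feedback back to the LLM for another attempt. All function names, thresholds, and stub behavior here are illustrative assumptions, not the authors' actual implementation.

```python
"""Hypothetical sketch of a MAGID-style generate-and-check feedback loop."""

from dataclasses import dataclass


@dataclass
class QualityReport:
    aesthetics: float         # aesthetic score, higher is better (assumed scale 0-1)
    image_text_match: float   # image-text similarity, e.g., a CLIP-style score
    is_safe: bool             # safety-classifier verdict


# Placeholder modules: real versions would call an LLM, a diffusion model,
# and trained aesthetic / matching / safety models.

def describe_utterance(utterance: str, feedback: str | None = None) -> str:
    """Textual LLM: turn an utterance (plus optional feedback) into an image prompt."""
    return f"photo illustrating: {utterance}" + (f" ({feedback})" if feedback else "")


def generate_image(prompt: str) -> bytes:
    """Diffusion model: render an image for the prompt (stubbed)."""
    return prompt.encode()


def assess_image(image: bytes, prompt: str) -> QualityReport:
    """Image quality modules: aesthetics, image-text matching, safety (stubbed)."""
    return QualityReport(aesthetics=0.8, image_text_match=0.75, is_safe=True)


def augment_utterance(utterance: str, max_rounds: int = 3) -> bytes | None:
    """Regenerate the description until the image passes all checks, or give up."""
    feedback = None
    for _ in range(max_rounds):
        prompt = describe_utterance(utterance, feedback)
        image = generate_image(prompt)
        report = assess_image(image, prompt)
        if report.is_safe and report.aesthetics > 0.5 and report.image_text_match > 0.6:
            return image  # accepted: attach this image to the dialogue turn
        # Otherwise, feed the rejection back to the LLM and try again.
        feedback = f"previous image rejected ({report}); refine the description"
    return None  # keep the turn text-only if no image passes


if __name__ == "__main__":
    print(augment_utterance("I just adopted a golden retriever puppy!"))
```

The acceptance thresholds and the retry budget are arbitrary here; the key design point from the abstract is that image generation is gated by quality modules whose failures are routed back to the description-generating LLM rather than being silently dropped.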