MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
March 5, 2024
Authors: Hossein Aboutalebi, Hwanjun Song, Yusheng Xie, Arshit Gupta, Justin Sun, Hang Su, Igor Shalyminov, Nikolaos Pappas, Siffi Singh, Saab Mansour
cs.AI
Abstract
Development of multimodal interactive systems is hindered by the lack of
rich, multimodal (text, images) conversational data, which is needed in large
quantities for LLMs. Previous approaches augment textual dialogues with
retrieved images, posing privacy, diversity, and quality constraints. In this
work, we introduce Multimodal Augmented Generative
Images Dialogues (MAGID), a framework to augment text-only
dialogues with diverse and high-quality images. Subsequently, a diffusion model
is applied to craft corresponding images, ensuring alignment with the
identified text. Finally, MAGID incorporates an innovative feedback loop
between an image description generation module (textual LLM) and image quality
modules (addressing aesthetics, image-text matching, and safety) that work in
tandem to generate high-quality and multi-modal dialogues. We compare MAGID to
other SOTA baselines on three dialogue datasets, using automated and human
evaluation. Our results show that MAGID is comparable to or better than
baselines, with significant improvements in human evaluation, especially
against retrieval baselines where the image database is small.
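To make the described pipeline concrete, below is a minimal sketch of the kind of feedback loop the abstract outlines: a textual LLM proposes an image description for a dialogue turn, a diffusion model renders it, and quality modules (aesthetics, image-text matching, safety) either accept the image or send feedback back to the LLM for another attempt. All function names, thresholds, and stub behavior here are illustrative assumptions, not the authors' actual implementation.

```python
"""Hypothetical sketch of a MAGID-style generate-and-check feedback loop."""

from dataclasses import dataclass


@dataclass
class QualityReport:
    aesthetics: float         # aesthetic score, higher is better (assumed scale 0-1)
    image_text_match: float   # image-text similarity, e.g., a CLIP-style score
    is_safe: bool             # safety-classifier verdict


# Placeholder modules: real versions would call an LLM, a diffusion model,
# and trained aesthetic / matching / safety models.

def describe_utterance(utterance: str, feedback: str | None = None) -> str:
    """Textual LLM: turn an utterance (plus optional feedback) into an image prompt."""
    return f"photo illustrating: {utterance}" + (f" ({feedback})" if feedback else "")


def generate_image(prompt: str) -> bytes:
    """Diffusion model: render an image for the prompt (stubbed)."""
    return prompt.encode()


def assess_image(image: bytes, prompt: str) -> QualityReport:
    """Image quality modules: aesthetics, image-text matching, safety (stubbed)."""
    return QualityReport(aesthetics=0.8, image_text_match=0.75, is_safe=True)


def augment_utterance(utterance: str, max_rounds: int = 3) -> bytes | None:
    """Regenerate the description until the image passes all checks, or give up."""
    feedback = None
    for _ in range(max_rounds):
        prompt = describe_utterance(utterance, feedback)
        image = generate_image(prompt)
        report = assess_image(image, prompt)
        if report.is_safe and report.aesthetics > 0.5 and report.image_text_match > 0.6:
            return image  # accepted: attach this image to the dialogue turn
        # Otherwise, feed the rejection back to the LLM and try again.
        feedback = f"previous image rejected ({report}); refine the description"
    return None  # keep the turn text-only if no image passes


if __name__ == "__main__":
    print(augment_utterance("I just adopted a golden retriever puppy!"))
```

The acceptance thresholds and the retry budget are arbitrary here; the key design point from the abstract is that image generation is gated by quality modules whose failures are routed back to the description-generating LLM rather than being silently dropped.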