CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
April 4, 2024
Authors: Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li
cs.AI
Abstract
Diffusion models have demonstrated great success in the field of
text-to-image generation. However, alleviating the misalignment between the
text prompts and images is still challenging. The root cause of the
misalignment has not been extensively investigated. We observe that the
misalignment is caused by inadequate token attention activation. We further
attribute this phenomenon to the diffusion model's insufficient condition
utilization, which is caused by its training paradigm. To address the issue, we
propose CoMat, an end-to-end diffusion model fine-tuning strategy with an
image-to-text concept matching mechanism. We leverage an image captioning model
to measure image-to-text alignment and guide the diffusion model to revisit
ignored tokens. A novel attribute concentration module is also proposed to
address the attribute binding problem. Without any image or human preference
data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL.
Extensive experiments show that CoMat-SDXL significantly outperforms the
baseline model SDXL in two text-to-image alignment benchmarks and achieves
state-of-the-art performance.
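
To make the image-to-text concept matching idea concrete, below is a minimal conceptual sketch in PyTorch of the kind of objective the abstract describes: a captioning model scores how well a generated image supports each prompt token, and the diffusion model is fine-tuned to raise that score. This is not the authors' implementation; `diffusion_model.generate`, `caption_model`, and `tokenizer` are hypothetical stand-ins for an SDXL-style generator and a Hugging Face-style captioner.

```python
# Conceptual sketch of an image-to-text concept matching loss.
# All model interfaces here are assumptions, not the paper's actual code.
import torch
import torch.nn.functional as F

def concept_matching_loss(diffusion_model, caption_model, tokenizer, prompt):
    # Generate an image from the text prompt. For gradients to reach the
    # diffusion model, generation must remain differentiable end-to-end
    # (e.g., a short deterministic sampler with no .detach() calls).
    image = diffusion_model.generate(prompt)  # hypothetical differentiable sampler

    # Score how well the image supports every prompt token: the captioner
    # predicts the prompt conditioned on the image, so tokens the image
    # fails to depict receive low log-probability.
    token_ids = tokenizer(prompt, return_tensors="pt").input_ids      # [B, T]
    logits = caption_model(image, token_ids[:, :-1])                  # [B, T-1, V], teacher-forced
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(
        -1, token_ids[:, 1:].unsqueeze(-1)
    ).squeeze(-1)                                                     # [B, T-1]

    # Maximizing the prompt's log-likelihood pushes the diffusion model
    # to revisit concepts (tokens) it previously ignored.
    return -token_log_probs.mean()
```

In a fine-tuning loop, this loss would be backpropagated into the diffusion model's weights while the captioning model stays frozen, so the captioner acts purely as an alignment critic rather than a second trainable component.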