
TextCraftor: Your Text Encoder Can be Image Quality Controller

March 27, 2024
Authors: Yanyu Li, Xian Liu, Anil Kag, Ju Hu, Yerlan Idelbayev, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov, Jian Ren
cs.AI

Abstract

Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation, enabling significant advancements in areas like image editing and video synthesis. Despite their formidable capabilities, these models are not without their limitations. It is still challenging to synthesize an image that aligns well with the input text, and multiple runs with carefully crafted prompts are required to achieve satisfactory results. To mitigate these limitations, numerous studies have endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing various technologies. Yet, amidst these efforts, a pivotal question of text-to-image diffusion model training has remained largely unexplored: Is it possible and feasible to fine-tune the text encoder to improve the performance of text-to-image diffusion models? Our findings reveal that, instead of replacing the CLIP text encoder used in Stable Diffusion with other large language models, we can enhance it through our proposed fine-tuning approach, TextCraftor, leading to substantial improvements in quantitative benchmarks and human assessments. Interestingly, our technique also empowers controllable image generation through the interpolation of different text encoders fine-tuned with various rewards. We also demonstrate that TextCraftor is orthogonal to UNet finetuning, and can be combined to further improve generative quality.
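The abstract mentions controllable generation by interpolating text encoders fine-tuned with different rewards. The following is a minimal, illustrative sketch of that weight-interpolation idea, not the authors' released code; the checkpoint paths, function name, and blending weight `alpha` are hypothetical placeholders.

```python
# Sketch: linearly blend the weights of two CLIP text encoders that were
# fine-tuned with different rewards, as described in the abstract.
import copy

from transformers import CLIPTextModel


def interpolate_text_encoders(encoder_a: CLIPTextModel,
                              encoder_b: CLIPTextModel,
                              alpha: float) -> CLIPTextModel:
    """Return a new encoder whose float parameters are alpha * A + (1 - alpha) * B."""
    state_a = encoder_a.state_dict()
    state_b = encoder_b.state_dict()
    blended_state = {}
    for name, tensor_a in state_a.items():
        tensor_b = state_b[name]
        if tensor_a.is_floating_point():
            blended_state[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
        else:
            # Integer buffers (e.g., position ids) are copied, not interpolated.
            blended_state[name] = tensor_a
    blended = copy.deepcopy(encoder_a)
    blended.load_state_dict(blended_state)
    return blended


# Hypothetical checkpoints: two text encoders fine-tuned with different reward functions.
enc_reward_a = CLIPTextModel.from_pretrained("path/to/text_encoder_reward_a")
enc_reward_b = CLIPTextModel.from_pretrained("path/to/text_encoder_reward_b")

# alpha steers generation between the two reward-tuned behaviors.
blended_encoder = interpolate_text_encoders(enc_reward_a, enc_reward_b, alpha=0.5)
```

The blended encoder could then replace the text encoder of a Stable Diffusion pipeline (e.g., the `text_encoder` component of a `diffusers` StableDiffusionPipeline) so that a single scalar controls the trade-off between the two reward-tuned behaviors.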
