TextCraftor: Your Text Encoder Can be Image Quality Controller
March 27, 2024
Authors: Yanyu Li, Xian Liu, Anil Kag, Ju Hu, Yerlan Idelbayev, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov, Jian Ren
cs.AI
Abstract
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have
revolutionized the field of content generation, enabling significant
advancements in areas like image editing and video synthesis. Despite their
formidable capabilities, these models are not without their limitations. It is
still challenging to synthesize an image that aligns well with the input text,
and multiple runs with carefully crafted prompts are required to achieve
satisfactory results. To mitigate these limitations, numerous studies have
endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing
various technologies. Yet, amidst these efforts, a pivotal question of
text-to-image diffusion model training has remained largely unexplored: Is it
possible and feasible to fine-tune the text encoder to improve the performance
of text-to-image diffusion models? Our findings reveal that, instead of
replacing the CLIP text encoder used in Stable Diffusion with other large
language models, we can enhance it through our proposed fine-tuning approach,
TextCraftor, leading to substantial improvements in quantitative benchmarks and
human assessments. Interestingly, our technique also empowers controllable
image generation through the interpolation of different text encoders
fine-tuned with various rewards. We also demonstrate that TextCraftor is
orthogonal to UNet fine-tuning, and can be combined to further improve
generative quality.
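To illustrate the encoder-interpolation idea mentioned in the abstract, below is a minimal sketch, not the authors' released code: it linearly blends the weights of two CLIP text encoders fine-tuned with different rewards. It assumes Hugging Face transformers' CLIPTextModel; the checkpoint paths and the helper name interpolate_text_encoders are hypothetical placeholders.

import torch
from transformers import CLIPTextModel

def interpolate_text_encoders(path_a: str, path_b: str, alpha: float) -> CLIPTextModel:
    """Return an encoder whose weights are (1 - alpha) * A + alpha * B."""
    enc_a = CLIPTextModel.from_pretrained(path_a)  # fine-tuned with reward A
    enc_b = CLIPTextModel.from_pretrained(path_b)  # fine-tuned with reward B
    state_a, state_b = enc_a.state_dict(), enc_b.state_dict()
    # Linearly interpolate floating-point weights; copy integer buffers
    # (e.g., position ids) through unchanged.
    blended = {
        k: torch.lerp(state_a[k], state_b[k], alpha)
        if state_a[k].is_floating_point() else state_a[k]
        for k in state_a
    }
    enc_a.load_state_dict(blended)
    return enc_a

Under this sketch's assumptions, the blended encoder could be swapped in for pipe.text_encoder of a diffusers StableDiffusionPipeline, and sweeping alpha would trade off between the behaviors induced by the two reward signals.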