

Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

October 5, 2023
Authors: Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov
cs.AI

Abstract

Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, there are diffusion-based models that have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture, combining the principles of the image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.
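The abstract describes a two-stage design: a separately trained image prior maps a text embedding to a CLIP-style image embedding, latent diffusion generates a latent conditioned on that embedding, and a MoVQ-based autoencoder decodes the latent into pixels. The toy sketch below illustrates only the data flow between these stages; every function, dimension, and the trivial "denoising" loop are hypothetical stand-ins, not the paper's actual 3.3B-parameter models.

```python
import numpy as np

rng = np.random.default_rng(0)

TEXT_DIM, IMG_DIM = 8, 16  # toy embedding sizes (hypothetical)

# Stage 1: "image prior" -- maps a text embedding to a CLIP-style image
# embedding. Here it is just a fixed random linear map for illustration.
W_prior = rng.normal(size=(TEXT_DIM, IMG_DIM))

def image_prior(text_emb):
    return text_emb @ W_prior

# Stage 2: "latent diffusion" conditioned on the predicted image embedding,
# caricatured as a few steps that pull a noisy latent toward the condition.
def latent_diffusion(img_emb, steps=10):
    latent = rng.normal(size=img_emb.shape)
    for _ in range(steps):
        latent = latent + 0.5 * (img_emb - latent)
    return latent

# Stage 3: MoVQ-style decoder -- maps the latent back to pixel space.
# Here a bare reshape stands in for the real autoencoder's decoder.
def movq_decode(latent):
    return latent.reshape(4, 4)

text_emb = rng.normal(size=TEXT_DIM)
img_emb = image_prior(text_emb)                  # text -> image embedding
image = movq_decode(latent_diffusion(img_emb))   # embedding -> latent -> "image"
print(image.shape)
```

The point of the sketch is the interface: the prior's output embedding is the only signal passed to the diffusion stage, which is what lets the prior be trained separately, as the abstract states.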