
Kandinsky 3.0 Technical Report

December 6, 2023
作者: Vladimir Arkhipkin, Andrei Filatov, Viacheslav Vasilev, Anastasia Maltseva, Said Azizov, Igor Pavlov, Julia Agafonova, Andrey Kuznetsov, Denis Dimitrov
cs.AI

Abstract

We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of Kandinsky text-to-image models and reflecting our progress toward higher quality and realism in image generation. Compared to the previous Kandinsky 2.x versions, Kandinsky 3.0 uses a two times larger U-Net backbone and a ten times larger text encoder, and removes diffusion mapping. We describe the model architecture, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, as a large number of experiments showed, had the most significant impact on improving the quality of our model. In our side-by-side comparisons, Kandinsky becomes better at text understanding and performs better in specific domains. Project page: https://ai-forever.github.io/Kandinsky-3
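For readers unfamiliar with the latent-diffusion setup the abstract refers to, the sketch below illustrates the overall sampling pipeline: a text encoder conditions a denoising network that runs a reverse-diffusion loop in latent space. This is a toy, self-contained illustration, not Kandinsky 3.0's actual implementation; `encode_text`, `denoiser`, and the linear noise schedule are hypothetical stand-ins for the real text encoder, U-Net backbone, and training schedule.

```python
# Toy sketch of a latent-diffusion sampling loop (DDPM-style).
# All components are illustrative placeholders, not the real model.
import numpy as np

rng = np.random.default_rng(0)
T = 50                                   # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def encode_text(prompt: str, dim: int = 16) -> np.ndarray:
    # Stand-in for a large text encoder: deterministic hash-seeded embedding.
    r = np.random.default_rng(abs(hash(prompt)) % (2 ** 32))
    return r.standard_normal(dim)

def denoiser(z_t: np.ndarray, t: int, cond: np.ndarray) -> np.ndarray:
    # Stand-in for the U-Net: predicts the noise present in z_t,
    # conditioned on the text embedding. A real model is a trained network.
    return 0.1 * z_t + 0.01 * cond[: z_t.shape[0]]

def sample_latent(prompt: str, latent_dim: int = 16) -> np.ndarray:
    cond = encode_text(prompt, latent_dim)
    z = rng.standard_normal(latent_dim)   # start from pure Gaussian noise
    for t in reversed(range(T)):          # reverse diffusion, t = T-1 .. 0
        eps = denoiser(z, t, cond)
        # Posterior mean of the previous latent given the predicted noise.
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                         # add noise except at the final step
            z = z + np.sqrt(betas[t]) * rng.standard_normal(latent_dim)
    return z                              # would be fed to an image decoder

z0 = sample_latent("a photo of a cat")
print(z0.shape)
```

In the full pipeline the resulting latent would be passed through an image decoder (e.g. a VAE decoder) to produce pixels; that stage is omitted here for brevity.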