

Kandinsky 3.0 Technical Report

December 6, 2023
作者: Vladimir Arkhipkin, Andrei Filatov, Viacheslav Vasilev, Anastasia Maltseva, Said Azizov, Igor Pavlov, Julia Agafonova, Andrey Kuznetsov, Denis Dimitrov
cs.AI

Abstract

We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress toward higher quality and realism in image generation. Compared to the previous Kandinsky 2.x versions, Kandinsky 3.0 uses a two times larger U-Net backbone and a ten times larger text encoder, and removes diffusion mapping. We describe the model architecture, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, as identified through a large number of experiments, had the most significant impact on improving the quality of our model relative to other models. In our side-by-side comparisons, Kandinsky performs better in text understanding and on specific domains. Project page: https://ai-forever.github.io/Kandinsky-3
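To make the latent-diffusion setup described above concrete, here is a minimal, illustrative sketch of the generation loop, not the actual Kandinsky 3.0 code: a dummy noise predictor stands in for the large U-Net backbone, a zero vector stands in for the text-encoder embedding, and the simplified denoising update replaces a real scheduler. All names and shapes are assumptions for illustration.

```python
import numpy as np

def dummy_unet(latent, t, text_emb):
    # Hypothetical noise predictor; a real model would be a large U-Net
    # conditioned on the text embedding via cross-attention.
    return 0.1 * latent + 0.01 * t + 0.0 * text_emb.mean()

def generate(text_emb, steps=10, latent_shape=(4, 32, 32), seed=0):
    # Start from Gaussian noise in latent space and iteratively denoise;
    # a real pipeline would then decode the latent to pixel space.
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(latent_shape)
    for t in reversed(range(steps)):
        eps = dummy_unet(latent, t, text_emb)
        latent = latent - eps / steps  # simplified denoising update
    return latent

text_emb = np.zeros(128)  # stands in for the text encoder's output
latent = generate(text_emb)
print(latent.shape)  # (4, 32, 32)
```

The sketch only shows the control flow; the report's contribution lies in scaling the denoiser and text encoder and dropping the diffusion-mapping stage used in Kandinsky 2.x.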