

Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

October 5, 2023
作者: Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov
cs.AI

Abstract

Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, diffusion-based models have demonstrated notable quality gains. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture that combines the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to CLIP image embeddings. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variation generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate an FID score of 8.03 on the COCO-30K dataset, making our model the top open-source performer in terms of measurable image generation quality.
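
The abstract describes a two-stage pipeline: a separately trained image prior maps text embeddings to CLIP image embeddings, and a latent diffusion decoder with a modified MoVQ autoencoder turns those embeddings into pixels. The snippet below is a minimal inference sketch assuming the community Hugging Face diffusers port of Kandinsky 2.1 (the KandinskyPriorPipeline and KandinskyPipeline classes and the kandinsky-community model repositories); class names, repository IDs, and argument names may differ across library versions and from the authors' own released code.

from diffusers import KandinskyPriorPipeline, KandinskyPipeline

# Stage 1: the separately trained image prior maps the text prompt
# to a CLIP image embedding (plus an unconditional embedding for guidance).
prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior"
).to("cuda")

# Stage 2: the latent diffusion decoder, conditioned on that image embedding,
# denoises latents that the modified MoVQ autoencoder decodes into an image.
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1"
).to("cuda")

prompt = "a red fox standing in a snowy birch forest, photorealistic"
image_embeds, negative_image_embeds = prior(prompt).to_tuple()

image = decoder(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=512,
    width=512,
).images[0]
image.save("kandinsky_sample.png")

Because the decoder is conditioned on CLIP image embeddings rather than raw text, it can presumably also be driven by embeddings of existing images, which is consistent with the image fusion and image variation modes mentioned in the abstract.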