
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

October 5, 2023
Authors: Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov
cs.AI

Abstract

Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, diffusion-based models have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture that combines the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to the image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variation generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate an FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.
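
To make the two-stage design concrete (a separately trained image prior followed by a latent diffusion decoder built around the MoVQ autoencoder), the following is a minimal text-to-image sketch. It assumes the Hugging Face diffusers integration of Kandinsky 2.1 and the kandinsky-community checkpoints, which are not described on this page; class names, checkpoint ids, and argument names follow that integration and may differ between diffusers versions.

```python
# Minimal sketch of the two-stage Kandinsky generation flow, assuming the
# diffusers integration of Kandinsky 2.1 (not the original release repo).
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

# Stage 1: separately trained image prior maps the text prompt
# to a CLIP image embedding. A CUDA GPU is assumed here.
prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
).to("cuda")

prompt = "a portrait of a red cat, 4k photo"
negative_prompt = "low quality, blurry"
image_embeds, negative_image_embeds = prior(prompt, negative_prompt).to_tuple()

# Stage 2: latent diffusion conditioned on the image embedding; the MoVQ
# autoencoder decodes the final latents into the output image.
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
).to("cuda")

image = decoder(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
    num_inference_steps=100,
).images[0]
image.save("kandinsky_sample.png")
```

The same prior/decoder split underlies the other generative modes listed above: image fusion and image variations swap the text-derived image embedding for (mixtures of) embeddings obtained from reference images, while inpainting/outpainting condition the latent diffusion stage on a masked input image.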