Kandinsky 3.0 Technical Report

Abstract

We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion. It continues the Kandinsky series of text-to-image models and reflects our progress toward higher quality and realism in image generation. Compared to the previous Kandinsky 2.x versions, Kandinsky 3.0 uses a U-Net backbone two times larger and a text encoder ten times larger, and it removes diffusion mapping. We describe the model architecture, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, based on a large number of experiments, had the most significant impact on improving the quality of our model compared to others. In our side-by-side comparisons, Kandinsky improves in text understanding and performs better on specific domains. Project page: https://ai-forever.github.io/Kandinsky-3
