RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

Abstract

Text-to-image generation has recently seen remarkable progress. We introduce RAPHAEL, a text-conditional image diffusion model that generates highly artistic images which accurately portray text prompts containing multiple nouns, adjectives, and verbs. This is achieved by stacking tens of mixture-of-experts (MoE) layers, namely space-MoE and time-MoE layers, enabling billions of diffusion paths (routes) from the network input to the output. Each path intuitively functions as a "painter" that depicts a particular textual concept onto a specified image region at a given diffusion timestep. Comprehensive experiments show that RAPHAEL outperforms recent cutting-edge models, such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in both image quality and aesthetic appeal. First, RAPHAEL exhibits superior performance in switching images across diverse styles, such as Japanese comics, realism, cyberpunk, and ink illustration. Second, a single model with three billion parameters, trained on 1,000 A100 GPUs for two months, achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore, RAPHAEL significantly surpasses its counterparts in human evaluation on the ViLG-300 benchmark. We believe that RAPHAEL has the potential to advance the frontiers of image generation research in both academia and industry, paving the way for future breakthroughs in this rapidly evolving field. More details can be found on the project webpage: https://raphael-painter.github.io/.
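The claim that stacked MoE layers yield billions of diffusion paths can be made concrete with a small counting sketch. It assumes that exactly one expert is selected per MoE layer, so the number of routes is the product of the expert counts. The layer and expert counts below are hypothetical values chosen for illustration; the abstract does not state them.

```python
from math import prod


def count_paths(experts_per_layer):
    """Count distinct routes when exactly one expert is chosen in each MoE layer."""
    return prod(experts_per_layer)


# Hypothetical configuration (NOT from the abstract): 16 space-MoE layers and
# 16 time-MoE layers, interleaved, each with 2 experts.
num_space_layers = 16
num_time_layers = 16
experts_per_moe_layer = 2

layers = [experts_per_moe_layer] * (num_space_layers + num_time_layers)
total_paths = count_paths(layers)

print(f"{len(layers)} MoE layers -> {total_paths:,} possible paths")
```

Under these assumed numbers the count is 2^32, roughly 4.3 billion, which shows how even a few experts per layer reach the "billions of paths" regime once tens of layers are stacked.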
