RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

Abstract

Text-to-image generation has recently witnessed remarkable achievements. We introduce a text-conditional image diffusion model, termed RAPHAEL, to generate highly artistic images that accurately portray text prompts encompassing multiple nouns, adjectives, and verbs. This is achieved by stacking tens of mixture-of-experts (MoE) layers, i.e., space-MoE and time-MoE layers, enabling billions of diffusion paths (routes) from the network input to the output. Each path intuitively functions as a "painter" that depicts a particular textual concept onto a specified image region at a diffusion timestep. Comprehensive experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both image quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior performance in switching images across diverse styles, such as Japanese comics, realism, cyberpunk, and ink illustration. Secondly, a single model with three billion parameters, trained on 1,000 A100 GPUs for two months, achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore, RAPHAEL significantly surpasses its counterparts in human evaluation on the ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the frontiers of image generation research in both academia and industry, paving the way for future breakthroughs in this rapidly evolving field. More details can be found on the project webpage: https://raphael-painter.github.io/.
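To make the notion of "diffusion paths" concrete, the following is a minimal sketch of how stacking MoE layers multiplies the number of routes, assuming exactly one expert is selected per layer. The gating functions, the expert counts, and the layer configuration are hypothetical illustrations and are not taken from the abstract; the actual gating in RAPHAEL may differ.

```python
from math import prod


def count_paths(experts_per_layer):
    """Number of distinct routes if exactly one expert is chosen in each layer."""
    return prod(experts_per_layer)


def time_expert(timestep, num_timesteps, num_experts):
    """Toy time-MoE gate (assumption): bucket the timestep into equal-width ranges."""
    if not 0 <= timestep < num_timesteps:
        raise ValueError("timestep out of range")
    return timestep * num_experts // num_timesteps


def space_expert(gate_scores):
    """Toy space-MoE gate (assumption): pick the highest-scoring expert for one region."""
    return max(range(len(gate_scores)), key=gate_scores.__getitem__)


# Hypothetical configuration: 8 blocks, each holding one 4-expert space-MoE
# layer and one 4-expert time-MoE layer (16 MoE layers in total).
layers = [4, 4] * 8
total_paths = count_paths(layers)  # 4**16 routes, i.e. on the order of billions

# One route is the sequence of expert indices chosen layer by layer.
route = []
for _ in range(8):
    route.append(space_expert([0.1, 0.7, 0.2, 0.0]))  # made-up gate scores
    route.append(time_expert(250, 1000, 4))           # timestep 250 of 1000
```

Under these assumed numbers, a modest stack of small expert pools already yields a route count in the billions, which is the combinatorial effect the abstract describes.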
