RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

Abstract

Text-to-image generation has recently witnessed remarkable achievements. We introduce a text-conditional image diffusion model, termed RAPHAEL, to generate highly artistic images that accurately portray text prompts encompassing multiple nouns, adjectives, and verbs. This is achieved by stacking tens of mixture-of-experts (MoE) layers, i.e., space-MoE and time-MoE layers, enabling billions of diffusion paths (routes) from the network input to the output. Each path intuitively functions as a "painter" that depicts a particular textual concept onto a specified image region at a diffusion timestep. Comprehensive experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both image quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior performance in switching images across diverse styles, such as Japanese comics, realism, cyberpunk, and ink illustration. Secondly, a single model with three billion parameters, trained on 1,000 A100 GPUs for two months, achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore, RAPHAEL significantly surpasses its counterparts in human evaluation on the ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the frontiers of image generation research in both academia and industry, paving the way for future breakthroughs in this rapidly evolving field. More details can be found on the project webpage: https://raphael-painter.github.io/.
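To give a feel for how stacking MoE layers can produce billions of routes, here is a minimal sketch of the counting argument. It assumes top-1 routing (each layer picks exactly one expert), so the number of distinct input-to-output routes is the product of the per-layer expert counts. The function name `count_routes` and the configuration of 10 layers with 8 experts each are hypothetical values chosen for illustration only; they are not RAPHAEL's actual layer or expert counts.

```python
from math import prod


def count_routes(experts_per_layer):
    """Count distinct routes through a stack of MoE layers.

    Assumes top-1 routing: every layer selects exactly one expert,
    so the total is the product of the expert counts per layer.
    """
    return prod(experts_per_layer)


# Hypothetical configuration (NOT taken from the paper):
# 10 stacked MoE layers, each choosing among 8 experts.
hypothetical_layers = [8] * 10
routes = count_routes(hypothetical_layers)
print(f"{routes:,}")  # 8**10 = 2**30 = 1,073,741,824
```

Even this small illustrative stack already yields over a billion routes, which suggests why a few dozen MoE layers are enough to reach the scale described in the abstract.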
