RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

Abstract

Text-to-image generation has recently seen remarkable progress. We introduce RAPHAEL, a text-conditional image diffusion model that generates highly artistic images which accurately portray text prompts containing multiple nouns, adjectives, and verbs. This is achieved by stacking tens of mixture-of-experts (MoE) layers, namely space-MoE and time-MoE layers, which enable billions of diffusion paths (routes) from the network input to the output. Intuitively, each path acts as a "painter" that depicts a particular textual concept onto a specified image region at a given diffusion timestep. Comprehensive experiments show that RAPHAEL outperforms recent cutting-edge models, such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in both image quality and aesthetic appeal. First, RAPHAEL exhibits superior performance in switching images across diverse styles, such as Japanese comics, realism, cyberpunk, and ink illustration. Second, a single model with three billion parameters, trained on 1,000 A100 GPUs for two months, achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore, RAPHAEL significantly surpasses its counterparts in human evaluation on the ViLG-300 benchmark. We believe that RAPHAEL has the potential to advance the frontiers of image generation research in both academia and industry, paving the way for future breakthroughs in this rapidly evolving field. More details can be found on the project webpage: https://raphael-painter.github.io/.
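As a rough illustration of how stacking MoE layers can yield billions of routes, the sketch below counts paths under a simplifying assumption: every layer independently selects exactly one expert, so the number of routes is the product of the expert counts. The layer and expert counts are hypothetical placeholders chosen only to show the order of magnitude; they are not RAPHAEL's actual configuration, and `count_paths` is an illustrative helper, not part of any released code.

```python
from math import prod


def count_paths(experts_per_layer):
    """Count distinct routes when each layer picks exactly one expert.

    Simplifying assumption: choices are independent across layers, so the
    total is the product of the per-layer expert counts.
    """
    return prod(experts_per_layer)


# Hypothetical configuration (NOT taken from the paper):
# 16 stacked MoE layers with 4 experts each.
hypothetical_layers = [4] * 16

total_routes = count_paths(hypothetical_layers)
print(f"{total_routes:,} possible routes")  # 4**16 = 2**32 = 4,294,967,296
```

Even this small hypothetical stack gives about 4.3 billion routes, which is why modest per-layer expert counts suffice to reach the "billions of paths" regime once tens of layers are stacked.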
