CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

Abstract

Recent advances in text-to-image generative systems have been driven largely by diffusion models. However, single-stage text-to-image diffusion models still face challenges in computational efficiency and in the refinement of image details. To address these issues, we propose CogView3, a cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model to implement relay diffusion for text-to-image generation: it first creates low-resolution images and then applies relay-based super-resolution. This approach not only produces competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while using only 1/10 of the inference time of SDXL.
