PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion

Abstract

Current large-scale diffusion models represent a giant leap forward in conditional image synthesis, capable of interpreting diverse cues such as text, human poses, and edges. However, their reliance on substantial computational resources and extensive data collection remains a bottleneck. Moreover, integrating existing diffusion models, each specialized for a different control and operating in its own latent space, is difficult because incompatible image resolutions and latent space embedding structures hinder their joint use. To address these constraints, we present "PanGu-Draw", a novel latent diffusion model designed for resource-efficient text-to-image synthesis that adeptly accommodates multiple control signals. First, we propose a resource-efficient Time-Decoupling Training Strategy, which splits the monolithic text-to-image model into a structure generator and a texture generator. Each generator is trained with a regimen that maximizes data utilization and computational efficiency, cutting data preparation by 48% and reducing training resources by 51%. Second, we introduce "Coop-Diffusion", an algorithm that enables the cooperative use of various pre-trained diffusion models with different latent spaces and predefined resolutions within a unified denoising process. This allows multi-control image synthesis at arbitrary resolutions without additional data or retraining. Empirical validation of PanGu-Draw shows its exceptional prowess in text-to-image and multi-control image generation, suggesting a promising direction for future model training efficiency and generation versatility. The largest 5B T2I PanGu-Draw model is released on the Ascend platform. Project page: https://pangu-draw.github.io
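
The idea of splitting one text-to-image model into a structure generator and a texture generator can be illustrated with a toy sampling loop. The sketch below is not the PanGu-Draw implementation. It assumes (the abstract does not state this) that the structure generator handles the early, high-noise timesteps and the texture generator handles the remaining low-noise ones. The names `make_time_decoupled_denoiser`, `t_switch`, `toy_structure`, and `toy_texture` are hypothetical, and the "denoisers" are plain functions standing in for trained networks.

```python
from typing import Callable, List, Tuple

# A denoiser maps (noisy sample, timestep) to a slightly cleaner sample.
Denoiser = Callable[[List[float], int], List[float]]


def make_time_decoupled_denoiser(
    structure: Denoiser, texture: Denoiser, t_switch: int
) -> Denoiser:
    """Route each step by timestep (assumed split rule, see lead-in).

    Timesteps >= t_switch (high noise) go to `structure`;
    timesteps < t_switch (low noise) go to `texture`.
    """

    def step(x: List[float], t: int) -> List[float]:
        model = structure if t >= t_switch else texture
        return model(x, t)

    return step


def sample(x: List[float], denoise: Denoiser, num_steps: int) -> List[float]:
    """Run the reverse process from t = num_steps - 1 down to t = 0."""
    for t in range(num_steps - 1, -1, -1):
        x = denoise(x, t)
    return x


# Toy stand-ins that record which model handled which timestep.
calls: List[Tuple[str, int]] = []


def toy_structure(x: List[float], t: int) -> List[float]:
    calls.append(("structure", t))
    return [0.5 * v for v in x]  # coarse, large correction


def toy_texture(x: List[float], t: int) -> List[float]:
    calls.append(("texture", t))
    return [0.9 * v for v in x]  # fine, small correction


denoise = make_time_decoupled_denoiser(toy_structure, toy_texture, t_switch=6)
result = sample([1.0, -2.0], denoise, num_steps=10)
```

With `num_steps=10` and `t_switch=6`, the recorded `calls` show the structure stand-in used for timesteps 9 through 6 and the texture stand-in for timesteps 5 through 0. Because each model only ever sees its own timestep range, the two could in principle be trained separately, which is the property the strategy relies on.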
