PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion

Abstract

Current large-scale diffusion models represent a major advance in conditional image synthesis and can interpret diverse cues such as text, human poses, and edges. However, their reliance on substantial computational resources and extensive data collection remains a bottleneck. At the same time, existing diffusion models, each specialized for a different control and operating in its own latent space, are difficult to integrate: incompatible image resolutions and latent space embedding structures hinder their joint use. To address these constraints, we present "PanGu-Draw", a novel latent diffusion model designed for resource-efficient text-to-image synthesis that accommodates multiple control signals. First, we propose a resource-efficient Time-Decoupling Training Strategy, which splits the monolithic text-to-image model into a structure generator and a texture generator. Each generator is trained with a regimen that maximizes data utilization and computational efficiency, cutting data preparation by 48% and training resources by 51%. Second, we introduce "Coop-Diffusion", an algorithm that enables the cooperative use of various pre-trained diffusion models with different latent spaces and predefined resolutions within a unified denoising process. This allows multi-control image synthesis at arbitrary resolutions without additional data or retraining. Empirical validation of PanGu-Draw shows strong performance in text-to-image and multi-control image generation, suggesting a promising direction for future training efficiency and generation versatility. The largest 5B T2I PanGu-Draw model is released on the Ascend platform. Project page: https://pangu-draw.github.io
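
As a rough illustration of the time-decoupling idea, the sketch below routes each denoising step to one of two generators depending on the timestep. The abstract does not say which timesteps each generator covers or how they are trained, so the split used here (structure generator on the early, high-noise steps; texture generator on the remaining steps) is an assumption. The names `time_decoupled_sample`, `switch_step`, and `make_toy_denoiser` are hypothetical, and the "denoisers" are toy stand-ins, not the PanGu-Draw models.

```python
from typing import Callable, List, Tuple

# A denoiser maps (latent, timestep) -> updated latent.
Denoiser = Callable[[List[float], int], List[float]]


def make_toy_denoiser(scale: float) -> Denoiser:
    """Return a stand-in denoiser that just rescales the latent (not a real model)."""
    def step(latent: List[float], t: int) -> List[float]:
        return [v * scale for v in latent]
    return step


def time_decoupled_sample(
    latent: List[float],
    num_steps: int,
    switch_step: int,          # hypothetical boundary between the two generators
    structure_gen: Denoiser,
    texture_gen: Denoiser,
) -> Tuple[List[float], List[str]]:
    """Run timesteps num_steps-1 .. 0, dispatching each step by timestep.

    Assumption: steps with t >= switch_step go to the structure generator,
    all later (lower-noise) steps go to the texture generator.
    """
    trace: List[str] = []
    for t in range(num_steps - 1, -1, -1):
        if t >= switch_step:
            latent = structure_gen(latent, t)
            trace.append("structure")
        else:
            latent = texture_gen(latent, t)
            trace.append("texture")
    return latent, trace


if __name__ == "__main__":
    final_latent, trace = time_decoupled_sample(
        latent=[1.0, 2.0],
        num_steps=10,
        switch_step=6,
        structure_gen=make_toy_denoiser(0.5),
        texture_gen=make_toy_denoiser(0.9),
    )
    print(trace)
```

Because each generator only ever sees its own range of timesteps, the two can in principle be trained separately on different data and at different cost, which is the property the abstract attributes to the strategy.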
