InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis

Abstract

Arbitrary-resolution image generation provides a consistent visual experience across devices and has broad applications for both producers and consumers. In current diffusion models, computational demand grows quadratically with resolution, so generating a 4K image takes more than 100 seconds. To address this, we explore a second generation stage built on latent diffusion models: the fixed-size latent produced by the diffusion model is treated as a content representation, and a one-step generator decodes images at arbitrary resolution from this compact latent. We present InfGen, which replaces the VAE decoder with this new generator so that images can be produced at any resolution from a fixed-size latent without retraining the diffusion model. This simplifies the pipeline, reduces computational complexity, and applies to any model that uses the same latent space. Experiments show that InfGen brings many existing models into the arbitrary high-resolution regime while cutting 4K image generation time to under 10 seconds.