REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Abstract

In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom holds that end-to-end training is often preferable when possible. However, for latent diffusion transformers, training both the VAE and the diffusion model end-to-end with the standard diffusion loss is observed to be ineffective, and even degrades final performance. We show that although the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss, which allows both the VAE and the diffusion model to be jointly tuned during training. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over 17x and 45x compared with the REPA and vanilla training recipes, respectively. Interestingly, end-to-end tuning with REPA-E also improves the VAE itself, leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state of the art, achieving FIDs of 1.26 and 1.83 with and without classifier-free guidance, respectively, on ImageNet 256 × 256. Code is available at https://end2end-diffusion.github.io.

