LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models

Abstract

Advances in latent diffusion models (LDMs) have revolutionized high-resolution image generation, but the design space of the autoencoder that is central to these systems remains underexplored. In this paper, we introduce LiteVAE, a family of autoencoders for LDMs that leverage the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality. We also investigate the training methodologies and the decoder architecture of LiteVAE and propose several enhancements that improve the training dynamics and reconstruction quality. Our base LiteVAE model matches the quality of the established VAEs in current LDMs with a six-fold reduction in encoder parameters, leading to faster training and lower GPU memory requirements, while our larger model outperforms VAEs of comparable complexity across all evaluated metrics (rFID, LPIPS, PSNR, and SSIM).
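As a rough illustration of the 2D discrete wavelet transform mentioned in the abstract, the sketch below computes one level of a Haar transform with NumPy. It is not taken from LiteVAE: the choice of the Haar wavelet, the single decomposition level, and the function names `haar_dwt2` and `haar_idwt2` are all assumptions made here for illustration, since the abstract does not specify the wavelet or how its coefficients are fed to the encoder. The point is only that the transform splits an image into four subbands at half the spatial resolution and can be inverted exactly.

```python
import numpy as np


def haar_dwt2(image):
    """One level of an orthonormal 2D Haar DWT (illustrative, hypothetical helper).

    Takes an (H, W) array with even H and W and returns four (H/2, W/2) subbands.
    """
    x = np.asarray(image, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even height and width")
    # The four pixels of every non-overlapping 2x2 block.
    a = x[0::2, 0::2]  # top-left
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0    # low-pass in both directions
    col_diff = (a - b + c - d) / 2.0  # left column minus right column
    row_diff = (a + b - c - d) / 2.0  # top row minus bottom row
    diag = (a - b - c + d) / 2.0      # diagonal detail
    return approx, col_diff, row_diff, diag


def haar_idwt2(approx, col_diff, row_diff, diag):
    """Exact inverse of haar_dwt2 (illustrative, hypothetical helper)."""
    h, w = approx.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (approx + col_diff + row_diff + diag) / 2.0
    x[0::2, 1::2] = (approx - col_diff + row_diff - diag) / 2.0
    x[1::2, 0::2] = (approx + col_diff - row_diff - diag) / 2.0
    x[1::2, 1::2] = (approx - col_diff - row_diff + diag) / 2.0
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((8, 8))
    bands = haar_dwt2(img)
    print([band.shape for band in bands])
    print(np.allclose(haar_idwt2(*bands), img))
```

Subband naming conventions (for example LH versus HL) differ between libraries, so the detail bands are labeled here by what they measure rather than by those abbreviations.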
