Pixel-Space Post-Training of Latent Diffusion Models

Abstract

Latent diffusion models (LDMs) have made significant advances in image generation in recent years. One major advantage of LDMs is that they operate in a compressed latent space, which allows for more efficient training and deployment. Despite these advantages, challenges remain: LDMs have been observed to often generate high-frequency details and complex compositions imperfectly. We hypothesize that one reason for these flaws is that all pre- and post-training of LDMs is done in latent space, which typically has 8 × 8 lower spatial resolution than the output images. To address this issue, we propose adding pixel-space supervision to the post-training process to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective improves both supervised quality fine-tuning and preference-based post-training by a large margin on a state-of-the-art DiT and on U-Net diffusion models, in both visual quality and visual flaw metrics, while maintaining the same text alignment quality.
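The abstract does not spell out the form of the pixel-space objective, so the following is only a rough, dependency-free sketch of one plausible reading: keep a latent-space term and add a second term computed after decoding the predicted latent to pixel space. Everything named here is an assumption made for illustration and not the paper's method: `toy_decode` (nearest-neighbour upsampling standing in for a real VAE decoder), the use of mean squared error for both terms, the `pixel_weight` parameter, and the reading of "8 × 8" as a factor of 8 per side.

```python
from typing import Callable, List

Grid = List[List[float]]  # single-channel 2D array as nested lists (toy setting)

DOWNSAMPLE = 8  # assumed per-side factor between pixel and latent grids


def toy_decode(latent: Grid, factor: int = DOWNSAMPLE) -> Grid:
    """Hypothetical stand-in for a VAE decoder: nearest-neighbour upsampling."""
    return [
        [value for value in row for _ in range(factor)]
        for row in latent
        for _ in range(factor)
    ]


def mse(a: Grid, b: Grid) -> float:
    """Mean squared error between two grids of identical shape."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def combined_loss(
    pred_latent: Grid,
    target_latent: Grid,
    target_pixels: Grid,
    decode: Callable[[Grid], Grid] = toy_decode,
    pixel_weight: float = 1.0,  # hypothetical weighting, not taken from the paper
) -> float:
    """Latent-space error plus a weighted pixel-space error on the decoded prediction."""
    latent_term = mse(pred_latent, target_latent)
    pixel_term = mse(decode(pred_latent), target_pixels)
    return latent_term + pixel_weight * pixel_term


# Toy usage: a 2x2 latent corresponds to a 16x16 pixel grid under the assumed factor.
pred = [[0.0, 1.0], [1.0, 0.0]]
target_latent = [[0.0, 0.0], [0.0, 0.0]]
target_pixels = toy_decode(target_latent)
loss = combined_loss(pred, target_latent, target_pixels)
```

Setting `pixel_weight=0.0` recovers a purely latent-space objective, which makes it easy to compare the two regimes in this toy setting. With a real decoder the pixel term would also be sensitive to detail that the latent grid cannot represent directly, which is the motivation the abstract gives.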
