Pixel-Space Post-Training of Latent Diffusion Models
September 26, 2024
Authors: Christina Zhang, Simran Motwani, Matthew Yu, Ji Hou, Felix Juefei-Xu, Sam Tsai, Peter Vajda, Zijian He, Jialiang Wang
cs.AI
Abstract
Latent diffusion models (LDMs) have made significant advances in the field of image generation in recent years. One major advantage of LDMs is their ability to operate in a compressed latent space, allowing for more efficient training and deployment. However, despite these advantages, challenges with LDMs still remain. For example, it has been observed that LDMs often generate high-frequency details and complex compositions imperfectly. We hypothesize that one reason for these flaws is that all pre- and post-training of LDMs is done in latent space, which is typically 8× lower in spatial resolution than the output images. To address this issue, we propose adding pixel-space supervision in the post-training process to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective significantly improves both supervised quality fine-tuning and preference-based post-training on a state-of-the-art DiT transformer and U-Net diffusion models, in both visual quality and visual flaw metrics, while maintaining the same text alignment quality.
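The abstract's core idea — adding a pixel-space term that decodes latents to images and supervises there, alongside the usual latent-space objective — can be sketched as follows. This is a minimal illustrative toy, not the paper's actual method: `decode` here is a hypothetical stand-in (nearest-neighbor 8× upsample followed by a tanh nonlinearity) for a real learned VAE decoder, and the weighting `lambda_pix` is an assumed hyperparameter.

```python
import numpy as np

def decode(latent):
    # Toy stand-in for a VAE decoder: upsample the latent by 8x in each
    # spatial dimension (matching the typical LDM compression factor),
    # then apply a nonlinearity. A real LDM uses a learned decoder.
    upsampled = latent.repeat(8, axis=-2).repeat(8, axis=-1)
    return np.tanh(upsampled)

def combined_loss(pred_latent, target_latent, lambda_pix=0.1):
    # Standard latent-space objective: MSE between predicted and target latents.
    latent_loss = np.mean((pred_latent - target_latent) ** 2)
    # Added pixel-space supervision: decode both latents to (toy) pixel space
    # and compare there, so that errors only visible at full output resolution
    # also contribute to the training signal.
    pixel_loss = np.mean((decode(pred_latent) - decode(target_latent)) ** 2)
    return latent_loss + lambda_pix * pixel_loss
```

In a real post-training setup, the decoded comparison would be against ground-truth images (or folded into the preference/fine-tuning objective), and the pixel term could use a perceptual loss rather than plain MSE; the sketch only shows the structural idea of mixing latent-space and pixel-space terms.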