Pixel-Space Post-Training of Latent Diffusion Models

September 26, 2024
Authors: Christina Zhang, Simran Motwani, Matthew Yu, Ji Hou, Felix Juefei-Xu, Sam Tsai, Peter Vajda, Zijian He, Jialiang Wang
cs.AI

Abstract

Latent diffusion models (LDMs) have made significant advancements in the field of image generation in recent years. One major advantage of LDMs is their ability to operate in a compressed latent space, allowing for more efficient training and deployment. However, despite these advantages, challenges with LDMs still remain. For example, it has been observed that LDMs often generate high-frequency details and complex compositions imperfectly. We hypothesize that one reason for these flaws is that all pre- and post-training of LDMs are done in latent space, which is typically 8×8 lower in spatial resolution than the output images. To address this issue, we propose adding pixel-space supervision in the post-training process to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective significantly improves both supervised quality fine-tuning and preference-based post-training by a large margin on both a state-of-the-art DiT transformer and U-Net diffusion models, in both visual quality and visual flaw metrics, while maintaining the same text alignment quality.
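
The abstract describes adding a pixel-space objective on top of the usual latent-space training signal during post-training. The sketch below is a minimal illustration of one way such a term could be combined with a standard epsilon-prediction loss, assuming a frozen VAE decoder; the function, the `lambda_pix` weight, and the `unet(noisy, t)` / `vae_decode(x0_pred)` signatures are hypothetical and not taken from the paper, which applies the idea within its supervised quality fine-tuning and preference-based post-training recipes.

```python
import torch
import torch.nn.functional as F

def post_training_loss(unet, vae_decode, x_latent, x_pixels, t, noise,
                       alpha_bar, lambda_pix=0.1):
    """Hypothetical combined latent- and pixel-space objective (epsilon-prediction)."""
    a = alpha_bar[t].view(-1, 1, 1, 1)                      # cumulative noise schedule for batch timesteps
    noisy = a.sqrt() * x_latent + (1 - a).sqrt() * noise    # forward diffusion applied in latent space
    eps_pred = unet(noisy, t)                               # model predicts the added noise
    latent_loss = F.mse_loss(eps_pred, noise)               # standard latent-space objective

    # Recover the predicted clean latent and decode it to pixels,
    # so supervision can also be applied at full output resolution.
    x0_pred = (noisy - (1 - a).sqrt() * eps_pred) / a.sqrt()
    pixels_pred = vae_decode(x0_pred)                       # frozen VAE decoder (assumed)
    pixel_loss = F.mse_loss(pixels_pred, x_pixels)          # pixel-space supervision term

    return latent_loss + lambda_pix * pixel_loss
```

The extra decode step makes the objective sensitive to high-frequency detail that the 8×8-downsampled latent cannot fully represent, which is the motivation the abstract gives for pixel-space supervision.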
