Pixel-Space Post-Training of Latent Diffusion Models
September 26, 2024
Authors: Christina Zhang, Simran Motwani, Matthew Yu, Ji Hou, Felix Juefei-Xu, Sam Tsai, Peter Vajda, Zijian He, Jialiang Wang
cs.AI
Abstract
Latent diffusion models (LDMs) have made significant advances in the field of image generation in recent years. One major advantage of LDMs is their ability to operate in a compressed latent space, allowing for more efficient training and deployment. However, despite these advantages, challenges with LDMs still remain. For example, it has been observed that LDMs often generate high-frequency details and complex compositions imperfectly. We hypothesize that one reason for these flaws is that all pre- and post-training of LDMs is done in latent space, which is typically 8× lower in spatial resolution than the output images. To address this issue, we propose adding pixel-space supervision in the post-training process to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective significantly improves both supervised quality fine-tuning and preference-based post-training on a state-of-the-art DiT transformer and U-Net diffusion models, in both visual quality and visual flaw metrics, while maintaining the same text alignment quality.
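The abstract's core idea — adding a pixel-space term that decodes latents to images and supervises there, alongside the usual latent-space objective — can be sketched as follows. This is a minimal illustrative toy, not the paper's actual method: `decode` here is a hypothetical stand-in (nearest-neighbor 8× upsample followed by a tanh nonlinearity) for a real learned VAE decoder, and the weighting `lambda_pix` is an assumed hyperparameter.

```python
import numpy as np

def decode(latent):
    # Toy stand-in for a VAE decoder: upsample the latent by 8x in each
    # spatial dimension (matching the typical LDM compression factor),
    # then apply a nonlinearity. A real LDM uses a learned decoder.
    upsampled = latent.repeat(8, axis=-2).repeat(8, axis=-1)
    return np.tanh(upsampled)

def combined_loss(pred_latent, target_latent, lambda_pix=0.1):
    # Standard latent-space objective: MSE between predicted and target latents.
    latent_loss = np.mean((pred_latent - target_latent) ** 2)
    # Added pixel-space supervision: decode both latents to (toy) pixel space
    # and compare there, so that errors only visible at full output resolution
    # also contribute to the training signal.
    pixel_loss = np.mean((decode(pred_latent) - decode(target_latent)) ** 2)
    return latent_loss + lambda_pix * pixel_loss
```

In a real post-training setup, the decoded comparison would be against ground-truth images (or folded into the preference/fine-tuning objective), and the pixel term could use a perceptual loss rather than plain MSE; the sketch only shows the structural idea of mixing latent-space and pixel-space terms.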