Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training
October 14, 2025
Authors: Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, Xiangxiang Chu
cs.AI
Abstract
Pixel-space generative models are often more difficult to train and generally
underperform compared to their latent-space counterparts, leaving a persistent
performance and efficiency gap. In this paper, we introduce a novel two-stage
training framework that closes this gap for pixel-space diffusion and
consistency models. In the first stage, we pre-train encoders to capture
meaningful semantics from clean images while aligning them with points along
the same deterministic sampling trajectory, which evolves points from the prior
to the data distribution. In the second stage, we integrate the encoder with a
randomly initialized decoder and fine-tune the complete model end-to-end for
both diffusion and consistency models. Our training framework demonstrates
strong empirical performance on the ImageNet dataset. Specifically, our diffusion
model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75
function evaluations (NFE), surpassing prior pixel-space methods by a
large margin in both generation quality and efficiency while rivaling leading
VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our
consistency model achieves an impressive FID of 8.82 in a single sampling step,
significantly surpassing its latent-space counterpart. To the best of our
knowledge, this marks the first successful training of a consistency model
directly on high-resolution images without relying on pre-trained VAEs or
diffusion models.
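To make the two-stage recipe concrete, below is a minimal sketch in PyTorch. The module names, the straight-line trajectory x_t = (1 - t) x_0 + t eps, the cosine-similarity alignment loss, and the velocity-matching fine-tuning objective are all illustrative assumptions made for this sketch, not the paper's exact architecture or losses.

```python
# Hedged sketch of the two-stage framework described in the abstract.
# Assumptions (not from the paper): tiny MLP encoder/decoder, a linear
# (flow-matching style) trajectory from prior to data, cosine alignment
# in stage 1, and a velocity-matching diffusion loss in stage 2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, dim=3 * 32 * 32, feat=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(dim, feat), nn.SiLU(), nn.Linear(feat, feat)
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, dim=3 * 32 * 32, feat=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat, feat), nn.SiLU(), nn.Linear(feat, dim))

    def forward(self, h, shape):
        return self.net(h).view(shape)

def stage1_alignment_loss(encoder, x0):
    # Stage 1: pre-train the encoder to capture semantics from the clean
    # image x0 while aligning them with a point on the same deterministic
    # trajectory (assumed linear interpolation here) from prior to data.
    eps = torch.randn_like(x0)
    t = torch.rand(x0.size(0), 1, 1, 1, device=x0.device)
    xt = (1 - t) * x0 + t * eps
    h_clean = encoder(x0)
    h_traj = encoder(xt)
    return 1 - F.cosine_similarity(h_clean.detach(), h_traj, dim=-1).mean()

def stage2_diffusion_loss(encoder, decoder, x0):
    # Stage 2: attach a randomly initialized decoder and fine-tune the
    # complete model end-to-end (velocity target eps - x0 for the assumed
    # linear trajectory; the consistency-model variant would swap the loss).
    eps = torch.randn_like(x0)
    t = torch.rand(x0.size(0), 1, 1, 1, device=x0.device)
    xt = (1 - t) * x0 + t * eps
    v_pred = decoder(encoder(xt), x0.shape)
    return F.mse_loss(v_pred, eps - x0)

if __name__ == "__main__":
    enc, dec = Encoder(), Decoder()
    x0 = torch.randn(8, 3, 32, 32)  # stand-in for a batch of clean images
    print("stage-1 loss:", stage1_alignment_loss(enc, x0).item())
    print("stage-2 loss:", stage2_diffusion_loss(enc, dec, x0).item())
```

In this sketch the stage-1 loss trains the encoder alone; stage 2 then optimizes encoder and decoder jointly, mirroring the pre-train-then-fine-tune split the abstract describes.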