Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training
October 14, 2025
Authors: Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, Xiangxiang Chu
cs.AI
Abstract
Pixel-space generative models are often more difficult to train and generally
underperform compared to their latent-space counterparts, leaving a persistent
performance and efficiency gap. In this paper, we introduce a novel two-stage
training framework that closes this gap for pixel-space diffusion and
consistency models. In the first stage, we pre-train encoders to capture
meaningful semantics from clean images while aligning them with points along
the same deterministic sampling trajectory, which evolves points from the prior
to the data distribution. In the second stage, we integrate the encoder with a
randomly initialized decoder and fine-tune the complete model end-to-end for
both diffusion and consistency models. Our training framework demonstrates
strong empirical performance on the ImageNet dataset. Specifically, our diffusion
model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 using
75 function evaluations (NFE), surpassing prior pixel-space methods by a
large margin in both generation quality and efficiency while rivaling leading
VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our
consistency model achieves an impressive FID of 8.82 in a single sampling step,
significantly surpassing its latent-space counterpart. To the best of our
knowledge, this marks the first successful training of a consistency model
directly on high-resolution images without relying on pre-trained VAEs or
diffusion models.
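
To make the pipeline concrete, below is a minimal sketch of how the two stages and single-step sampling could be wired up in PyTorch. It is not the authors' implementation: the Encoder and Decoder modules, the straight-line (flow-matching-style) stand-in for the deterministic sampling trajectory, and the cosine-alignment and velocity-matching losses are all illustrative assumptions, since the abstract does not specify architectures or objectives.

```python
# Illustrative sketch only; every module and loss below is an assumption,
# not the paper's actual architecture or objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Toy conv encoder standing in for the semantic encoder of stage 1."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.net(x)  # (B, dim) feature vector


class Decoder(nn.Module):
    """Randomly initialized head added in stage 2: maps a noisy image,
    encoder features, and the timestep to a velocity prediction."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.cond = nn.Linear(dim + 1, 64)
        self.out = nn.Sequential(
            nn.Conv2d(3 + 64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, x_t, h, t):
        c = self.cond(torch.cat([h, t[:, None]], dim=1))        # (B, 64)
        c = c[:, :, None, None].expand(-1, -1, *x_t.shape[2:])  # broadcast over H x W
        return self.out(torch.cat([x_t, c], dim=1))


def trajectory_point(x0, noise, t):
    # Point on a straight line from data (t=0) to prior noise (t=1); a
    # stand-in for the paper's deterministic sampling trajectory.
    t = t[:, None, None, None]
    return (1 - t) * x0 + t * noise


def stage1_loss(encoder, x0):
    # Stage 1: align features of a trajectory point with the semantics of
    # the clean image (stop-gradient target).
    noise, t = torch.randn_like(x0), torch.rand(x0.shape[0], device=x0.device)
    h_t = encoder(trajectory_point(x0, noise, t))
    with torch.no_grad():
        h_0 = encoder(x0)  # clean-image semantics as the alignment target
    return 1 - F.cosine_similarity(h_t, h_0, dim=1).mean()


def stage2_loss(encoder, decoder, x0):
    # Stage 2: fine-tune encoder + decoder end-to-end with a denoising
    # objective; a flow-matching velocity target is used for concreteness.
    noise, t = torch.randn_like(x0), torch.rand(x0.shape[0], device=x0.device)
    x_t = trajectory_point(x0, noise, t)
    v_pred = decoder(x_t, encoder(x_t), t)
    return F.mse_loss(v_pred, noise - x0)


@torch.no_grad()
def sample_one_step(encoder, decoder, shape, device="cpu"):
    # Single-NFE sampling in the spirit of a consistency model: one forward
    # pass maps prior noise at the terminal time to a sample (here a single
    # Euler step along the linear path, purely for illustration).
    noise = torch.randn(shape, device=device)
    t = torch.ones(shape[0], device=device)
    return noise - decoder(noise, encoder(noise), t)


if __name__ == "__main__":
    enc, dec = Encoder(), Decoder()
    x = torch.randn(4, 3, 32, 32)  # toy batch; the paper trains on ImageNet
    print("stage-1 loss:", stage1_loss(enc, x).item())
    print("stage-2 loss:", stage2_loss(enc, dec, x).item())
    print("one-step sample:", sample_one_step(enc, dec, (4, 3, 32, 32)).shape)
```

Reusing the same encoder in `stage2_loss` mirrors the design the abstract describes: the pre-trained encoder is carried into stage 2 while only the decoder starts from random initialization, and the full model is then fine-tuned end-to-end.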