Latent Diffusion Model without Variational Autoencoder
October 17, 2025
Authors: Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu
cs.AI
Abstract
Recent progress in diffusion-based visual generation has largely relied on
latent diffusion models with variational autoencoders (VAEs). While effective
for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited
training efficiency, slow inference, and poor transferability to broader vision
tasks. These issues stem from a key limitation of VAE latent spaces: the lack
of clear semantic separation and strong discriminative structure. Our analysis
confirms that these properties are crucial not only for perception and
understanding tasks, but also for the stable and efficient training of latent
diffusion models. Motivated by this insight, we introduce SVG, a novel latent
diffusion model without variational autoencoders, which leverages
self-supervised representations for visual generation. SVG constructs a feature
space with clear semantic discriminability by leveraging frozen DINO features,
while a lightweight residual branch captures fine-grained details for
high-fidelity reconstruction. Diffusion models are trained directly on this
semantically structured latent space to facilitate more efficient learning. As
a result, SVG enables accelerated diffusion training, supports few-step
sampling, and improves generative quality. Experimental results further show
that SVG preserves the semantic and discriminative capabilities of the
underlying self-supervised representations, providing a principled pathway
toward task-general, high-quality visual representations.
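To make the described construction concrete, here is a toy sketch of how an SVG-style latent might be assembled: a frozen semantic encoder (standing in for DINO) provides the discriminative part, a small trainable residual branch adds fine-grained detail, and the two are concatenated into the latent that a diffusion model would be trained on. All shapes, names, and the use of fixed random projections are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_semantic_features(x, W_frozen):
    # Stand-in for a frozen DINO encoder: weights are fixed and never
    # updated during diffusion training (hypothetical linear projection).
    return x @ W_frozen

def residual_branch(x, W_res):
    # Lightweight trainable branch capturing fine-grained detail
    # needed for high-fidelity reconstruction (hypothetical).
    return x @ W_res

def svg_latent(x, W_frozen, W_res):
    # SVG-style latent: semantically structured (frozen) features
    # concatenated with residual (trainable) features along channels.
    sem = frozen_semantic_features(x, W_frozen)
    res = residual_branch(x, W_res)
    return np.concatenate([sem, res], axis=-1)

# Toy data: two 16-dimensional "image" vectors.
x = rng.normal(size=(2, 16))
W_frozen = rng.normal(size=(16, 8))        # frozen, not trained
W_res = rng.normal(size=(16, 4)) * 0.01    # small trainable branch
z = svg_latent(x, W_frozen, W_res)
print(z.shape)  # (2, 12): 8 semantic dims + 4 residual dims
```

The key design point this sketch mirrors is that only the residual branch is learned; the semantic structure of the latent space is inherited unchanged from the self-supervised encoder, which is what the abstract credits for faster diffusion training and preserved discriminative ability.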