Latent Diffusion Model without Variational Autoencoder

October 17, 2025
Authors: Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu
cs.AI

Abstract

Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.
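To make the abstract's core mechanism concrete, below is a minimal PyTorch sketch of a VAE-free latent construction of the kind described: a frozen self-supervised backbone (e.g. DINO) supplies a semantically structured latent, a lightweight residual branch adds the fine-grained detail needed for faithful reconstruction, and the diffusion objective is applied directly in that latent space. All module names, shapes, the fusion scheme, and the noise schedule here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an SVG-style, VAE-free latent encoder and a
# standard epsilon-prediction diffusion loss applied in its latent space.
import torch
import torch.nn as nn


class SemanticLatentEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 768, res_dim: int = 64):
        super().__init__()
        self.backbone = backbone.eval()  # frozen: no gradients, no updates
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        # Lightweight residual branch; two strided convs give a 16x
        # downsampling that matches a ViT patch grid (assumption).
        self.residual = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=4),
            nn.SiLU(),
            nn.Conv2d(32, res_dim, kernel_size=4, stride=4),
        )
        self.fuse = nn.Linear(feat_dim + res_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes the backbone returns patch tokens of shape (B, N, feat_dim).
        with torch.no_grad():
            sem = self.backbone(x)
        res = self.residual(x).flatten(2).transpose(1, 2)  # (B, N, res_dim)
        return self.fuse(torch.cat([sem, res], dim=-1))    # (B, N, feat_dim)


def diffusion_loss(denoiser: nn.Module, encoder: SemanticLatentEncoder,
                   x: torch.Tensor, T: int = 1000) -> torch.Tensor:
    """Epsilon-prediction objective computed directly in the semantic
    latent space -- no VAE encode/decode anywhere in the loop."""
    z = encoder(x)
    t = torch.randint(0, T, (z.size(0),), device=z.device)
    noise = torch.randn_like(z)
    abar = torch.cos(t.float() / T * torch.pi / 2).pow(2)  # cosine schedule
    abar = abar.view(-1, 1, 1)
    z_t = abar.sqrt() * z + (1 - abar).sqrt() * noise
    return nn.functional.mse_loss(denoiser(z_t, t), noise)
```

Freezing the backbone is what preserves the semantic and discriminative structure the abstract emphasizes; only the residual branch, the fusion layer, and the denoiser would be trained under this sketch.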