

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

May 8, 2026
作者: Zhengrong Yue, Taihang Hu, Mengting Chen, Haiyu Zhang, Zihao Pan, Tao Liu, Zikang Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Yali Wang
cs.AI

Abstract

Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are designed primarily to improve reconstruction fidelity or to inherit pretrained representations, so it remains unclear what kind of latent space is actually favorable for generative modeling. In this paper, we study this question from the perspective of latent manifold organization. By constructing controlled tokenizer variants, we identify three key properties of a diffusion-friendly latent manifold: coherent spatial structure, local manifold continuity, and global manifold semantics. We find that these properties correlate more strongly with downstream generation quality than reconstruction fidelity does. Motivated by this finding, we propose the Prior-Aligned AutoEncoder (PAE), which shapes the latent manifold explicitly instead of leaving a diffusion-friendly manifold to emerge indirectly from reconstruction or inheritance. Specifically, PAE leverages refined priors derived from vision foundation models (VFMs), together with perturbation-based regularization, to turn spatial structure, local continuity, and global semantics into explicit training objectives. On ImageNet 256x256, PAE improves both training efficiency and generation quality over existing tokenizers, reaching performance comparable to RAE under the same training setup while converging up to 13x faster, and achieving a new state-of-the-art gFID of 1.03. These results highlight the importance of organizing the latent manifold for latent diffusion models.
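
The abstract does not give the exact training objective, but one plausible reading is a combined loss: reconstruction, alignment of latents to refined VFM features (targeting spatial structure and global semantics), and a perturbation-based smoothness term (targeting local manifold continuity). Below is a minimal PyTorch sketch under those assumptions; the names `encoder`, `decoder`, `vfm`, `proj`, and the loss weights are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pae_style_loss(x, encoder, decoder, vfm, proj,
                   w_align=1.0, w_smooth=0.1, sigma=0.1):
    """Hypothetical PAE-style objective (a sketch, not the paper's code):
    reconstruction + VFM-prior alignment + perturbation smoothness."""
    z = encoder(x)                        # latent map, e.g. (B, C, h, w)
    x_hat = decoder(z)
    loss_rec = F.mse_loss(x_hat, x)       # reconstruction fidelity

    # Align latents with frozen VFM features: encourages coherent
    # spatial structure and global semantics in the latent manifold.
    with torch.no_grad():
        feat = vfm(x)                     # (B, D, H', W'), frozen VFM
    feat = F.adaptive_avg_pool2d(feat, z.shape[-2:])  # match latent grid
    loss_align = 1 - F.cosine_similarity(
        proj(z).flatten(2), feat.flatten(2), dim=1).mean()

    # Perturbation-based regularization: nearby latents should decode
    # to nearby images (local manifold continuity).
    z_pert = z + sigma * torch.randn_like(z)
    loss_smooth = F.mse_loss(decoder(z_pert), x_hat.detach())

    return loss_rec + w_align * loss_align + w_smooth * loss_smooth
```

Here `proj` is assumed to map the latent channels to the VFM feature dimension so the cosine alignment is well defined; how the paper refines the VFM priors and weights the three terms is not specified in the abstract.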