What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
May 8, 2026
Authors: Zhengrong Yue, Taihang Hu, Mengting Chen, Haiyu Zhang, Zihao Pan, Tao Liu, Zikang Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Yali Wang
cs.AI
Abstract
Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are designed primarily to improve reconstruction fidelity or to inherit pretrained representations, leaving it unclear what kind of latent space is truly friendly to generative modeling. In this paper, we study this question from the perspective of latent manifold organization. By constructing controlled tokenizer variants, we identify three key properties of a diffusion-friendly latent manifold: coherent spatial structure, local manifold continuity, and global manifold semantics. We find that these properties correlate more consistently with downstream generation quality than reconstruction fidelity does. Motivated by this finding, we propose the Prior-Aligned AutoEncoder (PAE), which explicitly shapes the latent manifold instead of leaving a diffusion-friendly manifold to emerge indirectly from reconstruction or inheritance. Specifically, PAE leverages refined priors distilled from visual foundation models (VFMs) and perturbation-based regularization to turn spatial structure, local continuity, and global semantics into explicit training objectives. On ImageNet 256×256, PAE improves both training efficiency and generation quality over existing tokenizers, matching the performance of RAE with up to 13× faster convergence under the same training setup and achieving a new state-of-the-art gFID of 1.03. These results highlight the importance of organizing the latent manifold for latent diffusion models.
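The abstract describes turning the three manifold properties into explicit training objectives via VFM-derived priors and perturbation-based regularization. The paper's exact losses are not given here, so the following PyTorch snippet is only an illustrative sketch of such a composite objective: a reconstruction term, a cosine-alignment term against frozen VFM features (standing in for spatial structure and global semantics), and a latent-perturbation term for local continuity. The function name, weights, and perturbation scale are all hypothetical.

```python
import torch
import torch.nn.functional as F

def pae_style_loss(x, x_rec, z, vfm_feat, decoder,
                   sigma=0.1, w_align=0.5, w_pert=0.1):
    """Hypothetical composite objective in the spirit of PAE's stated
    goals (not the paper's exact formulation)."""
    # 1) Reconstruction fidelity, as in a standard autoencoder.
    l_rec = F.mse_loss(x_rec, x)

    # 2) Align latent tokens with features from a frozen visual
    #    foundation model via cosine similarity, nudging the latent
    #    manifold toward coherent structure and global semantics.
    l_align = 1.0 - F.cosine_similarity(
        z.flatten(1), vfm_feat.flatten(1), dim=1).mean()

    # 3) Local manifold continuity: a small Gaussian perturbation of
    #    the latent should decode to a nearby output.
    z_pert = z + sigma * torch.randn_like(z)
    l_pert = F.mse_loss(decoder(z_pert), decoder(z))

    return l_rec + w_align * l_align + w_pert * l_pert
```

With nonnegative weights each term is bounded below by zero, so the total loss is as well; in practice the alignment and perturbation weights would be tuned against reconstruction quality.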