Distribution Matching Variational AutoEncoder

December 8, 2025
Authors: Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang, Liwei Wang, Han Hu
cs.AI

Abstract

Most visual generative models compress images into a latent space before applying diffusion or autoregressive modeling. Yet existing approaches, such as VAEs and foundation-model-aligned encoders, implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduce the Distribution Matching VAE (DMVAE), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other priors. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions strike an excellent balance between reconstruction fidelity and modeling efficiency, reaching a gFID of 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure, achieved via distribution-level alignment rather than reliance on a fixed prior, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is available at https://github.com/sen-ye/dmvae.
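The abstract does not spell out the exact form of the distribution matching constraint, so the following is only a minimal illustrative sketch of the general idea: a reconstruction loss is paired with an RBF-kernel MMD penalty that pulls the encoder's latent distribution toward a reference distribution, here features from a frozen SSL encoder. All names here (rbf_mmd, dmvae_step, ref_encoder, lam, sigma) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def rbf_mmd(x, y, sigma=1.0):
    """Squared MMD with an RBF kernel between two batches of latents.

    x: (N, D) encoder latents; y: (M, D) samples from the reference
    distribution (e.g., frozen SSL features). Smaller values mean the
    two empirical distributions are closer under the kernel.
    """
    def kernel(a, b):
        # Pairwise squared distances -> RBF kernel matrix.
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def dmvae_step(encoder, decoder, ref_encoder, images, lam=0.1):
    """One hypothetical training step: reconstruction + distribution matching."""
    z = encoder(images)                      # (N, D) encoder latents
    recon = decoder(z)
    rec_loss = F.mse_loss(recon, images)
    with torch.no_grad():
        z_ref = ref_encoder(images)          # reference latents, kept frozen
    dm_loss = rbf_mmd(z, z_ref)              # distribution-level alignment term
    return rec_loss + lam * dm_loss
```

Swapping ref_encoder for Gaussian noise samples would recover a VAE-like prior, while an SSL encoder yields the SSL-derived latent distribution the abstract reports as the best trade-off; the actual constraint used in the paper may differ from this MMD stand-in.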