Distribution Matching Variational AutoEncoder
December 8, 2025
Authors: Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang, Liwei Wang, Han Hu
cs.AI
Abstract
Most visual generative models compress images into a latent space before applying diffusion or autoregressive modeling. Yet existing approaches, such as VAEs and foundation-model-aligned encoders, only implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduce Distribution-Matching VAE (DMVAE), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other priors. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching gFID = 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is available at https://github.com/sen-ye/dmvae.
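To make the idea concrete, below is a minimal PyTorch sketch of a training loss that pairs reconstruction with a distribution-level alignment term. The abstract does not specify DMVAE's actual objective, so an RBF-kernel MMD is used here purely as a stand-in for the distribution matching constraint; `dmvae_loss`, `rbf_mmd`, and all parameters are hypothetical illustrations, not the authors' implementation.

```python
# Illustrative sketch only: DMVAE's exact matching constraint is not given
# in the abstract, so an RBF-kernel MMD stands in for "distribution
# matching". All names and hyperparameters here are hypothetical.
import torch
import torch.nn.functional as F

def rbf_mmd(x, y, sigma=1.0):
    """Biased MMD estimate between two batches of latent vectors."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)          # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def dmvae_loss(encoder, decoder, images, reference_latents, lam=1.0):
    """Reconstruction loss plus a term pulling the encoder's latent
    distribution toward an arbitrary reference (e.g., SSL features or
    diffusion noise) instead of a fixed Gaussian prior."""
    z = encoder(images)                        # (B, D) latent codes
    recon = decoder(z)
    rec_loss = F.mse_loss(recon, images)
    match_loss = rbf_mmd(z.flatten(1), reference_latents.flatten(1))
    return rec_loss + lam * match_loss
```

In this reading, swapping `reference_latents` lets one compare different target distributions (Gaussian noise, SSL features, etc.) under the same training setup, which is the kind of systematic comparison the paper describes.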