
Boosting Latent Diffusion Models via Disentangled Representation Alignment

January 9, 2026
Authors: John Page, Xuesong Niu, Kai Wu, Kun Gai
cs.AI

Abstract

Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We advocate that while LDMs benefit from latents retaining high-level semantic concepts, VAEs should excel in semantic disentanglement, encoding attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning by aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents and align them with VFMs, bridging the gap between attribute-level disentanglement and high-level semantics and providing effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks and show a strong correlation with improved generation performance. Finally, using Send-VAE, we train flow-based transformers (SiTs); experiments show that Send-VAE significantly speeds up training and achieves state-of-the-art FID scores of 1.21 and 1.75 with and without classifier-free guidance, respectively, on ImageNet 256×256.
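
To make the alignment mechanism concrete, below is a minimal PyTorch sketch of the kind of non-linear mapper the abstract describes: VAE encoder latents are transformed by a small MLP and pulled toward frozen VFM patch features with a cosine-similarity objective. The class and function names, the three-layer MLP, the DINOv2-style patch tokens assumed as the VFM target, and the negative-cosine loss are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code): a non-linear mapper that transforms
# VAE latents and aligns them with frozen VFM features via cosine similarity.
# The architecture and loss choice are assumptions; the paper only specifies a
# non-linear mapper network aligned with pre-trained VFM representations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentMapper(nn.Module):
    """Non-linear mapper from VAE latent channels to the VFM feature dimension."""

    def __init__(self, latent_dim: int, vfm_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, vfm_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, C, H, W) VAE latent map -> (B, H*W, C) token sequence
        tokens = z.flatten(2).transpose(1, 2)
        return self.net(tokens)


def alignment_loss(vae_latents: torch.Tensor,
                   vfm_features: torch.Tensor,
                   mapper: LatentMapper) -> torch.Tensor:
    """Negative cosine similarity between mapped VAE latents and frozen VFM tokens.

    vae_latents:  (B, C, h, w) from the VAE encoder.
    vfm_features: (B, N, D) patch tokens from a frozen VFM (e.g. DINOv2),
                  assumed to be spatially resized so that N == h * w.
    """
    mapped = F.normalize(mapper(vae_latents), dim=-1)        # (B, h*w, D)
    target = F.normalize(vfm_features.detach(), dim=-1)      # (B, N, D)
    return -(mapped * target).sum(dim=-1).mean()
```

In a full tokenizer training loop, a term like this would presumably be added to the usual VAE reconstruction and KL objectives; the relative weighting is not specified here.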