

Boosting Latent Diffusion Models via Disentangled Representation Alignment

January 9, 2026
Authors: John Page, Xuesong Niu, Kai Wu, Kun Gai
cs.AI

Abstract

Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation-alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We advocate that while LDMs benefit from latents that retain high-level semantic concepts, VAEs should excel at semantic disentanglement, encoding attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning by aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents and align them with VFM features, bridging the gap between attribute-level disentanglement and high-level semantics and providing effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute-prediction tasks and show that it correlates strongly with improved generation performance. Finally, using Send-VAE, we train flow-based transformers (SiTs); experiments show that Send-VAE significantly speeds up training and achieves state-of-the-art FIDs of 1.21 and 1.75 with and without classifier-free guidance, respectively, on ImageNet 256×256.
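The abstract only names the ingredients of the alignment objective (a non-linear mapper network that transforms VAE latents so they can be compared against frozen VFM features), so the following is a minimal PyTorch sketch of how such a term could look. The `LatentMapper` architecture, the cosine-similarity loss, the toy tensor shapes, and the assumption that VFM patch features have been resized to match the latent grid are illustrative guesses, not the paper's actual implementation.

```python
# A minimal sketch of the representation-alignment idea described in the
# abstract, written against plain PyTorch. The mapper architecture, the
# choice of VFM features, and the cosine-similarity alignment loss are
# assumptions for illustration; the paper's actual design may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentMapper(nn.Module):
    """Hypothetical non-linear mapper: VAE latent map -> VFM feature space."""

    def __init__(self, latent_dim: int, vfm_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, vfm_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, C, H, W) latent map -> (B, H*W, C) tokens, then project.
        b, c, h, w = z.shape
        tokens = z.flatten(2).transpose(1, 2)  # (B, H*W, C)
        return self.net(tokens)                # (B, H*W, vfm_dim)


def alignment_loss(mapped: torch.Tensor, vfm_feats: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between mapped latents and VFM features.

    Assumes the frozen VFM's patch features have already been interpolated
    to the latent spatial resolution, so both tensors are (B, N, D).
    """
    mapped = F.normalize(mapped, dim=-1)
    vfm_feats = F.normalize(vfm_feats, dim=-1)
    return 1.0 - (mapped * vfm_feats).sum(dim=-1).mean()


if __name__ == "__main__":
    B, C, H, W, D = 2, 16, 16, 16, 768   # toy shapes: 16x16x16 latent, 768-d VFM features
    mapper = LatentMapper(latent_dim=C, vfm_dim=D)
    z = torch.randn(B, C, H, W)           # stand-in for the VAE encoder output
    vfm = torch.randn(B, H * W, D)         # stand-in for frozen VFM patch features
    loss = alignment_loss(mapper(z), vfm)
    # In training, this term would presumably be added to the usual VAE
    # objective (reconstruction + KL + adversarial), weighted by a hyperparameter.
    print(loss.item())
```

The key design point the abstract emphasizes is that the mapper is non-linear, so the VAE latent itself is not forced to match high-level VFM semantics directly; it only needs to encode attribute-level information from which those semantics are recoverable.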