Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
December 4, 2025
Authors: Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng
cs.AI
Abstract
Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process in which high-level semantic structure emerges slightly earlier than fine-grained texture. This suggests that the earlier-formed semantics may benefit texture generation by serving as a semantic anchor. Recent work has integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet it still denoises the semantic and VAE-encoded texture latents synchronously, neglecting this ordering. Motivated by these observations, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining the texture latent with a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), with up to 100x faster convergence than the original DiT. SFD also improves existing methods such as ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.
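To make the idea of a temporal offset between the two noise schedules concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify the form of the schedules, so the linear shared clock, the clipping, and the names `async_progress`, `schedule`, and `offset` are all assumptions chosen for illustration. Progress 0.0 stands for pure noise and 1.0 for a fully denoised latent.

```python
from typing import List, Tuple


def clip01(x: float) -> float:
    """Clamp a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def async_progress(u: float, offset: float) -> Tuple[float, float]:
    """Map a shared clock value u to (semantic, texture) denoising progress.

    Assumption for illustration: both streams advance linearly with the
    shared clock, and the texture stream starts `offset` later than the
    semantic stream. The actual SFD schedules may differ.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    semantic = clip01(u)          # semantics start denoising immediately
    texture = clip01(u - offset)  # texture lags behind by the offset
    return semantic, texture


def schedule(num_steps: int, offset: float) -> List[Tuple[float, float]]:
    """Return per-step (semantic, texture) progress over the whole run.

    The shared clock runs from 0 to 1 + offset so that the lagging
    texture stream also reaches progress 1.0 at the final step.
    """
    total = 1.0 + offset
    return [async_progress(total * k / num_steps, offset)
            for k in range(num_steps + 1)]


if __name__ == "__main__":
    for step, (sem, tex) in enumerate(schedule(num_steps=4, offset=0.5)):
        print(f"step {step}: semantic={sem:.3f} texture={tex:.3f}")
```

Under these assumptions the semantic latent is always at least as clean as the texture latent at every step, and setting `offset` to 0 recovers the synchronous denoising that the abstract contrasts against.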