Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
December 4, 2025
Authors: Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng
cs.AI
Abstract
Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, in which high-level semantic structure is generated slightly earlier than fine-grained texture. This suggests that the earlier-formed semantics can benefit texture generation by providing a semantic anchor. Recent work has integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet it still denoises the semantic and VAE-encoded texture latents synchronously, neglecting this temporal ordering. Based on this observation, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining the texture latent with a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), and converges up to 100x faster than the original DiT. SFD also improves existing methods such as ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.
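To make the idea of "separate noise schedules with a temporal offset" concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a flow-matching-style time convention (0.0 is pure noise, 1.0 is a clean latent) and a single shared progress variable from which both timesteps are derived. The function name `async_timesteps`, the parameter `offset`, and the clamping rule are all hypothetical choices made for illustration; the abstract does not specify how the offset is applied.

```python
def async_timesteps(num_steps: int, offset: float) -> list[tuple[float, float]]:
    """Return (t_semantic, t_texture) pairs for a hypothetical offset schedule.

    Assumed convention: 0.0 = pure noise, 1.0 = fully denoised.
    The semantic latent runs `offset` ahead of the texture latent.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if not 0.0 <= offset <= 1.0:
        raise ValueError("offset must lie in [0, 1]")

    # The shared progress variable covers [0, 1 + offset] so that the
    # lagging texture latent also reaches 1.0 by the final step.
    total = 1.0 + offset
    pairs = []
    for i in range(num_steps + 1):
        progress = total * i / num_steps
        t_semantic = min(progress, 1.0)
        t_texture = min(max(progress - offset, 0.0), 1.0)
        pairs.append((t_semantic, t_texture))
    return pairs


if __name__ == "__main__":
    for t_sem, t_tex in async_timesteps(num_steps=10, offset=0.25):
        print(f"semantic t={t_sem:.3f}  texture t={t_tex:.3f}")
```

Under these assumptions, the semantic timestep is never behind the texture timestep, so at every step the texture latent is denoised alongside a cleaner semantic latent. Setting `offset=0.0` recovers synchronous denoising, in which both latents share the same timestep.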