

REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

December 18, 2025
Authors: Giorgos Petsangourakis, Christos Sgouropoulos, Bill Psomas, Theodoros Giannakopoulos, Giorgos Sfikas, Ioannis Kakogeorgiou
cs.AI

Abstract

Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256×256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue.
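The external alignment loss mentioned in the abstract regularizes the backbone's internal patch features toward features produced by a frozen VFM. As a rough illustration only (not the paper's implementation — the function names and the exact form of the loss are assumptions), such an objective is commonly realized REPA-style as a mean negative cosine similarity over patch tokens:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(internal_feats, vfm_targets):
    """Mean negative cosine similarity over corresponding patch tokens.

    internal_feats: per-patch features from the diffusion backbone
    (typically passed through a small projection head first).
    vfm_targets: per-patch features from the frozen VFM; no gradients
    flow into the VFM, so it acts purely as a regularization target.
    Minimizing this loss pulls each internal patch feature toward its
    VFM counterpart in direction (cosine), not magnitude.
    """
    sims = [cosine(h, z) for h, z in zip(internal_feats, vfm_targets)]
    return -sum(sims) / len(sims)
```

The loss reaches its minimum of -1.0 when every internal patch feature points in the same direction as its frozen VFM target; perfectly orthogonal features give 0.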