REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion
December 18, 2025
Authors: Giorgos Petsangourakis, Christos Sgouropoulos, Bill Psomas, Theodoros Giannakopoulos, Giorgos Sfikas, Ioannis Kakogeorgiou
cs.AI
Abstract
Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256×256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue.
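To make the core idea concrete, here is a minimal sketch of the two steps the abstract describes: a nonlinear, per-patch compressor that aggregates multi-layer VFM features into a low-dimensional, spatially structured map, and the channel-wise entanglement of that map with the VAE latents into one joint state for the diffusion backbone. This is not the authors' implementation (see the linked repository for that); all shapes, dimensions, and function names below are illustrative assumptions, and the per-position MLP stands in for the paper's lightweight convolutional compressor (a 1×1 convolution is exactly a per-position linear map).

```python
import numpy as np

# Illustrative sketch only: shapes and names are assumptions, not REGLUE's
# actual architecture. A per-patch nonlinear map (equivalent to 1x1 convs)
# compresses concatenated multi-layer VFM features; the result is
# concatenated channel-wise with VAE latents for joint ("entangled")
# denoising by a single diffusion backbone.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def compress_vfm_features(layer_feats, w1, w2):
    """Nonlinearly aggregate multi-layer VFM patch features.

    layer_feats: list of (H, W, D) arrays, one per VFM layer.
    Acts independently at each spatial position, so the output
    keeps the (H, W) patch-grid structure.
    """
    x = np.concatenate(layer_feats, axis=-1)   # (H, W, L*D)
    h = relu(x @ w1)                           # nonlinear hidden layer
    return h @ w2                              # (H, W, d), with d << L*D

# Toy dimensions: a 16x16 patch grid, 3 VFM layers of width 768,
# compressed to d=8 semantic channels; VAE latents have 4 channels.
H = W = 16
L, D, d, hidden, c_vae = 3, 768, 8, 64, 4

layer_feats = [rng.standard_normal((H, W, D)) for _ in range(L)]
w1 = rng.standard_normal((L * D, hidden)) * 0.02
w2 = rng.standard_normal((hidden, d)) * 0.02

sem = compress_vfm_features(layer_feats, w1, w2)      # (16, 16, 8)
vae_latents = rng.standard_normal((H, W, c_vae))      # (16, 16, 4)

# Entanglement: one joint state, so the diffusion model denoises
# image latents and compact semantics together.
joint = np.concatenate([vae_latents, sem], axis=-1)   # (16, 16, 12)
print(joint.shape)
```

The compression keeps the semantics spatially aligned with the VAE latent grid, which is what allows a plain channel-wise concatenation to serve as the joint diffusion state.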