Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
July 11, 2025
Authors: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun
cs.AI
Abstract
Diffusion transformers have emerged as an alternative to U-Net-based
diffusion models for high-fidelity image and video generation, offering
superior scalability. However, their heavy computation remains a major obstacle
to real-world deployment. Existing acceleration methods primarily exploit the
temporal dimension, for example by reusing cached features across diffusion timesteps.
Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free
framework that accelerates inference along the spatial dimension. RALU performs
mixed-resolution sampling across three stages: 1) low-resolution latent
denoising to efficiently capture global semantic structure, 2)
region-adaptive upsampling of specific regions prone to artifacts at
full resolution, and 3) upsampling of all latents to full resolution for detail
refinement. To stabilize generation across resolution transitions, we leverage
noise-timestep rescheduling to adapt the noise level across varying
resolutions. Our method significantly reduces computation while preserving
image quality, achieving up to 7.0× speed-up on FLUX and 3.0×
on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is
complementary to existing temporal acceleration methods such as caching, and
can thus be seamlessly integrated with them to further reduce inference latency without
compromising generation quality.
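
To make the three-stage idea concrete, the following is a minimal sketch of how the number of latent tokens changes across the stages. It is not the authors' implementation. The abstract does not say how regions are selected or which upsampling operator is used, so the gradient-based selector `select_regions`, the nearest-neighbour `upsample_nearest`, the 2x scale factor, and the 25% region fraction are all assumptions chosen for illustration. Noise-timestep rescheduling is not modelled.

```python
import numpy as np


def upsample_nearest(latent, factor=2):
    """Nearest-neighbour upsampling of a (C, H, W) latent (assumed operator)."""
    return latent.repeat(factor, axis=1).repeat(factor, axis=2)


def select_regions(latent, fraction):
    """Hypothetical selector: flag the low-res positions whose local gradient
    magnitude is largest, as a stand-in for 'regions prone to artifacts'."""
    energy = np.abs(latent).sum(axis=0)           # (H, W) per-position energy
    gy, gx = np.gradient(energy)                  # gradients along rows, columns
    score = np.hypot(gx, gy)
    k = max(1, int(round(fraction * score.size)))
    threshold = np.partition(score.ravel(), -k)[-k]  # k-th largest score
    return score >= threshold                     # boolean (H, W) mask


def token_counts(h, w, factor, mask):
    """Tokens per stage, assuming one token per latent position."""
    low = h * w                                   # stage 1: all low resolution
    n_sel = int(mask.sum())
    mixed = (low - n_sel) + n_sel * factor ** 2   # stage 2: selected regions upsampled
    full = low * factor ** 2                      # stage 3: everything at full resolution
    return low, mixed, full


rng = np.random.default_rng(0)
low_res_latent = rng.standard_normal((4, 16, 16))  # toy (C, H, W) latent
mask = select_regions(low_res_latent, fraction=0.25)
full_res_latent = upsample_nearest(low_res_latent, factor=2)
low, mixed, full = token_counts(16, 16, 2, mask)
print(low, mixed, full, full_res_latent.shape)
```

Under these assumptions, stage 1 processes a quarter of the full-resolution tokens and stage 2 falls in between, which is where a spatial speed-up of this kind would come from.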