Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers

Abstract

Diffusion transformers have emerged as an alternative to U-Net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computational cost remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, for example by reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling in three stages: 1) denoising latent diffusion at low resolution to efficiently capture global semantic structure, 2) region-adaptive upsampling of the specific regions that are prone to artifacts at full resolution, and 3) upsampling of all latents to full resolution for detail refinement. To stabilize generation across resolution transitions, we use noise-timestep rescheduling to adapt the noise level to each resolution. Our method significantly reduces computation while preserving image quality, achieving up to a 7.0× speed-up on FLUX and a 3.0× speed-up on Stable Diffusion 3 with minimal quality degradation. Furthermore, RALU is complementary to existing temporal acceleration methods such as caching, and can therefore be seamlessly integrated with them to further reduce inference latency without compromising generation quality.
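To give a feel for why a three-stage, mixed-resolution schedule can reduce computation, the following back-of-the-envelope sketch counts latent tokens per stage. It is not the authors' implementation. The 2× downsampling factor, the fraction of regions upsampled in stage 2, the per-stage step counts, and the function names `stage_token_counts` and `relative_token_steps` are all illustrative assumptions that do not come from the abstract. The "token-steps" figure is only a crude proxy: transformer attention cost grows faster than linearly in the number of tokens, so this does not reproduce the reported speed-ups.

```python
def stage_token_counts(full_h, full_w, upsample_ratio, factor=2):
    """Return (low, mixed, full) latent token counts for a three-stage schedule.

    Assumptions (hypothetical, not from the source):
      - stage 1 runs on a grid downsampled by `factor` in each dimension;
      - stage 2 upsamples a fraction `upsample_ratio` of the low-res tokens,
        each becoming factor * factor full-resolution tokens;
      - stage 3 runs on the full grid.
    """
    if full_h % factor or full_w % factor:
        raise ValueError("full resolution must be divisible by factor")
    if not 0.0 <= upsample_ratio <= 1.0:
        raise ValueError("upsample_ratio must be in [0, 1]")

    low_tokens = (full_h // factor) * (full_w // factor)
    selected = round(low_tokens * upsample_ratio)
    # Unselected tokens stay at low resolution; selected ones are expanded.
    mixed_tokens = (low_tokens - selected) + selected * factor * factor
    full_tokens = full_h * full_w
    return low_tokens, mixed_tokens, full_tokens


def relative_token_steps(counts, steps):
    """Token-steps used, relative to running every step at full resolution."""
    full_tokens = counts[-1]
    used = sum(c * s for c, s in zip(counts, steps))
    baseline = full_tokens * sum(steps)
    return used / baseline


if __name__ == "__main__":
    counts = stage_token_counts(64, 64, upsample_ratio=0.25)
    print(counts)                                      # (1024, 1792, 4096)
    print(relative_token_steps(counts, (10, 10, 10)))  # 0.5625
```

With these made-up settings, the schedule processes about 56% of the token-steps of an all-full-resolution run. Spending more steps in the early low-resolution stage or upsampling fewer regions in stage 2 lowers that figure further.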
