Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
July 11, 2025
作者: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun
cs.AI
Abstract
Diffusion transformers have emerged as an alternative to U-net-based
diffusion models for high-fidelity image and video generation, offering
superior scalability. However, their heavy computation remains a major obstacle
to real-world deployment. Existing acceleration methods primarily exploit the
temporal dimension, such as reusing cached features across diffusion timesteps.
Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free
framework that accelerates inference along the spatial dimension. RALU performs
mixed-resolution sampling across three stages: 1) low-resolution latent
denoising to efficiently capture the global semantic structure, 2)
region-adaptive upsampling of specific regions prone to artifacts at
full resolution, and 3) upsampling of all latents to full resolution for detail
refinement. To stabilize generation across resolution transitions, we leverage
noise-timestep rescheduling to adapt the noise level to the varying
resolutions. Our method significantly reduces computation while preserving
image quality, achieving up to 7.0× speed-up on FLUX and 3.0× on
Stable Diffusion 3 with minimal degradation. Furthermore, RALU is
complementary to existing temporal acceleration methods such as caching, and
can thus be seamlessly integrated to further reduce inference latency without
compromising generation quality.
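The three-stage schedule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `denoise_step`, the per-stage step counts, the nearest-neighbor upsampling, and the row-wise artifact score used to pick regions for stage 2 are all hypothetical stand-ins (the actual method also reschedules noise timesteps at each resolution transition, which is only noted in comments here).

```python
import numpy as np

def ralu_sample(denoise_step, latent_lr, upsample_factor=2,
                stages=(10, 5, 5), artifact_fraction=0.25):
    """Sketch of RALU-style three-stage mixed-resolution sampling.

    `denoise_step(latent, t)` is a hypothetical single denoising step;
    `stages` gives the number of steps spent in each of the three stages.
    """
    t = 0
    x = latent_lr
    # Stage 1: low-resolution denoising to capture global structure cheaply.
    for _ in range(stages[0]):
        x = denoise_step(x, t)
        t += 1
    # Upsample all latents to full resolution (nearest-neighbor here).
    # The real method would also reschedule the noise timestep at this
    # transition to match the noise level of the new resolution.
    x_full = np.kron(x, np.ones((upsample_factor, upsample_factor)))
    # Stage 2: region-adaptive refinement — denoise only artifact-prone
    # regions at full resolution. A simple magnitude score over rows is
    # used as a stand-in for a real artifact criterion.
    score = np.abs(x_full).mean(axis=1)
    k = max(1, int(artifact_fraction * x_full.shape[0]))
    rows = np.argsort(score)[-k:]
    for _ in range(stages[1]):
        x_full[rows] = denoise_step(x_full[rows], t)
        t += 1
    # Stage 3: full-resolution denoising of all latents for final detail.
    for _ in range(stages[2]):
        x_full = denoise_step(x_full, t)
        t += 1
    return x_full
```

With a dummy step function such as `lambda x, t: 0.9 * x`, an 8×8 low-resolution latent yields a 16×16 output, illustrating how most steps avoid full-resolution computation.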