Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
July 11, 2025
作者: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun
cs.AI
Abstract
Diffusion transformers have emerged as an alternative to U-net-based
diffusion models for high-fidelity image and video generation, offering
superior scalability. However, their heavy computation remains a major obstacle
to real-world deployment. Existing acceleration methods primarily exploit the
temporal dimension, such as reusing cached features across diffusion timesteps.
Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free
framework that accelerates inference along the spatial dimension. RALU performs
mixed-resolution sampling across three stages: 1) low-resolution latent
denoising to efficiently capture the global semantic structure, 2)
region-adaptive upsampling of specific regions prone to artifacts at
full resolution, and 3) upsampling of all latents to full resolution for detail
refinement. To stabilize generation across resolution transitions, we leverage
noise-timestep rescheduling to adapt the noise level to the varying
resolutions. Our method significantly reduces computation while preserving
image quality, achieving up to 7.0× speed-up on FLUX and 3.0× on
Stable Diffusion 3 with minimal degradation. Furthermore, RALU is
complementary to existing temporal acceleration methods such as caching, and
can thus be seamlessly integrated to further reduce inference latency without
compromising generation quality.
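The three-stage schedule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `denoise_step`, the per-stage step counts, the nearest-neighbor upsampling, and the row-wise artifact score used to pick regions for stage 2 are all hypothetical stand-ins (the actual method also reschedules noise timesteps at each resolution transition, which is only noted in comments here).

```python
import numpy as np

def ralu_sample(denoise_step, latent_lr, upsample_factor=2,
                stages=(10, 5, 5), artifact_fraction=0.25):
    """Sketch of RALU-style three-stage mixed-resolution sampling.

    `denoise_step(latent, t)` is a hypothetical single denoising step;
    `stages` gives the number of steps spent in each of the three stages.
    """
    t = 0
    x = latent_lr
    # Stage 1: low-resolution denoising to capture global structure cheaply.
    for _ in range(stages[0]):
        x = denoise_step(x, t)
        t += 1
    # Upsample all latents to full resolution (nearest-neighbor here).
    # The real method would also reschedule the noise timestep at this
    # transition to match the noise level of the new resolution.
    x_full = np.kron(x, np.ones((upsample_factor, upsample_factor)))
    # Stage 2: region-adaptive refinement — denoise only artifact-prone
    # regions at full resolution. A simple magnitude score over rows is
    # used as a stand-in for a real artifact criterion.
    score = np.abs(x_full).mean(axis=1)
    k = max(1, int(artifact_fraction * x_full.shape[0]))
    rows = np.argsort(score)[-k:]
    for _ in range(stages[1]):
        x_full[rows] = denoise_step(x_full[rows], t)
        t += 1
    # Stage 3: full-resolution denoising of all latents for final detail.
    for _ in range(stages[2]):
        x_full = denoise_step(x_full, t)
        t += 1
    return x_full
```

With a dummy step function such as `lambda x, t: 0.9 * x`, an 8×8 low-resolution latent yields a 16×16 output, illustrating how most steps avoid full-resolution computation.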