Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers

July 11, 2025
Authors: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun
cs.AI

Abstract

Diffusion transformers have emerged as an alternative to U-Net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, for example by reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling in three stages: 1) low-resolution latent denoising to efficiently capture global semantic structure, 2) region-adaptive upsampling of specific regions that are prone to artifacts at full resolution, and 3) upsampling of all latents to full resolution for detail refinement. To stabilize generation across resolution transitions, we use noise-timestep rescheduling to adapt the noise level to each resolution. Our method significantly reduces computation while preserving image quality, achieving up to 7.0× speed-up on FLUX and 3.0× on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal acceleration methods such as caching, and can therefore be integrated with them to further reduce inference latency without compromising generation quality.
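
To make the three-stage, mixed-resolution idea concrete, here is a minimal NumPy sketch of how the token count might change from stage to stage. It is an illustration under stated assumptions, not the authors' implementation: the 2× resolution ratio, the gradient-based region score, and the names `select_regions` and `tokens_per_stage` are all hypothetical, since the abstract does not specify them. Denoising itself and noise-timestep rescheduling are omitted.

```python
import numpy as np

def select_regions(latent, keep_ratio):
    """Return a boolean (H, W) mask over low-resolution latent positions.

    Assumption: positions are ranked by summed absolute spatial gradient,
    a hypothetical stand-in for "prone to artifacts"; the abstract does not
    say how regions are chosen.
    """
    gy, gx = np.gradient(latent, axis=(1, 2))          # latent is (C, H, W)
    score = np.abs(gy).sum(axis=0) + np.abs(gx).sum(axis=0)
    k = max(1, int(round(keep_ratio * score.size)))
    top = np.argsort(score.ravel())[-k:]                # indices of the k highest scores
    mask = np.zeros(score.size, dtype=bool)
    mask[top] = True
    return mask.reshape(score.shape)

def tokens_per_stage(mask):
    """Count tokens per stage, assuming full resolution is 2x the low one."""
    n_low = mask.size
    n_sel = int(mask.sum())
    stage1 = n_low                          # every position at low resolution
    stage2 = (n_low - n_sel) + 4 * n_sel    # selected positions become 2x2 tokens
    stage3 = 4 * n_low                      # every position at full resolution
    return stage1, stage2, stage3

rng = np.random.default_rng(0)
latent = rng.standard_normal((16, 32, 32))  # toy low-resolution latent
mask = select_regions(latent, keep_ratio=0.25)
print(tokens_per_stage(mask))               # (1024, 1792, 4096)
```

In this toy setting, only the final stage pays for all 4096 full-resolution tokens; the earlier stages run on 1024 and 1792 tokens. Since transformer cost grows with token count, this is where spatial savings of the kind described above would come from.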