HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration
October 2, 2024
Authors: Yushi Huang, Zining Wang, Ruihao Gong, Jing Liu, Xinjie Zhang, Jinyang Guo, Xianglong Liu, Jun Zhang
cs.AI
Abstract
Diffusion Transformers (DiTs) have gained prominence for outstanding
scalability and extraordinary performance in generative tasks. However, their
considerable inference costs impede practical deployment. The feature cache
mechanism, which involves storing and retrieving redundant computations across
timesteps, holds promise for reducing per-step inference time in diffusion
models. Most existing caching methods for DiT are manually designed. Although
the learning-based approach attempts to optimize strategies adaptively, it
suffers from discrepancies between training and inference, which hampers both
the performance and acceleration ratio. Upon detailed analysis, we pinpoint
that these discrepancies primarily stem from two aspects: (1) Prior Timestep
Disregard, where training ignores the effect of cache usage at earlier
timesteps, and (2) Objective Mismatch, where the training target (aligning the
predicted noise at each timestep) deviates from the goal of inference
(generating a high-quality image). To alleviate these discrepancies, we propose
HarmoniCa, a novel method that Harmonizes training and inference with a novel
learning-based Caching framework built upon Step-Wise Denoising Training (SDT)
and Image Error Proxy-Guided Objective (IEPO). Compared to the traditional
training paradigm, the newly proposed SDT maintains the continuity of the
denoising process, enabling the model to leverage information from prior
timesteps during training, similar to the way it operates during inference.
Furthermore, we design IEPO, which integrates an efficient proxy mechanism to
approximate the final image error caused by reusing the cached feature.
Therefore, IEPO helps balance final image quality and cache utilization,
resolving the issue that training considers only the impact of cache usage
on the predicted output at each timestep.
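The feature cache mechanism summarized above, storing a block's output at one timestep and retrieving it at later ones to skip recomputation, can be illustrated with a minimal toy sketch. This is not the authors' implementation: `expensive_block`, the update rule, and the hand-written cache schedule are all hypothetical stand-ins (HarmoniCa learns the caching strategy rather than fixing it by hand).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1  # toy weights for the stand-in block

def expensive_block(x):
    # Stand-in for a costly DiT transformer block (hypothetical).
    return np.tanh(x @ W)

def denoise_with_cache(x, schedule):
    """Run len(schedule) denoising steps; when schedule[t] is True,
    reuse the cached block output instead of recomputing it."""
    cache = None
    compute_calls = 0
    for reuse in schedule:
        if reuse and cache is not None:
            feat = cache               # retrieve stored computation
        else:
            feat = expensive_block(x)  # full forward pass
            cache = feat               # store for later timesteps
            compute_calls += 1
        x = x - 0.1 * feat             # toy denoising update
    return x, compute_calls

x0 = rng.standard_normal((1, 8))
# Reuse the cache on every odd step: 10 steps, only 5 full evaluations.
schedule = [t % 2 == 1 for t in range(10)]
x_out, calls = denoise_with_cache(x0, schedule)
print(calls)  # 5
```

The schedule here halves the number of block evaluations at the cost of stale features on reused steps; the paper's point is that which steps tolerate reuse is best decided by a learned policy trained consistently with how inference actually runs.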