RhymeFlow: Training-Free Video Generation Acceleration via Asynchronous Denoising Flow Scheduling

Abstract

Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational cost due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce the computational complexity within each individual denoising step through techniques such as sparse attention and KV caching. However, they rigidly adhere to an inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must undergo a complete, dense denoising process across all diffusion timesteps. We observe that, because adjacent frames share corresponding content and motion, once keyframes with critical semantic transitions are anchored, the intermediate states of the remaining frames often follow more predictable trajectories. This indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to reduce computational cost. Because the skipped intermediate states of non-keyframes break temporal coherence during the keyframe denoising steps and degrade visual quality, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines in both inference speed and visual quality.
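
The idea of decoupled per-frame denoising trajectories can be illustrated with a small scheduling sketch. This is not the RhymeFlow algorithm: the abstract does not specify how keyframes are selected, how the skipping is scheduled, or how the latent trajectory projection module works. The function name build_step_mask, the hand-picked keyframe indices, and the rule "a frame at distance d from its nearest keyframe is denoised every d + 1 steps" are all assumptions made purely for illustration.

```python
def build_step_mask(num_frames, num_steps, keyframes):
    """Return mask[step][frame]; True means the frame is denoised at that step.

    Hypothetical rule (an assumption, not taken from the paper):
    - keyframes are denoised at every step (dense trajectory);
    - a non-keyframe at distance d from its nearest keyframe is denoised
      only every (d + 1) steps, so frames farther from a keyframe skip more;
    - every frame is denoised at the final step so all frames end clean.
    """
    keys = sorted(set(keyframes))
    if not keys:
        raise ValueError("at least one keyframe is required")
    mask = []
    for step in range(num_steps):
        is_last = step == num_steps - 1
        row = []
        for frame in range(num_frames):
            dist = min(abs(frame - k) for k in keys)  # 0 for keyframes
            stride = dist + 1                         # 1 means every step
            row.append(step % stride == 0 or is_last)
        mask.append(row)
    return mask


# Toy setting: 9 frames, 12 denoising steps, keyframes chosen by hand.
mask = build_step_mask(num_frames=9, num_steps=12, keyframes=[0, 4, 8])
evaluated = sum(sum(row) for row in mask)  # frame-step pairs actually denoised
dense = 9 * 12                             # pairs in the standard dense pipeline
print(f"frame-step evaluations: {evaluated} of {dense}")
```

The count of frame-step evaluations only shows how sparse such a schedule is compared with the dense pipeline. It is not a measure of actual speedup, which also depends on the attention cost and on whatever is used to fill in the skipped intermediate states.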