

Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile

February 10, 2025
作者: Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang
cs.AI

Abstract

Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. For example, the popular Open-Sora-Plan model takes more than 9 minutes to generate a single 29-frame video. This paper addresses the inefficiency from two aspects: 1) pruning the 3D full attention based on the redundancy within video data: we identify a prevalent tile-style repetitive pattern in the 3D attention maps of video data and advocate a new family of sparse 3D attention with linear complexity w.r.t. the number of video frames; 2) shortening the sampling process by adopting existing multi-step consistency distillation: we split the entire sampling trajectory into several segments and perform consistency distillation within each one to activate few-step generation capacities. We further devise a three-stage training pipeline that conjoins the low-complexity attention and few-step generation capacities. Notably, with 0.1% of the pretraining data, we turn the Open-Sora-Plan-1.2 model into an efficient one that is 7.4x to 7.8x faster for 29- and 93-frame 720p video generation, with only a marginal performance trade-off on VBench. In addition, we demonstrate that our approach is amenable to distributed inference, achieving an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.
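The sparse 3D attention described above can be pictured as a block-sparse mask over frame tokens: each frame's tokens attend to their own frame plus a bounded set of other frames, so the number of attended entries grows linearly with the frame count instead of quadratically. The following is a minimal illustrative sketch of such a mask, assuming a simple fixed-window frame pattern; the paper's actual tile pattern and implementation may differ.

```python
import numpy as np

def tile_sparse_mask(num_frames: int, tokens_per_frame: int, keep_frames: int = 1):
    """Build a boolean attention mask where each frame's tokens attend only to
    their own frame and `keep_frames` neighboring frames on each side.

    Illustrative sketch of a frame-wise block-sparse pattern; the window-based
    neighbor rule here is an assumption, not the paper's exact tile layout.
    """
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for i in range(num_frames):
        lo = max(0, i - keep_frames)
        hi = min(num_frames, i + keep_frames + 1)
        rows = slice(i * tokens_per_frame, (i + 1) * tokens_per_frame)
        cols = slice(lo * tokens_per_frame, hi * tokens_per_frame)
        mask[rows, cols] = True  # attended (frame-pair) blocks
    return mask

mask = tile_sparse_mask(num_frames=4, tokens_per_frame=2, keep_frames=1)
```

Per row, the number of `True` entries is bounded by `(2 * keep_frames + 1) * tokens_per_frame` regardless of `num_frames`, which is what makes the total attention cost linear in the number of frames.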
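The multi-step consistency distillation step splits the sampling trajectory into segments and distills within each one. A minimal sketch of the segment-splitting idea is below; the function name and even-partition rule are illustrative assumptions, not the paper's API.

```python
def split_trajectory(timesteps, num_segments):
    """Partition a diffusion timestep schedule into contiguous segments.

    Consistency distillation is then performed within each segment, so the
    student learns to jump to a segment boundary in one step (illustrative
    sketch; names and the even-split rule are assumptions).
    """
    n = len(timesteps)
    size, rem = divmod(n, num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        segments.append(timesteps[start:end])
        start = end
    return segments

segments = split_trajectory(list(range(100)), 4)
# 4 contiguous segments whose concatenation recovers the full schedule
```

With few-step generation, one denoising step per segment suffices, so `num_segments` directly controls the number of sampling steps at inference time.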

