
Fast Video Generation with Sliding Tile Attention

February 6, 2025
Authors: Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang
cs.AI

Abstract

Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy in full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Specifically, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation and without any training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.
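The core idea, attending tile-by-tile over a local 3D window instead of token-by-token over the full sequence, can be sketched as a tile-level mask builder. This is a toy NumPy illustration of the masking pattern only, not the paper's fused kernel; the function name, tile sizes, and window sizes below are illustrative assumptions.

```python
import numpy as np

def sliding_tile_mask(grid, tile, window):
    """Build a tile-level attention mask for a 3D token grid.

    grid:   (T, H, W) token-grid shape, each axis divisible by the tile size
    tile:   (t, h, w) tile shape
    window: (wt, wh, ww) sliding-window extent in tiles (odd, centered)
    Returns a boolean matrix over tiles where mask[q, k] is True when
    key tile k lies inside the window centered on query tile q.
    """
    n_tiles = tuple(g // s for g, s in zip(grid, tile))  # tiles per axis
    # 3D integer coordinates of every tile, flattened to (num_tiles, 3)
    coords = np.stack(
        np.meshgrid(*[np.arange(n) for n in n_tiles], indexing="ij"),
        axis=-1,
    ).reshape(-1, 3)
    half = np.array(window) // 2
    # A key tile is attended iff |q - k| <= half along every axis
    diff = np.abs(coords[:, None, :] - coords[None, :, :])
    return (diff <= half).all(axis=-1)

# A 4x8x8 token grid with 2x2x2 tiles gives 2x4x4 = 32 tiles;
# a 3x3x3 tile window keeps only neighboring tiles per query.
mask = sliding_tile_mask((4, 8, 8), (2, 2, 2), (3, 3, 3))
print(mask.shape, mask.mean())  # mask density well below 1.0 (full attention)
```

Because the mask is constant over whole tiles, every (query-tile, key-tile) pair is either fully dense or fully skipped, which is what lets a blockwise kernel such as FlashAttention skip masked-out blocks without the ragged per-token boundaries that make token-wise SWA hardware-unfriendly.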

