
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

September 29, 2025
Authors: Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie
cs.AI

Abstract

We introduce SANA-Video, a small diffusion model that efficiently generates videos at resolutions up to 720x1280 and durations up to a minute. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at a remarkably fast speed, and it is deployable on an RTX 5090 GPU. Two core designs make this efficient, effective, long video generation possible: (1) Linear DiT: we use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-memory KV cache for block linear attention: we design a block-wise autoregressive approach for long video generation that employs a constant-memory state derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Despite its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, reducing the time to generate a 5-second 720p video from 71 s to 29 s (a 2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
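
The constant-memory state described in design (2) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a generic kernelized linear attention with a ReLU-style feature map and a sum normalizer, neither of which the abstract specifies, and every function and variable name here is hypothetical. The point it shows is that each block can attend to itself and to all earlier blocks using only a running sum of size d x d_v plus a vector of size d, so memory does not grow with the number of blocks already generated.

```python
import numpy as np


def feature_map(x):
    # Assumed feature map (not stated in the abstract): ReLU plus a small
    # offset so that the normalizer below is always strictly positive.
    return np.maximum(x, 0.0) + 1e-6


def linear_attention(q, k, v):
    """Non-causal linear attention over all N tokens in O(N * d * d_v)."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v          # (d, d_v): sum over tokens of phi(k_j)^T v_j
    z = phi_k.sum(axis=0)     # (d,):     sum over tokens of phi(k_j)
    return (phi_q @ kv) / (phi_q @ z)[:, None]


def blockwise_linear_attention(q, k, v, block_size):
    """Block-causal linear attention with a fixed-size running state.

    Tokens in a block attend to their own block and to every earlier block.
    The history is summarized by kv_state and z_state, whose shapes do not
    depend on how many blocks have been processed.
    """
    n, d = q.shape
    d_v = v.shape[1]
    kv_state = np.zeros((d, d_v))
    z_state = np.zeros(d)
    outputs = []
    for start in range(0, n, block_size):
        block = slice(start, start + block_size)
        phi_q, phi_k = feature_map(q[block]), feature_map(k[block])
        # Fold the current block into the cumulative state, then read from it.
        kv_state += phi_k.T @ v[block]
        z_state += phi_k.sum(axis=0)
        outputs.append((phi_q @ kv_state) / (phi_q @ z_state)[:, None])
    return np.concatenate(outputs, axis=0), kv_state, z_state
```

For example, calling `blockwise_linear_attention(q, k, v, block_size=4)` on 12 tokens with `d = 4` and `d_v = 3` returns a `(12, 3)` output while the carried state stays at `(4, 3)` and `(4,)`. A conventional KV cache would instead store the keys and values of all previous tokens, so its size would grow with video length.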