SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

September 29, 2025
Authors: Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie
cs.AI

Abstract

We introduce SANA-Video, a small diffusion model that efficiently generates videos at up to 720x1280 resolution and up to a minute in length. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at a remarkably fast speed, and it is deployable on an RTX 5090 GPU. Two core designs enable efficient, effective, long video generation: (1) Linear DiT: we use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-memory KV cache for block linear attention: we design a block-wise autoregressive approach for long video generation that employs a constant-memory state derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. At this low cost, SANA-Video achieves performance competitive with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, reducing the time to generate a 5-second 720p video from 71s to 29s (a 2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
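The constant-memory idea in design (2) can be illustrated with a small sketch. This is not the authors' implementation: the function name `block_linear_attention`, the ReLU-based feature map `phi`, and the plain-list data layout are all assumptions chosen for illustration. It only shows why linear attention lets a block-wise generator carry a fixed-size state (a running sum over keys and values) in place of a cache that grows with every token.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def phi(x: Vector) -> Vector:
    """Feature map; ReLU plus a small offset is an assumption of this sketch."""
    return [max(value, 0.0) + 1e-6 for value in x]


def block_linear_attention(
    q: Matrix, k: Matrix, v: Matrix, block_size: int
) -> Tuple[Matrix, Matrix, Vector]:
    """Process tokens block by block while carrying a fixed-size state.

    state_kv (d_k x d_v) accumulates sum_j phi(k_j) v_j^T and state_k (d_k)
    accumulates sum_j phi(k_j). Neither grows with the number of tokens.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state_kv = [[0.0] * d_v for _ in range(d_k)]
    state_k = [0.0] * d_k
    outputs: Matrix = []
    for start in range(0, len(q), block_size):
        end = min(start + block_size, len(q))
        # Fold the current block's keys and values into the running state.
        for j in range(start, end):
            fk = phi(k[j])
            for a in range(d_k):
                state_k[a] += fk[a]
                for b in range(d_v):
                    state_kv[a][b] += fk[a] * v[j][b]
        # Each query reads the state: all earlier blocks plus its own block.
        for i in range(start, end):
            fq = phi(q[i])
            denom = sum(fq[a] * state_k[a] for a in range(d_k))
            outputs.append([
                sum(fq[a] * state_kv[a][b] for a in range(d_k)) / denom
                for b in range(d_v)
            ])
    return outputs, state_kv, state_k
```

In this sketch the state holds `d_k * d_v + d_k` numbers no matter how many blocks have been processed, whereas a conventional softmax-attention KV cache stores every past key and value and therefore grows linearly with video length.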