SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Abstract

We introduce SANA-Video, a small diffusion model that efficiently generates videos at resolutions up to 720x1280 and durations up to one minute. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at a remarkably fast speed, and it can be deployed on an RTX 5090 GPU. Two core designs make this efficient, effective, long video generation possible: (1) Linear DiT: we use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-memory KV cache for block linear attention: we design a block-wise autoregressive approach for long video generation that relies on a constant-memory state derived from the cumulative properties of linear attention. This KV cache gives the Linear DiT global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Despite its low cost, SANA-Video achieves performance competitive with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, cutting the time to generate a 5-second 720p video from 71s to 29s (a 2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
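
To illustrate why a linear-attention state can stand in for a growing KV cache, the following is a minimal pure-Python sketch, not the SANA-Video implementation. It assumes a single attention head, a ReLU-style feature map `phi`, full attention inside each block, and tiny dimensions; every function name here is hypothetical. The state holds only the running sums of `phi(k) v^T` and `phi(k)`, so its size depends on the head dimensions and not on how many blocks have been generated.

```python
# Sketch of block-wise linear attention with a fixed-size state.
# Assumptions (not from the source): ReLU feature map, one head,
# no positional encoding, full attention inside each block.

def phi(x):
    """Hypothetical feature map: element-wise ReLU plus a small epsilon."""
    return [max(v, 0.0) + 1e-6 for v in x]


def new_state(d_k, d_v):
    """State (S, z): S accumulates phi(k) v^T (d_k x d_v), z accumulates phi(k)."""
    return [[0.0] * d_v for _ in range(d_k)], [0.0] * d_k


def update_state(state, keys, values):
    """Fold one block of keys/values into the state; its shape never changes."""
    S, z = state
    for k, v in zip(keys, values):
        fk = phi(k)
        for i, fki in enumerate(fk):
            z[i] += fki
            for j, vj in enumerate(v):
                S[i][j] += fki * vj
    return S, z


def read_state(state, queries):
    """Output per query: phi(q)^T S divided by phi(q)^T z."""
    S, z = state
    d_v = len(S[0])
    outputs = []
    for q in queries:
        fq = phi(q)
        denom = sum(a * b for a, b in zip(fq, z))
        outputs.append([sum(fq[i] * S[i][j] for i in range(len(fq))) / denom
                        for j in range(d_v)])
    return outputs


def blockwise_attention(blocks, d_k, d_v):
    """Process (Q, K, V) blocks in order, carrying only the fixed-size state."""
    state = new_state(d_k, d_v)
    outputs = []
    for q, k, v in blocks:
        state = update_state(state, k, v)  # current block joins the context
        outputs.append(read_state(state, q))
    return outputs


def reference_attention(blocks):
    """Same result computed directly from every key/value seen so far."""
    outputs, keys, values = [], [], []
    for q, k, v in blocks:
        keys, values = keys + k, values + v  # this list grows with video length
        block_out = []
        for qi in q:
            fq = phi(qi)
            w = [sum(a * b for a, b in zip(fq, phi(kj))) for kj in keys]
            total = sum(w)
            block_out.append([sum(wi * vj[c] for wi, vj in zip(w, values)) / total
                              for c in range(len(values[0]))])
        outputs.append(block_out)
    return outputs
```

Calling `blockwise_attention(blocks, d_k, d_v)` on a list of `(queries, keys, values)` blocks should match `reference_attention(blocks)` up to floating-point error. The difference is that the reference keeps every past key and value, while the block-wise version keeps only a `d_k x d_v` matrix and a length-`d_k` vector.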
