SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Abstract

We introduce SANA-Video, a small diffusion model that efficiently generates videos at up to 720x1280 resolution and up to a minute in length. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment, at a speed fast enough to deploy on an RTX 5090 GPU. Two core designs make efficient, effective, long video generation possible: (1) Linear DiT: we use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-memory KV cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation that uses a constant-memory state derived from the cumulative property of linear attention. This KV cache gives the Linear DiT global context at a fixed memory cost, removing the need for a traditional KV cache and enabling efficient minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Despite its low cost, SANA-Video achieves competitive performance compared with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, cutting the time to generate a 5-second 720p video from 71s to 29s (a 2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
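The constant-memory state in design (2) can be illustrated with a small sketch. It is not SANA-Video's implementation; the abstract gives no code, so the details are assumptions. These include the ReLU-style feature map `phi`, the block-causal layout in which tokens in a block see their own block and all earlier ones, and every function name. The sketch only shows why the cumulative form of linear attention lets earlier blocks be summarized by a fixed-size pair `(S, z)` rather than a cache that grows with video length. It uses plain Python lists so that it has no dependencies.

```python
import random

def phi(x):
    # Assumed feature map: ReLU plus a small offset so the normalizer stays positive.
    return [max(v, 0.0) + 1e-6 for v in x]

def new_state(d_k, d_v):
    # S is a d_k x d_v matrix and z a d_k vector; neither grows with sequence length.
    return [[0.0] * d_v for _ in range(d_k)], [0.0] * d_k

def update_state(state, keys, values):
    # Accumulate S += phi(k) v^T and z += phi(k) for every token in the block.
    S, z = state
    for k, v in zip(keys, values):
        for a, fka in enumerate(phi(k)):
            z[a] += fka
            for b, vb in enumerate(v):
                S[a][b] += fka * vb

def read(state, queries):
    # Output for each query: (phi(q)^T S) / (phi(q)^T z).
    S, z = state
    d_v = len(S[0])
    out = []
    for q in queries:
        fq = phi(q)
        denom = sum(fa * za for fa, za in zip(fq, z))
        out.append([sum(fq[a] * S[a][b] for a in range(len(fq))) / denom
                    for b in range(d_v)])
    return out

def block_linear_attention(blocks, d_k, d_v):
    # Process blocks in order; only the fixed-size state is carried between blocks.
    state = new_state(d_k, d_v)
    outputs = []
    for queries, keys, values in blocks:
        update_state(state, keys, values)
        outputs.extend(read(state, queries))
    return outputs, state

def reference_attention(blocks):
    # Quadratic reference that keeps every past key and value (a growing cache).
    outputs, seen_k, seen_v = [], [], []
    for queries, keys, values in blocks:
        seen_k.extend(keys)
        seen_v.extend(values)
        d_v = len(seen_v[0])
        for q in queries:
            fq = phi(q)
            w = [sum(a * b for a, b in zip(fq, phi(k))) for k in seen_k]
            total = sum(w)
            outputs.append([sum(wi * v[b] for wi, v in zip(w, seen_v)) / total
                            for b in range(d_v)])
    return outputs

def make_blocks(num_blocks, tokens, d_k, d_v, seed=0):
    # Random toy data: each block is a (queries, keys, values) triple.
    rng = random.Random(seed)
    vec = lambda d: [rng.uniform(-1.0, 1.0) for _ in range(d)]
    return [([vec(d_k) for _ in range(tokens)],
             [vec(d_k) for _ in range(tokens)],
             [vec(d_v) for _ in range(tokens)]) for _ in range(num_blocks)]
```

Calling `block_linear_attention(make_blocks(4, 3, 5, 2), 5, 2)` should agree with `reference_attention` on the same blocks up to floating-point rounding. The carried state stays at `5 x 2` plus `5` numbers regardless of how many blocks are processed, which is the property that makes long, block-by-block generation possible at a fixed memory cost.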