SANA-Streaming: Real-Time Streaming Video Editing with a Hybrid Diffusion Transformer

Abstract

Real-time streaming video-to-video (V2V) editing is critical for interactive applications such as live broadcasting and gaming, yet it remains challenging because of stringent requirements on temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-design framework for high-resolution, real-time streaming video editing on consumer GPUs, built on three core designs: (1) A Hybrid Diffusion Transformer architecture introduces softmax attention in a subset of blocks to improve local modeling capability while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring long paired edited videos. (3) Efficient System Co-design combines fused GDN kernels with Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell architecture (RTX 5090). Guided by profiling of real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time editing at 1280×704 resolution and 24 FPS end to end on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing state-of-the-art methods in both temporal coherence and system throughput.
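
The two throughput figures above can be converted into a rough per-frame time budget. The sketch below uses only the 24 FPS and 58 FPS numbers from the abstract. It assumes, purely for illustration, that both are amortized per-frame rates and that the DiT core and the remaining pipeline stages run back to back without overlap. The abstract does not say what the remaining stages are or whether they are pipelined, so the "other stages" figure is an estimate under that assumption, not a reported measurement. The helper `frame_time_ms` is a hypothetical name introduced here.

```python
def frame_time_ms(fps: float) -> float:
    """Average time per frame in milliseconds at a given frame rate."""
    return 1000.0 / fps


# Figures quoted in the abstract.
END_TO_END_FPS = 24.0
DIT_CORE_FPS = 58.0

# Per-frame time available to the whole pipeline at 24 FPS.
budget_ms = frame_time_ms(END_TO_END_FPS)

# Per-frame time spent in the DiT core at 58 FPS.
dit_ms = frame_time_ms(DIT_CORE_FPS)

# Assumption: stages are sequential and do not overlap, so whatever the
# DiT core does not use is left for everything else in the pipeline.
other_ms = budget_ms - dit_ms
dit_share = dit_ms / budget_ms

print(f"end-to-end budget per frame: {budget_ms:.1f} ms")
print(f"DiT core per frame:          {dit_ms:.1f} ms")
print(f"left for other stages:       {other_ms:.1f} ms")
print(f"DiT share of the budget:     {dit_share:.1%}")
```

Under these assumptions the DiT core accounts for roughly 17 ms of a roughly 42 ms frame budget, which suggests that a substantial part of the end-to-end time lies outside the DiT itself. If the stages overlap in practice, the split would differ.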