SANA-Streaming: Real-Time Streaming Video Editing with a Hybrid Diffusion Transformer

Abstract

Real-time streaming video-to-video (V2V) editing is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge because of its stringent requirements on temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs. It rests on three core designs: (1) a Hybrid Diffusion Transformer architecture, which introduces softmax attention in a subset of the blocks to improve local modeling capability while preserving the efficiency of linear layers; (2) Cycle-Reverse Regularization, a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos; and (3) Efficient System Co-design, which combines fused GDN kernels with Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. Guided by profiling of real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time editing at 1280×704 resolution and 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core alone running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing state-of-the-art methods in both temporal coherence and system throughput.
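
To put the two reported throughput figures side by side, the short calculation below converts them into per-frame time budgets. It is only an illustrative sketch: it assumes that both figures are measured per output frame on the same workload and that the DiT core runs serially with the rest of the pipeline, neither of which the abstract states. The function and variable names are hypothetical.

```python
def frame_time_ms(fps: float) -> float:
    """Convert a throughput in frames per second to milliseconds per frame."""
    return 1000.0 / fps


# Figures reported in the abstract (single RTX 5090, 1280x704).
END_TO_END_FPS = 24.0
DIT_CORE_FPS = 58.0

end_to_end_ms = frame_time_ms(END_TO_END_FPS)  # total budget per frame
dit_ms = frame_time_ms(DIT_CORE_FPS)           # time spent in the DiT core

# ASSUMPTION: stages execute serially, so the remainder is the time taken by
# everything outside the DiT core. The abstract does not enumerate those stages.
non_dit_ms = end_to_end_ms - dit_ms
dit_share = dit_ms / end_to_end_ms

print(f"end-to-end: {end_to_end_ms:.2f} ms/frame")
print(f"DiT core:   {dit_ms:.2f} ms/frame ({dit_share:.1%} of the budget)")
print(f"remainder:  {non_dit_ms:.2f} ms/frame")
```

Under these assumptions, the DiT core would account for well under half of the per-frame budget. If the stages overlap in practice, this serial split does not apply.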