SANA-Streaming: Real-Time Streaming Video Editing with a Hybrid Diffusion Transformer

Abstract

Real-time streaming video-to-video (V2V) editing is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge because of its stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, built on three core designs: (1) A Hybrid Diffusion Transformer architecture that introduces softmax attention in a subset of the blocks to improve local modeling capability while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization, a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) An Efficient System Co-design that combines fused GDN kernels with Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. Guided by profiling of real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time editing at 1280×704 resolution and 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing state-of-the-art methods in both temporal consistency and system throughput.
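
To put the two reported frame rates in perspective, the short calculation below converts them into per-frame time budgets. It is only an illustrative sketch: it assumes the pipeline stages run sequentially for each frame, which the abstract does not state, and the names `frame_time_ms`, `other_ms`, and `dit_share` are introduced here for illustration only.

```python
def frame_time_ms(fps: float) -> float:
    """Convert a frame rate into the time available per frame, in milliseconds."""
    return 1000.0 / fps


# Figures reported in the abstract.
END_TO_END_FPS = 24.0  # full editing pipeline on a single RTX 5090
DIT_FPS = 58.0         # DiT core only

total_ms = frame_time_ms(END_TO_END_FPS)
dit_ms = frame_time_ms(DIT_FPS)

# Assumption: stages run back to back per frame, so whatever the DiT does not
# use is left for everything else in the pipeline.
other_ms = total_ms - dit_ms
dit_share = dit_ms / total_ms

print(f"end-to-end budget per frame: {total_ms:.1f} ms")
print(f"DiT core per frame:          {dit_ms:.1f} ms")
print(f"remaining for other stages:  {other_ms:.1f} ms")
print(f"DiT share of the budget:     {dit_share:.0%}")
```

Under that sequential assumption, the DiT core takes roughly 17.2 ms of a 41.7 ms frame budget, leaving about 24.4 ms for the rest of the pipeline. If the stages overlap in practice, the split would differ.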