OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning
May 27, 2026
Authors: Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin, Bin Zhu, Xinhua Cheng, Li Yuan
cs.AI
Abstract
Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.
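The abstract does not spell out the Skiparse-2D pattern, so the following is only a rough one-dimensional sketch of what token-wise and group-wise skip rearrangement could look like, and of why such a fixed pattern reduces to dense attention on shorter subsequences (which is what keeps it compatible with dense kernels such as FlashAttention). The function names `token_skip`, `group_skip`, `attention_pairs`, and `comm_reduction`, the skip factor `k`, and the 1-D simplification are all assumptions made for illustration. The communication estimate further assumes that Ulysses SP issues four equally sized All-to-All exchanges per attention layer (for Q, K, V, and the output), which is one reading consistent with the stated 75% reduction, not a detail confirmed by the abstract.

```python
def token_skip(seq, k):
    """Token-wise skip (illustrative): subsequence r holds tokens r, r+k, r+2k, ..."""
    return [seq[r::k] for r in range(k)]


def group_skip(seq, k):
    """Group-wise skip (illustrative): cut seq into contiguous groups of k tokens,
    then subsequence r holds groups r, r+k, r+2k, ... concatenated in order."""
    groups = [seq[i:i + k] for i in range(0, len(seq), k)]
    return [[tok for grp in groups[r::k] for tok in grp] for r in range(k)]


def attention_pairs(n, k=1):
    """Query-key pairs scored when n tokens are split into k equal subsequences
    and dense attention runs independently inside each one."""
    return k * (n // k) ** 2


def comm_reduction(baseline_all_to_alls=4, sparse_all_to_alls=1):
    """Fractional saving in All-to-All volume, assuming equally sized exchanges.
    The default of 4 (Q, K, V, output) for the baseline is an assumption."""
    return 1 - sparse_all_to_alls / baseline_all_to_alls


if __name__ == "__main__":
    tokens = list(range(16))
    print(token_skip(tokens, 2))   # two interleaved subsequences
    print(group_skip(tokens, 2))   # two subsequences built from pairs of tokens
    print(attention_pairs(16), attention_pairs(16, 2))
    print(comm_reduction())
```

Under these assumptions each subsequence is an ordinary contiguous tensor after rearrangement, so a dense attention kernel can be applied to it unchanged, and the number of scored pairs drops by a factor of `k`.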