OSP-Next: Efficient, High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Abstract

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits their efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture in which the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along the spatial dimensions, exploiting locality while remaining natively compatible with FlashAttention kernels. Building on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.
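
To make the fixed-pattern idea concrete, the following is a minimal Python sketch of a token-wise skip rearrangement on a 2D token grid. It is an illustration under stated assumptions, not the Skiparse-2D implementation: the abstract does not specify the exact pattern, so the grouping rule `(row % k, col % k)`, the sparsity factor `k`, and the function names `skip_groups_2d`, `attention_pairs`, and `comm_reduction` are all hypothetical, and the group-wise variant is not shown. The sketch also works through one reading of the 75% figure: Ulysses SP is commonly described as issuing four All-to-All exchanges per attention layer (query, key, value, and output), so a single exchange of the same size would move one quarter of the data. That reading is an assumption here.

```python
from collections import defaultdict


def skip_groups_2d(height, width, k):
    """Group flattened token indices of an H x W grid by (row % k, col % k).

    Hypothetical token-wise 2D skip pattern: each group is a dense
    subsequence, so a standard dense attention kernel can process it.
    """
    groups = defaultdict(list)
    for r in range(height):
        for c in range(width):
            groups[(r % k, c % k)].append(r * width + c)
    return dict(groups)


def attention_pairs(groups):
    """Count query-key pairs when tokens attend only within their own group."""
    return sum(len(g) ** 2 for g in groups.values())


def comm_reduction(sparse_calls, baseline_calls):
    """Fractional reduction in All-to-All volume, assuming equal-size messages."""
    return 1 - sparse_calls / baseline_calls


if __name__ == "__main__":
    h, w, k = 4, 6, 2
    groups = skip_groups_2d(h, w, k)
    full_pairs = (h * w) ** 2              # full attention: every token pair
    sparse_pairs = attention_pairs(groups)  # attention restricted to groups
    print(groups[(0, 0)])                  # [0, 2, 4, 12, 14, 16]
    print(sparse_pairs / full_pairs)       # 0.25, i.e. 1 / k**2
    print(comm_reduction(1, 4))            # 0.75
```

In this toy setting with `k = 2`, the 24 tokens split into four groups of six, and the number of query-key pairs falls from 576 to 144. Because each group is an ordinary dense subsequence after rearrangement, a fixed pattern of this kind needs no custom sparse kernel.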