OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Abstract

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits their efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture in which the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along the spatial dimensions, exploiting locality while remaining natively compatible with FlashAttention kernels. Building on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches between sparse patterns with a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization, which enables stable joint training of 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.
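As a rough illustration of the cost arguments above, the following sketch counts query-key pairs for a one-dimensional token-wise skip pattern and reproduces the 75% figure under one possible accounting. It is not the OSP-Next implementation: the sparse ratio k, the one-dimensional layout, and all function names are hypothetical, and the abstract does not specify the actual Skiparse-2D layout or SSP schedule. The communication count assumes that Ulysses-style SP issues four equal-sized All-to-All calls per attention layer (for Q, K, V, and the output), against the single call stated for SSP.

```python
from __future__ import annotations

# Illustrative sketch only: a 1-D token-wise skip pattern with a hypothetical
# sparse ratio k. The actual Skiparse-2D layout is not given in the abstract.


def token_skip_groups(num_tokens: int, k: int) -> list[list[int]]:
    """Split token indices into k subsequences: group r holds r, r+k, r+2k, ..."""
    if num_tokens % k != 0:
        raise ValueError("num_tokens must be divisible by k in this sketch")
    return [list(range(r, num_tokens, k)) for r in range(k)]


def attention_pairs_full(num_tokens: int) -> int:
    """Query-key pairs scored by full attention (quadratic in sequence length)."""
    return num_tokens * num_tokens


def attention_pairs_skip(num_tokens: int, k: int) -> int:
    """Query-key pairs when dense attention runs inside each subsequence only."""
    return sum(len(group) ** 2 for group in token_skip_groups(num_tokens, k))


def all_to_all_reduction(baseline_calls: int, sparse_calls: int) -> float:
    """Fractional drop in communication volume, assuming equal-sized calls."""
    return 1.0 - sparse_calls / baseline_calls


if __name__ == "__main__":
    n, k = 16, 4
    print(token_skip_groups(n, k)[0])   # [0, 4, 8, 12]
    print(attention_pairs_full(n))      # 256
    print(attention_pairs_skip(n, k))   # 64
    # Assumption: 4 All-to-All calls (Q, K, V, output) versus 1.
    print(all_to_all_reduction(4, 1))   # 0.75
```

Because each subsequence is itself a contiguous dense sequence after rearrangement, a standard dense attention kernel can process the k subsequences as a batch, which is one way a fixed skip pattern can remain compatible with FlashAttention-style kernels.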