OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Abstract

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits their efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture in which the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along the spatial dimensions, exploiting locality while remaining natively compatible with FlashAttention kernels. Building on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.
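
The abstract does not spell out the exact Skiparse-2D pattern. As a rough illustration of why a fixed skip pattern over the spatial dimensions lowers attention cost while leaving each group an ordinary dense subsequence (and therefore usable with standard dense attention kernels), the sketch below assumes a simple stride-k pattern along height and width. The function names `tokenwise_skip_groups` and `attention_pairs`, the 8×8 grid, and k = 2 are hypothetical choices made for this example, and the group-wise variant is not modeled. The count is of query-key pairs only and says nothing about communication volume.

```python
import numpy as np


def tokenwise_skip_groups(h, w, k):
    """Split an h x w token grid into k*k groups by striding with step k
    along both spatial axes (an assumed pattern, not taken from the paper)."""
    idx = np.arange(h * w).reshape(h, w)
    return [idx[i::k, j::k].reshape(-1) for i in range(k) for j in range(k)]


def attention_pairs(groups):
    """Number of query-key pairs when each group attends only within itself."""
    return sum(len(g) ** 2 for g in groups)


h, w, k = 8, 8, 2
groups = tokenwise_skip_groups(h, w, k)

full_pairs = (h * w) ** 2              # full attention over all 64 tokens
sparse_pairs = attention_pairs(groups)  # dense attention inside each group

print(len(groups), full_pairs, sparse_pairs, sparse_pairs / full_pairs)
```

Under these assumptions the groups partition the token set, so rearranging the sequence group by group yields contiguous subsequences, each of which can be handled by a dense attention call.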