OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Abstract

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits their efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture in which the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along the spatial dimensions, exploiting locality while remaining natively compatible with FlashAttention kernels. Building on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches between sparse patterns with a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization, which enables stable joint training of 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.
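The abstract does not spell out the exact Skiparse-2D pattern, so the following is only a minimal sketch of a generic fixed-stride ("skip") sparse pattern in one dimension, not the authors' implementation. It illustrates why a fixed pattern can reduce the quadratic cost while staying compatible with dense attention kernels: once tokens are rearranged into interleaved subsequences, each subsequence is an ordinary dense attention problem. The function names and the `stride` parameter are hypothetical and chosen for illustration.

```python
def strided_groups(seq_len, stride):
    """Split token indices 0..seq_len-1 into `stride` interleaved subsequences.

    Assumption: a plain 1D stride pattern, used here only as a stand-in
    for a fixed-pattern sparse attention layout.
    """
    return [list(range(start, seq_len, stride)) for start in range(stride)]


def attention_pairs_full(seq_len):
    """Number of query-key pairs scored by full attention."""
    return seq_len * seq_len


def attention_pairs_sparse(seq_len, stride):
    """Number of query-key pairs when each token attends only within its group."""
    return sum(len(group) ** 2 for group in strided_groups(seq_len, stride))


groups = strided_groups(8, 2)
full_pairs = attention_pairs_full(8)
sparse_pairs = attention_pairs_sparse(8, 2)

print(groups)                     # interleaved index groups
print(full_pairs, sparse_pairs)   # pair counts for full vs. strided attention
```

With a stride of 2, each token attends to half of the sequence, so the number of scored pairs is halved. Because every group is a self-contained subsequence, such groups can in principle also be distributed across ranks, which is the kind of structure a sparse-aware sequence-parallel scheme can exploit.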