∇NABLA: Neighborhood Adaptive Block-Level Attention

Abstract

Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By combining block-wise attention with an adaptive, sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch's Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference than the baseline, with almost no degradation in quantitative metrics (CLIP score, VBench score, human evaluation score) or visual quality. The code and model weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA
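To make the idea of block-wise attention with an adaptive threshold more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not spell out the thresholding rule, so the sketch assumes one plausible variant in which tokens are mean-pooled into blocks, a block-level softmax attention map is computed, and each query block keeps the highest-scoring key blocks until their cumulative probability reaches a chosen mass. The names `mean_pool`, `adaptive_block_mask`, and `keep_mass` are hypothetical and chosen only for illustration.

```python
import math


def mean_pool(tokens, block_size):
    """Average each run of `block_size` token vectors into one block vector."""
    pooled = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(vec[d] for vec in block) / len(block) for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_block_mask(queries, keys, block_size, keep_mass=0.9):
    """Return mask[i][j] == True when query block i should attend to key block j.

    Assumed rule (not taken from the abstract): per query block, keep the most
    probable key blocks until their cumulative softmax mass reaches `keep_mass`.
    The number of kept blocks therefore adapts to how peaked each row is.
    """
    q_blocks = mean_pool(queries, block_size)
    k_blocks = mean_pool(keys, block_size)
    scale = 1.0 / math.sqrt(len(q_blocks[0]))
    mask = []
    for q in q_blocks:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blocks]
        probs = softmax(scores)
        order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
        row = [False] * len(probs)
        covered = 0.0
        for j in order:
            row[j] = True
            covered += probs[j]
            if covered >= keep_mass:
                break
        mask.append(row)
    return mask


# Toy input: 8 tokens of dimension 4, two tokens per block, where every block
# points along its own axis, so each query block matches exactly one key block.
tokens = []
for axis in range(4):
    vec = [4.0 if d == axis else 0.0 for d in range(4)]
    tokens.extend([vec, list(vec)])

mask = adaptive_block_mask(tokens, tokens, block_size=2, keep_mass=0.9)
for row in mask:
    print(row)  # only the diagonal block is kept in each row
```

In this toy case each row's attention is sharply peaked, so a single key block already covers the requested mass and the mask is strongly sparse. Flatter rows would keep more blocks under the same `keep_mass`. A block-level mask of this kind is the sort of input a block-sparse attention kernel could consume; wiring it into PyTorch's Flex Attention is not shown here.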
