∇NABLA: Neighborhood Adaptive Block-Level Attention

Abstract

Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By leveraging block-wise attention with an adaptive, sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch's Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to the baseline, with almost no loss in quantitative metrics (CLIP score, VBench score, human evaluation score) or visual quality. The code and model weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA
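To make the idea of block-wise attention with an adaptive threshold more concrete, the following is a minimal, dependency-free sketch rather than the authors' implementation. The abstract does not say how the threshold is computed, so the rule used here is an assumption: for each query block, key blocks are kept in order of decreasing coarse attention weight until a target share of the attention mass (`keep_mass`) is covered. All function and parameter names are hypothetical, and a practical version would operate on tensors and pass the resulting block mask to a block-sparse attention operator.

```python
import math


def mean_pool(rows, block):
    """Average each consecutive group of `block` token vectors into one vector."""
    pooled = []
    for start in range(0, len(rows), block):
        chunk = rows[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_block_mask(q, k, block, keep_mass):
    """Return a boolean mask of shape [num_q_blocks][num_k_blocks].

    Assumed rule (not taken from the abstract): per query block, keep the
    highest-weight key blocks until their cumulative coarse attention
    weight reaches `keep_mass`. The number of kept blocks therefore adapts
    to how concentrated each row of the coarse attention map is.
    """
    qb, kb = mean_pool(q, block), mean_pool(k, block)
    scale = 1.0 / math.sqrt(len(q[0]))
    mask = []
    for qv in qb:
        # Coarse attention: scaled dot products between pooled block vectors.
        scores = [scale * sum(a * b for a, b in zip(qv, kv)) for kv in kb]
        probs = softmax(scores)
        order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
        row = [False] * len(probs)
        mass = 0.0
        for j in order:
            row[j] = True  # at least one key block is always kept
            mass += probs[j]
            if mass >= keep_mass:
                break
        mask.append(row)
    return mask


def density(mask):
    """Fraction of block pairs that would still be computed."""
    cells = [c for row in mask for c in row]
    return sum(cells) / len(cells)
```

In this sketch, lowering `keep_mass` drops more block pairs and so reduces the attention work, while `keep_mass=1.0` keeps every block pair on well-separated inputs and falls back to full attention. Only block pairs marked `True` would need to be computed at token level.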
