∇NABLA: Neighborhood Adaptive Block-Level Attention
July 17, 2025
Authors: Dmitrii Mikhailov, Aleksey Letunovskiy, Maria Kovaleva, Vladimir Arkhipkin, Vladimir Korviakov, Vladimir Polovnikov, Viacheslav Vasilev, Evelina Sidorova, Denis Dimitrov
cs.AI
Abstract
Recent progress in transformer-based architectures has demonstrated
remarkable success in video generation tasks. However, the quadratic complexity
of full attention mechanisms remains a critical bottleneck, particularly for
high-resolution and long-duration video sequences. In this paper, we propose
NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that
dynamically adapts to sparsity patterns in video diffusion transformers (DiTs).
By leveraging block-wise attention with an adaptive, sparsity-driven threshold,
NABLA reduces computational overhead while preserving generative quality. Our
method does not require custom low-level operator design and can be seamlessly
integrated with PyTorch's Flex Attention operator. Experiments demonstrate that
NABLA achieves up to 2.7x faster training and inference than the full-attention
baseline, with almost no degradation in quantitative metrics (CLIP score,
VBench score, human evaluation score) or visual quality. The code and model
weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA
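
Since the abstract only names the ingredients (block-wise pooling of queries and keys, an adaptive cumulative-mass threshold, and PyTorch's Flex Attention), the following is a minimal sketch of how they could fit together. The block size of 64, the 0.9 mass threshold, and the helper names `build_nabla_block_mask` / `nabla_attention` are illustrative assumptions, not taken from the paper or the released repository.

```python
# A minimal sketch, not the authors' implementation: block-level attention
# with an adaptive cumulative-mass threshold, routed through Flex Attention.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

BLOCK = 64  # assumed block size; the paper's actual value may differ


def build_nabla_block_mask(q, k, threshold=0.9):
    """Pick, per query block, the key blocks that carry most attention mass.

    q, k: [B, H, S, D] with S divisible by BLOCK.
    threshold: fraction of coarse attention mass each query block retains.
    """
    B, H, S, D = q.shape
    nb = S // BLOCK
    # Average-pool queries and keys within each block to get a coarse
    # nb x nb attention map, a cheap proxy for the full S x S map.
    qp = q.view(B, H, nb, BLOCK, D).mean(dim=3)
    kp = k.view(B, H, nb, BLOCK, D).mean(dim=3)
    probs = torch.softmax(qp @ kp.transpose(-1, -2) / D ** 0.5, dim=-1)
    # Sort key blocks by mass and keep the shortest prefix whose
    # cumulative mass reaches `threshold` (the crossing block included).
    vals, idx = probs.sort(dim=-1, descending=True)
    cum = vals.cumsum(dim=-1)
    keep_sorted = (cum - vals) < threshold
    keep = torch.zeros_like(keep_sorted).scatter(-1, idx, keep_sorted)
    return keep  # bool [B, H, nb, nb]


def nabla_attention(q, k, v, threshold=0.9):
    keep = build_nabla_block_mask(q, k, threshold)

    def mask_mod(b, h, q_idx, kv_idx):
        # A token pair is attended iff its block pair was kept above.
        return keep[b, h, q_idx // BLOCK, kv_idx // BLOCK]

    B, H, S, _ = q.shape
    block_mask = create_block_mask(mask_mod, B, H, S, S, device=q.device)
    return flex_attention(q, k, v, block_mask=block_mask)
```

Because the mask is expressed as a `BlockMask`, Flex Attention skips the dropped blocks entirely, so the attention cost scales with the number of retained blocks rather than with the full quadratic sequence length, which is consistent with the speedups the abstract reports.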