
MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

October 21, 2025
Authors: Weinan Jia, Yuning Lu, Mengqi Huang, Hualiang Wang, Binyuan Huang, Nan Chen, Mu Liu, Jidong Jiang, Zhendong Mao
cs.AI

Abstract

Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention mechanism that uses a lightweight, learnable token router to match tokens precisely without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that produces minute-level, multi-shot, 480p, 24 fps videos end to end, with a context length of approximately 580k tokens. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
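
The abstract describes the mechanism only at a high level. The sketch below illustrates the general idea of group-routed sparse attention in PyTorch; everything in it is an assumption for illustration, not the paper's actual implementation: the function and router names are hypothetical, routing is a hard top-1 assignment via a small learned projection, and multi-head handling, group balancing, and sequence parallelism are omitted.

```python
# A minimal sketch of group-routed sparse attention, inferred from the
# abstract's description of MoGA. The function name, the Linear router,
# and the hard top-1 group assignment are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def grouped_attention_sketch(q, k, v, router, num_groups):
    """q, k, v: (seq_len, dim) single-head tensors.

    router: a learned nn.Linear(dim, num_groups) playing the role of the
    lightweight token router. Each token is assigned to one group, and
    full attention runs only within its group, so the cost is roughly
    (seq_len / num_groups)^2 per group rather than seq_len^2 overall.
    """
    out = torch.zeros_like(v)
    # Semantic-aware routing: assign each token to a group by content.
    group_ids = router(q).argmax(dim=-1)  # (seq_len,)
    for g in range(num_groups):
        idx = (group_ids == g).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        # Dense attention over the gathered subset. Because each group is
        # an ordinary dense attention call, an off-the-shelf kernel such
        # as FlashAttention applies directly, which is what makes the
        # approach "kernel-free" (no custom sparse kernel is needed).
        out[idx] = F.scaled_dot_product_attention(
            q[idx].unsqueeze(0), k[idx].unsqueeze(0), v[idx].unsqueeze(0)
        ).squeeze(0)
    return out


# Hypothetical usage: 1,024 tokens of width 64, routed into 8 groups.
router = nn.Linear(64, 8)
x = torch.randn(1024, 64)
out = grouped_attention_sketch(x, x, x, router, num_groups=8)
```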