MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
October 21, 2025
Authors: Weinan Jia, Yuning Lu, Mengqi Huang, Hualiang Wang, Binyuan Huang, Nan Chen, Mu Liu, Jidong Jiang, Zhendong Mao
cs.AI
Abstract
Long video generation with Diffusion Transformers (DiTs) is bottlenecked by
the quadratic scaling of full attention with sequence length. Since attention
is highly redundant, outputs are dominated by a small subset of query-key
pairs. Existing sparse methods rely on blockwise coarse estimation, whose
accuracy-efficiency trade-offs are constrained by block size. This paper
introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention
mechanism that uses a lightweight, learnable token router to precisely match
tokens
without blockwise estimation. Through semantic-aware routing, MoGA enables
effective long-range interactions. As a kernel-free method, MoGA integrates
seamlessly with modern attention stacks, including FlashAttention and sequence
parallelism. Building on MoGA, we develop an efficient long video generation
model that produces minute-long, multi-shot, 480p videos at 24 fps end to end,
with a context length of approximately 580k tokens. Comprehensive experiments on
various video generation tasks validate the effectiveness of our approach.
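
The abstract specifies the mechanism only at a high level. As a rough illustration of the core idea — route tokens into groups with a learnable router, then run ordinary dense attention within each group — the following PyTorch sketch may help. The names (MoGASketch, num_groups) and the hard argmax routing are assumptions made for illustration, not the paper's actual design, which must also handle multi-head attention, differentiable routing, load balancing, and sequence parallelism.

```python
# Minimal, hypothetical sketch of mixture-of-groups attention in PyTorch.
# Illustrative assumptions only: batching, multi-head attention, trainable
# routing, and sequence parallelism are omitted for clarity.
import torch
import torch.nn.functional as F


class MoGASketch(torch.nn.Module):
    def __init__(self, dim: int, num_groups: int):
        super().__init__()
        self.num_groups = num_groups
        # Lightweight learnable router: one linear layer scoring each token
        # against each group.
        self.router = torch.nn.Linear(dim, num_groups)
        self.qkv = torch.nn.Linear(dim, 3 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim). Hard argmax routing is used for simplicity; a
        # trainable router would need a differentiable signal (e.g., scaling
        # outputs by router probabilities, as in mixture-of-experts).
        group_id = self.router(x).argmax(dim=-1)  # (seq_len,)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        out = torch.zeros_like(x)
        for g in range(self.num_groups):
            idx = (group_id == g).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            # Dense attention restricted to the tokens routed to group g.
            # Kernel-free: this is ordinary scaled dot-product attention, so
            # optimized backends such as FlashAttention apply unchanged.
            attn = F.scaled_dot_product_attention(
                q[idx].unsqueeze(0), k[idx].unsqueeze(0), v[idx].unsqueeze(0)
            )
            out[idx] = attn.squeeze(0)
        return out
```

Under this sketch, if a sequence of n tokens splits evenly across G groups, each group attends over only n/G tokens, so the total attention cost drops from O(n²) to roughly O(n²/G) — the kind of reduction needed for context lengths around 580k.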