SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
June 10, 2025
Authors: Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
cs.AI
Abstract
We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gate, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
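
To make the described mechanism concrete, here is a minimal PyTorch sketch of the idea in the abstract: a lightweight gate scores pooled key blocks against the current decode query (keys are pooled per block; there is no query pooling, since decoding handles one query token at a time), and attention is restricted to the top-scoring blocks within a token budget. The names (`BlockGate`, `sparse_decode_step`), tensor layouts, mean pooling, and the dense masked-attention fallback are illustrative assumptions, not the paper's implementation; the optimized TileLang kernel and official code live in the repository above.

```python
# Hypothetical sketch of block-level gated sparse decoding (not the official
# SeerAttention-R code). A small learned gate scores each pooled K block for
# the current decode query; only the top blocks within a token budget are
# attended to. Shapes and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlockGate(nn.Module):
    """Scores each key block for a single decode query (illustrative layout)."""

    def __init__(self, head_dim: int, gate_dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(head_dim, gate_dim, bias=False)
        self.k_proj = nn.Linear(head_dim, gate_dim, bias=False)

    def forward(self, q: torch.Tensor, k_blocks: torch.Tensor) -> torch.Tensor:
        # q:        [B, H, D]             current decode query
        # k_blocks: [B, H, n_blocks, D]   per-block pooled keys
        gq = self.q_proj(q).unsqueeze(2)                 # [B, H, 1, gate_dim]
        gk = self.k_proj(k_blocks)                       # [B, H, n_blocks, gate_dim]
        return (gq * gk).sum(-1) / gq.shape[-1] ** 0.5   # [B, H, n_blocks] block scores


def sparse_decode_step(q, k_cache, v_cache, gate, block_size=64, token_budget=4096):
    """One decoding step with block-sparse attention (dense reference, no kernel).

    q:       [B, H, D]      query of the newly generated token
    k_cache: [B, H, T, D]   cached keys
    v_cache: [B, H, T, D]   cached values
    """
    B, H, T, D = k_cache.shape
    n_blocks = (T + block_size - 1) // block_size
    budget_blocks = max(1, min(n_blocks, token_budget // block_size))

    # Mean-pool keys within each block (pad the tail block if T is not a multiple).
    pad = n_blocks * block_size - T
    k_padded = F.pad(k_cache, (0, 0, 0, pad))
    k_blocks = k_padded.reshape(B, H, n_blocks, block_size, D).mean(dim=3)

    # The gate selects the top-k blocks under the token budget.
    scores = gate(q, k_blocks)                               # [B, H, n_blocks]
    top_blocks = scores.topk(budget_blocks, dim=-1).indices  # [B, H, budget_blocks]

    # Expand the selected blocks into a token-level mask and run masked attention.
    block_ids = torch.arange(T, device=q.device) // block_size                      # [T]
    keep = (block_ids.view(1, 1, 1, T) == top_blocks.unsqueeze(-1)).any(dim=2)      # [B, H, T]

    attn = torch.einsum("bhd,bhtd->bht", q, k_cache) / D ** 0.5
    attn = attn.masked_fill(~keep, float("-inf"))
    probs = attn.softmax(dim=-1)
    return torch.einsum("bht,bhtd->bhd", probs, v_cache)


if __name__ == "__main__":
    B, H, T, D = 1, 4, 1024, 128
    gate = BlockGate(head_dim=D)
    out = sparse_decode_step(
        torch.randn(B, H, D),
        torch.randn(B, H, T, D),
        torch.randn(B, H, T, D),
        gate,
        token_budget=256,  # attend to only 4 of 16 blocks in this toy setup
    )
    print(out.shape)  # torch.Size([1, 4, 128])
```

In this sketch the gate is the only trainable component, which mirrors the plug-in nature described in the abstract: the base model's weights are untouched, and the gate could in principle be trained by distilling block-level importance from the dense attention distribution.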