SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
June 10, 2025
Authors: Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
cs.AI
Abstract
We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gate, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
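
To make the described mechanism concrete, here is a minimal PyTorch sketch of the idea in the abstract: a lightweight gate scores pooled key blocks against the current decode query (keys are pooled per block; there is no query pooling, since decoding handles one query token at a time), and attention is restricted to the top-scoring blocks within a token budget. The names (`BlockGate`, `sparse_decode_step`), tensor layouts, mean pooling, and the dense masked-attention fallback are illustrative assumptions, not the paper's implementation; the optimized TileLang kernel and official code live in the repository above.

```python
# Hypothetical sketch of block-level gated sparse decoding (not the official
# SeerAttention-R code). A small learned gate scores each pooled K block for
# the current decode query; only the top blocks within a token budget are
# attended to. Shapes and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlockGate(nn.Module):
    """Scores each key block for a single decode query (illustrative layout)."""

    def __init__(self, head_dim: int, gate_dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(head_dim, gate_dim, bias=False)
        self.k_proj = nn.Linear(head_dim, gate_dim, bias=False)

    def forward(self, q: torch.Tensor, k_blocks: torch.Tensor) -> torch.Tensor:
        # q:        [B, H, D]             current decode query
        # k_blocks: [B, H, n_blocks, D]   per-block pooled keys
        gq = self.q_proj(q).unsqueeze(2)                 # [B, H, 1, gate_dim]
        gk = self.k_proj(k_blocks)                       # [B, H, n_blocks, gate_dim]
        return (gq * gk).sum(-1) / gq.shape[-1] ** 0.5   # [B, H, n_blocks] block scores


def sparse_decode_step(q, k_cache, v_cache, gate, block_size=64, token_budget=4096):
    """One decoding step with block-sparse attention (dense reference, no kernel).

    q:       [B, H, D]      query of the newly generated token
    k_cache: [B, H, T, D]   cached keys
    v_cache: [B, H, T, D]   cached values
    """
    B, H, T, D = k_cache.shape
    n_blocks = (T + block_size - 1) // block_size
    budget_blocks = max(1, min(n_blocks, token_budget // block_size))

    # Mean-pool keys within each block (pad the tail block if T is not a multiple).
    pad = n_blocks * block_size - T
    k_padded = F.pad(k_cache, (0, 0, 0, pad))
    k_blocks = k_padded.reshape(B, H, n_blocks, block_size, D).mean(dim=3)

    # The gate selects the top-k blocks under the token budget.
    scores = gate(q, k_blocks)                               # [B, H, n_blocks]
    top_blocks = scores.topk(budget_blocks, dim=-1).indices  # [B, H, budget_blocks]

    # Expand the selected blocks into a token-level mask and run masked attention.
    block_ids = torch.arange(T, device=q.device) // block_size                      # [T]
    keep = (block_ids.view(1, 1, 1, T) == top_blocks.unsqueeze(-1)).any(dim=2)      # [B, H, T]

    attn = torch.einsum("bhd,bhtd->bht", q, k_cache) / D ** 0.5
    attn = attn.masked_fill(~keep, float("-inf"))
    probs = attn.softmax(dim=-1)
    return torch.einsum("bht,bhtd->bhd", probs, v_cache)


if __name__ == "__main__":
    B, H, T, D = 1, 4, 1024, 128
    gate = BlockGate(head_dim=D)
    out = sparse_decode_step(
        torch.randn(B, H, D),
        torch.randn(B, H, T, D),
        torch.randn(B, H, T, D),
        gate,
        token_budget=256,  # attend to only 4 of 16 blocks in this toy setup
    )
    print(out.shape)  # torch.Size([1, 4, 128])
```

In this sketch the gate is the only trainable component, which mirrors the plug-in nature described in the abstract: the base model's weights are untouched, and the gate could in principle be trained by distilling block-level importance from the dense attention distribution.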