SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

June 10, 2025
Authors: Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
cs.AI

Abstract

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gate, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy on the AIME benchmark with a 4K token budget under large sparse-attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
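As a rough illustration of the decoding path described in the abstract, the sketch below shows one way gated block-sparse decoding can be organized: pooled key blocks are scored by a small gate, the highest-scoring blocks within the token budget are kept, and attention is computed only over those blocks. This is a conceptual sketch, not the paper's implementation: the function name, the gate form (a single linear projection), mean pooling of keys, and all shapes are assumptions; the actual gating architecture and the optimized TileLang kernel live in the repository linked above.

```python
# Minimal, illustrative sketch (assumed API and shapes, not the official
# SeerAttention-R code) of gated block-sparse attention for one decoding step.
import torch
import torch.nn.functional as F


def block_sparse_decode_attention(q, k_cache, v_cache, gate_proj,
                                  block_size=64, token_budget=4096):
    """q: (heads, head_dim); k_cache, v_cache: (heads, seq_len, head_dim);
    gate_proj: projects the query into the space used to score key blocks."""
    heads, seq_len, head_dim = k_cache.shape
    num_blocks = (seq_len + block_size - 1) // block_size
    keep_blocks = max(1, min(num_blocks, token_budget // block_size))

    # Pad the KV cache to a whole number of blocks, then mean-pool keys per
    # block to obtain one representative vector per block.
    pad = num_blocks * block_size - seq_len
    k_pad = F.pad(k_cache, (0, 0, 0, pad))
    v_pad = F.pad(v_cache, (0, 0, 0, pad))
    k_blk = k_pad.view(heads, num_blocks, block_size, head_dim)
    v_blk = v_pad.view(heads, num_blocks, block_size, head_dim)
    k_pooled = k_blk.mean(dim=2)                  # (heads, num_blocks, head_dim)

    # Lightweight gate: score every key block for the current query and keep
    # the top-scoring blocks that fit in the token budget.
    gate_q = gate_proj(q)                         # (heads, head_dim)
    block_scores = torch.einsum('hd,hbd->hb', gate_q, k_pooled)
    top_idx = block_scores.topk(keep_blocks, dim=-1).indices

    # Gather the selected blocks and run dense attention over them only
    # (padding inside the last block is ignored here for brevity).
    idx = top_idx[:, :, None, None].expand(-1, -1, block_size, head_dim)
    k_sel = torch.gather(k_blk, 1, idx).reshape(heads, -1, head_dim)
    v_sel = torch.gather(v_blk, 1, idx).reshape(heads, -1, head_dim)
    attn = torch.softmax(
        torch.einsum('hd,hnd->hn', q, k_sel) / head_dim ** 0.5, dim=-1)
    return torch.einsum('hn,hnd->hd', attn, v_sel)


# Hypothetical usage: 32 heads, head_dim 128, 8K cached tokens.
# gate = torch.nn.Linear(128, 128, bias=False)
# q = torch.randn(32, 128)
# k = torch.randn(32, 8192, 128); v = torch.randn(32, 8192, 128)
# out = block_sparse_decode_attention(q, k, v, gate)   # (32, 128)
```

Per the abstract, the gate in SeerAttention-R itself is trained by self-distillation against the frozen model's own attention, so the pretrained weights stay untouched and the gate acts as a plug-in.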