SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

June 10, 2025
Authors: Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
cs.AI

Abstract

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gate, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy on the AIME benchmark with a 4K token budget under large sparse-attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
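As a rough illustration of the decoding path described in the abstract, the sketch below shows one way gated block-sparse decoding can be organized: pooled key blocks are scored by a small gate, the highest-scoring blocks within the token budget are kept, and attention is computed only over those blocks. This is a conceptual sketch, not the paper's implementation: the function name, the gate form (a single linear projection), mean pooling of keys, and all shapes are assumptions; the actual gating architecture and the optimized TileLang kernel live in the repository linked above.

```python
# Minimal, illustrative sketch (assumed API and shapes, not the official
# SeerAttention-R code) of gated block-sparse attention for one decoding step.
import torch
import torch.nn.functional as F


def block_sparse_decode_attention(q, k_cache, v_cache, gate_proj,
                                  block_size=64, token_budget=4096):
    """q: (heads, head_dim); k_cache, v_cache: (heads, seq_len, head_dim);
    gate_proj: projects the query into the space used to score key blocks."""
    heads, seq_len, head_dim = k_cache.shape
    num_blocks = (seq_len + block_size - 1) // block_size
    keep_blocks = max(1, min(num_blocks, token_budget // block_size))

    # Pad the KV cache to a whole number of blocks, then mean-pool keys per
    # block to obtain one representative vector per block.
    pad = num_blocks * block_size - seq_len
    k_pad = F.pad(k_cache, (0, 0, 0, pad))
    v_pad = F.pad(v_cache, (0, 0, 0, pad))
    k_blk = k_pad.view(heads, num_blocks, block_size, head_dim)
    v_blk = v_pad.view(heads, num_blocks, block_size, head_dim)
    k_pooled = k_blk.mean(dim=2)                  # (heads, num_blocks, head_dim)

    # Lightweight gate: score every key block for the current query and keep
    # the top-scoring blocks that fit in the token budget.
    gate_q = gate_proj(q)                         # (heads, head_dim)
    block_scores = torch.einsum('hd,hbd->hb', gate_q, k_pooled)
    top_idx = block_scores.topk(keep_blocks, dim=-1).indices

    # Gather the selected blocks and run dense attention over them only
    # (padding inside the last block is ignored here for brevity).
    idx = top_idx[:, :, None, None].expand(-1, -1, block_size, head_dim)
    k_sel = torch.gather(k_blk, 1, idx).reshape(heads, -1, head_dim)
    v_sel = torch.gather(v_blk, 1, idx).reshape(heads, -1, head_dim)
    attn = torch.softmax(
        torch.einsum('hd,hnd->hn', q, k_sel) / head_dim ** 0.5, dim=-1)
    return torch.einsum('hn,hnd->hd', attn, v_sel)


# Hypothetical usage: 32 heads, head_dim 128, 8K cached tokens.
# gate = torch.nn.Linear(128, 128, bias=False)
# q = torch.randn(32, 128)
# k = torch.randn(32, 8192, 128); v = torch.randn(32, 8192, 128)
# out = block_sparse_decode_attention(q, k, v, gate)   # (32, 128)
```

Per the abstract, the gate in SeerAttention-R itself is trained by self-distillation against the frozen model's own attention, so the pretrained weights stay untouched and the gate acts as a plug-in.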