SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

Abstract

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With its lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying their original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
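
To make the numbers in the abstract concrete, the following is a minimal sketch of how a token budget and a block size could translate into block selection during a single decoding step, and of why roughly 10x is the ceiling at 90% sparsity. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, "4K" is taken to mean 4096 tokens, and the gate here is a plain dot product against mean-pooled keys, standing in for the learned, self-distilled gate the abstract describes. The only design point taken from the abstract is that the query is used per token (unpooled) while keys are grouped into blocks.

```python
import math


def blocks_in_budget(token_budget, block_size):
    """Number of whole KV blocks that fit in a token budget."""
    return token_budget // block_size


def ideal_speedup(sparsity):
    """Upper bound on speedup if cost scales with the fraction of blocks kept."""
    return 1.0 / (1.0 - sparsity)


def select_blocks(query, keys, block_size, token_budget):
    """Choose which KV blocks one decoding step attends to.

    Hypothetical stand-in gate: score each block by the dot product of the
    unpooled query with the mean of that block's keys. The real gate is
    learned; mean pooling is an assumption made for illustration only.
    """
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        # Mean-pool the keys of this block along the token axis.
        pooled = [sum(col) / len(block) for col in zip(*block)]
        scores.append(sum(q * k for q, k in zip(query, pooled)))
    # Keep as many top-scoring blocks as the token budget allows.
    keep = min(blocks_in_budget(token_budget, block_size), n_blocks)
    top = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)[:keep]
    return sorted(top)


if __name__ == "__main__":
    print(blocks_in_budget(4096, 64))   # blocks kept at block size 64
    print(blocks_in_budget(4096, 128))  # blocks kept at block size 128
    print(ideal_speedup(0.9))           # about 10x
```

Under these assumptions, a 4096-token budget keeps 64 blocks of size 64 or 32 blocks of size 128, and the ideal speedup at 90% sparsity is about 10x, which is the bound the reported 9x kernel speedup is measured against.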
