Rectified Sparse Attention

Abstract

Efficient long-sequence generation is a critical challenge for Large Language Models. Recent sparse decoding methods improve efficiency, but they suffer from KV cache misalignment: approximation errors accumulate in the cache and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to a 2.42× end-to-end speedup when decoding at a 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at https://aka.ms/ReSA-LM.
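The decoding schedule described above can be illustrated with a toy sketch. It is not the released implementation: it only tracks whether each KV cache entry was written by a sparse step or refreshed by a dense pass, and the names `ToyKVCache`, `rectify_tail`, `decode`, and `rectify_every` are hypothetical. Under the assumption that a dense pass rewrites exactly the entries produced since the previous rectification, it shows why the number of approximate entries never exceeds the rectification interval.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Tracks only whether each cached entry is exact (dense) or approximate (sparse)."""
    exact: list = field(default_factory=list)  # True = written by a dense pass

    def append_sparse(self):
        # A block-sparse decoding step appends an approximate KV entry.
        self.exact.append(False)

    def rectify_tail(self, n):
        # Hypothetical stand-in for a dense forward pass over the last n tokens,
        # which overwrites their approximate KV entries with exact ones.
        for i in range(len(self.exact) - n, len(self.exact)):
            self.exact[i] = True

    def num_approximate(self):
        return sum(1 for e in self.exact if not e)


def decode(num_new_tokens, rectify_every, prompt_len=0):
    """Run the toy loop; return the cache and the peak count of approximate entries."""
    cache = ToyKVCache(exact=[True] * prompt_len)  # assumption: prefill is dense
    peak = 0
    pending = 0  # tokens generated since the last rectification
    for _ in range(num_new_tokens):
        cache.append_sparse()
        pending += 1
        peak = max(peak, cache.num_approximate())
        if pending == rectify_every:
            cache.rectify_tail(pending)  # periodic dense rectification
            pending = 0
    return cache, peak


if __name__ == "__main__":
    cache, peak = decode(num_new_tokens=10, rectify_every=4, prompt_len=3)
    print("peak approximate entries:", peak)
    print("approximate entries at end:", cache.num_approximate())
```

In this toy model the interval is the only knob: a smaller `rectify_every` keeps fewer approximate entries in the cache at the cost of more frequent dense passes, which mirrors the quality versus efficiency trade-off the abstract describes.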