Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Abstract

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, is challenging because self-attention has quadratic computational complexity and substantial key-value (KV) memory requirements. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and a constant memory footprint during generation. Experimental results show that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.
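To make the idea of attending to a constant number of KV pairs concrete, the following is a minimal NumPy sketch under stated assumptions. It uses a plain hard top-k selection, which is not differentiable, so it is not the SPARSEK operator itself, whose definition is not given in this abstract. The function name `topk_sparse_attention` and the `scores` array (a stand-in for the output of a scoring network) are hypothetical and chosen only for illustration.

```python
import numpy as np

def topk_sparse_attention(q, K, V, scores, k):
    """Attend from one query to only the k highest-scoring KV pairs.

    q: (d,) query, K: (n, d) keys, V: (n, d_v) values,
    scores: (n,) importance scores (assumed to come from a scoring network).
    Illustrative only: hard top-k, not the differentiable SPARSEK operator.
    """
    k = min(k, K.shape[0])
    # Indices of the k largest scores, sorted to preserve sequence order.
    idx = np.sort(np.argpartition(scores, -k)[-k:])
    K_sel, V_sel = K[idx], V[idx]
    # Scaled dot-product attention restricted to the selected pairs.
    logits = K_sel @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V_sel, idx

rng = np.random.default_rng(0)
n, d, k = 16, 8, 4
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
scores = rng.standard_normal(n)  # placeholder scores, not learned

out, idx = topk_sparse_attention(q, K, V, scores, k)
print(out.shape, idx)
```

In this sketch the attention step for a query touches only k keys and values regardless of n, which is the property the abstract attributes to selecting a constant number of KV pairs. Making the selection trainable by gradient descent is the part a differentiable operator would have to supply.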
