Simple linear attention language models balance the recall-throughput tradeoff

Abstract

Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottlenecked during inference by the KV cache's aggressive memory consumption. In this work, we explore whether we can improve language model efficiency (e.g., by reducing memory consumption) without compromising on recall. By applying experiments and theory to a broad set of architectures, we identify a key tradeoff between a model's state size and its recall ability. We show that efficient alternatives to attention (e.g., H3, Mamba, RWKV) maintain a fixed-size recurrent state but struggle at recall. We propose BASED, a simple architecture combining linear and sliding window attention. By varying BASED's window size and linear attention feature dimension, we can dial the state size and traverse the Pareto frontier of the recall-memory tradeoff curve, recovering the full quality of attention on one end and the small state size of attention alternatives on the other. We train language models up to 1.3B parameters and show that BASED matches the strongest sub-quadratic models (e.g., Mamba) in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points. Implementations of linear attention are often less efficient than optimized standard attention implementations. To make BASED competitive, we develop IO-aware algorithms that enable 24x higher throughput on language generation than FlashAttention-2 when generating 1024 tokens using 1.3B-parameter models. Code for this work is provided at: https://github.com/HazyResearch/based.
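To make the state-size argument concrete, the following is a minimal sketch of causal linear attention written as a recurrence over a fixed-size state, checked against the equivalent quadratic form. It is an illustration under stated assumptions, not the implementation from the linked repository: the feature map (elu(x) + 1) is a common choice picked here only for simplicity, and all function names are hypothetical.

```python
import math


def feature_map(x):
    # Assumption: elu(x) + 1, a simple positive feature map used for illustration.
    # It is not claimed to be the feature map used by BASED.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention_recurrent(qs, ks, vs):
    """Causal linear attention using a state whose size is independent of sequence length."""
    d_feat = len(feature_map(ks[0]))
    d_v = len(vs[0])
    state = [[0.0] * d_v for _ in range(d_feat)]  # running sum of phi(k) v^T
    norm = [0.0] * d_feat                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        fk = feature_map(k)
        for i in range(d_feat):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        fq = feature_map(q)
        denom = sum(fq[i] * norm[i] for i in range(d_feat))
        outputs.append(
            [sum(fq[i] * state[i][j] for i in range(d_feat)) / denom for j in range(d_v)]
        )
    return outputs


def linear_attention_quadratic(qs, ks, vs):
    """Same computation, but revisiting every earlier token at each step."""
    d_v = len(vs[0])
    outputs = []
    for t, q in enumerate(qs):
        fq = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(fq, feature_map(ks[s]))) for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(weights[s] * vs[s][j] for s in range(t + 1)) / total for j in range(d_v)]
        )
    return outputs


def full_kv_cache_entries(seq_len, d_head):
    # Standard attention stores one key and one value vector per past token.
    return 2 * seq_len * d_head


def linear_state_entries(d_feat, d_head):
    # The recurrent state plus its normalizer; does not depend on seq_len.
    return d_feat * d_head + d_feat


def sliding_window_entries(window, d_head):
    # Sliding window attention keeps keys and values for the last `window` tokens only.
    return 2 * window * d_head
```

These counts are per head and ignore batch size, layer count, and numeric precision, so they only indicate how each term scales. The full KV cache grows with the number of tokens seen, while the linear attention state is set by the feature dimension and the sliding window cache by the window size, which are the two knobs the abstract describes for dialing the state size.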
