The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

Abstract

Linear attentions have shown potential for improving Transformer efficiency by reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) "finetuned-conversion" of task-specific Transformers into linear versions that recover task performance, and (3) "pretrained-conversion" of Transformers such as large language models into linear versions that can be finetuned on downstream tasks. However, linear attentions often underperform standard softmax attention in quality. To close this performance gap, we first find that prior linear attentions lack key properties of softmax attention tied to good performance: low-entropy (or "spiky") weights and dot-product monotonicity. We further observe surprisingly simple feature maps that retain these properties and match softmax performance, but are inefficient to compute in linear attention. We thus propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity. Hedgehog uses simple trainable MLPs to produce attention weights that mimic softmax attention. Experiments show that Hedgehog recovers over 99% of standard Transformer quality in train-from-scratch and finetuned-conversion settings, outperforming prior linear attentions by up to 6 perplexity points on WikiText-103 with causal GPTs, and by up to 8.7 GLUE score points on finetuned bidirectional BERTs. Hedgehog also enables pretrained-conversion. Converting a pretrained GPT-2 into a linear attention variant achieves a state-of-the-art 16.7 perplexity on WikiText-103 for 125M subquadratic decoder models. Finally, we turn a pretrained Llama-2 7B into a viable linear attention Llama. With low-rank adaptation, Hedgehog-Llama2 7B scores 28.1 ROUGE-1 points higher than the base standard attention model, whereas prior linear attentions lead to 16.5-point drops.
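The abstract gives no implementation details, so the following is only a minimal NumPy sketch of the general idea it describes: a feature map applied to queries and keys lets the attention product be reordered so that no n-by-n weight matrix is ever formed. The feature map here (one linear layer followed by an elementwise exponential) and all names such as `feature_map`, `w`, and `b` are illustrative assumptions, not the architecture used in the paper. The sketch covers only the bidirectional (non-causal) case and uses random, untrained weights.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard attention: materializes the full (n, n) weight matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def feature_map(x, w, b):
    """Hypothetical trainable feature map (an assumption for illustration):
    one linear layer, then exp, so every feature is strictly positive."""
    return np.exp(x @ w + b)

def linear_attention_quadratic(q, k, v, w, b):
    """Reference form: builds (n, n) weights from phi(q) . phi(k)."""
    phi_q, phi_k = feature_map(q, w, b), feature_map(k, w, b)
    sims = phi_q @ phi_k.T
    weights = sims / sims.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def linear_attention(q, k, v, w, b):
    """Same output, with the matmuls reordered so cost is linear in n."""
    phi_q, phi_k = feature_map(q, w, b), feature_map(k, w, b)
    kv = phi_k.T @ v        # (f, d): summary of all keys and values
    z = phi_k.sum(axis=0)   # (f,): normalizer terms
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def mean_entropy(weights):
    """Average entropy of the attention rows; lower means "spikier"."""
    return float(-(weights * np.log(weights + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
n, d, f = 8, 4, 4  # sequence length, head dimension, feature dimension
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
w, b = 0.5 * rng.standard_normal((d, f)), np.zeros(f)  # untrained weights

out_quadratic, lin_weights = linear_attention_quadratic(q, k, v, w, b)
out_linear = linear_attention(q, k, v, w, b)
_, soft_weights = softmax_attention(q, k, v)

print("forms agree:", np.allclose(out_quadratic, out_linear))
print("softmax entropy:", mean_entropy(soft_weights))
print("linear entropy:", mean_entropy(lin_weights))
```

The two linear-attention functions return the same values because normalization and the value product both distribute over the sum across keys. The entropy helper is one way to measure the "spiky" property mentioned above. In a trained setting the parameters `w` and `b` would presumably be fit so that the linear weights track the softmax weights, which this sketch does not attempt.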
